Some requests get "stuck" in the app queue over time resulting in 503s..
Jun 05, 2008 01:19 AM | cbarnard
Hi,
We have implemented an IHttpAsyncHandler-based solution to service a large number of incoming requests from web clients. The handler in turn makes HttpWebRequests to various other web servers based on the incoming request.
For example:
if the request came from a client with an ID=1, we make the HttpWebRequest calls to, say, http://serverA.com?p1=v1&p2=v2
if the request came from a client with an ID=2, we make the HttpWebRequest calls to, say, http://serverB.com?p1=v1&p2=v2
We get several million requests a day to the service with various client IDs (1, 2, 3...), and frequently a large number of concurrent requests from clients with the *same* client ID, i.e. we might get about 100 concurrent requests with a client ID of 1, so we try to make 100 HttpWebRequests to serverA.com.
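Roughly, the handler looks something like this (heavily simplified sketch; the class and the way the target URL is obtained are illustrative, not our actual code):

using System;
using System.IO;
using System.Net;
using System.Web;

// Heavily simplified sketch of the handler described above; names and the way
// the target URL is obtained are illustrative, not the actual production code.
public class FetchDataHandler : IHttpAsyncHandler
{
    private class FetchState
    {
        public HttpContext Context;
        public HttpWebRequest Outbound;
    }

    public bool IsReusable { get { return true; } }

    public IAsyncResult BeginProcessRequest(HttpContext context, AsyncCallback cb, object extraData)
    {
        // The fetch_url query parameter identifies the downstream server to call.
        string fetchUrl = context.Request.QueryString["fetch_url"];
        HttpWebRequest outbound = (HttpWebRequest)WebRequest.Create(fetchUrl);

        FetchState state = new FetchState();
        state.Context = context;
        state.Outbound = outbound;

        // Start the outbound call asynchronously so no request thread is blocked
        // while we wait on the downstream server.
        return outbound.BeginGetResponse(cb, state);
    }

    public void EndProcessRequest(IAsyncResult result)
    {
        FetchState state = (FetchState)result.AsyncState;
        using (HttpWebResponse response = (HttpWebResponse)state.Outbound.EndGetResponse(result))
        using (Stream body = response.GetResponseStream())
        {
            // Relay the downstream response back to the original caller.
            byte[] buffer = new byte[8192];
            int read;
            while ((read = body.Read(buffer, 0, buffer.Length)) > 0)
            {
                state.Context.Response.OutputStream.Write(buffer, 0, read);
            }
        }
    }

    public void ProcessRequest(HttpContext context)
    {
        throw new NotSupportedException("Only the async entry points are used.");
    }
}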
After the service runs for a while (a few hours) and successfully serves a couple of million requests, we notice the clients start getting 503 Service Unavailable errors even though the CPU is pretty idle.
We captured a memory dump (using DebugDiag) and the analysis showed groups of requests to the *same* server (as identified by the fetch_url query param below) being "alive" in the ASP.NET pipeline. For example, here's a modified version of the capture (IP addresses/URLs edited for privacy reasons):
Client connection from XX.XX.XXX.XX:56300 to YY.YY.YYY.YY:80
Host Header www.myserver.com:80
POST request for /fetch.data?clientid=1&fetch_url=http://serverA.com?p1=v1&p2=v2
HTTP Version HTTP/1.1
SSL Request False
Time alive 02:16:53
HTTP Request State HTR_READING_CLIENT_REQUEST
Native Request State NREQ_STATE_PROCESS
Client connection from XX.XX.XXX.XX:28319 to YY.YY.YYY.YY:80
Host Header www.myserver.com:80
POST request for /fetch.data?clientid=1&fetch_url=http://serverA.com?p11=v11&p22=v22
HTTP Version HTTP/1.1
SSL Request False
Time alive 02:16:51
HTTP Request State HTR_READING_CLIENT_REQUEST
Native Request State NREQ_STATE_PROCESS
This indicates that these two requests came in at more or less the same time (based on the Time Alive value above). I've seen as many as 10 requests that get stuck in the queue at more or less the same time.
If I have 100 requests in the queue for the same server address and I have maxConnections set to 20, my understanding is that the framework queues the requests up until the previous requests are completed.
Any reasons why these connections can get "stuck" if they are coming at a rapid pace and trying to make requests to the same URL on the other side?
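As a sanity check, something like this (rough diagnostic sketch, not our production code) would let us see how many outbound connections are actually open to a given host versus the configured limit:

using System;
using System.Net;

// Diagnostic sketch only: show how many outbound connections are actually open
// to a downstream host versus the configured limit for that host. Call this from
// inside the web application (e.g. on a timer) so it sees the same ServicePoints
// the handler uses.
class ServicePointProbe
{
    static void Dump(string url)
    {
        ServicePoint sp = ServicePointManager.FindServicePoint(new Uri(url));
        Console.WriteLine("Host: " + sp.Address.Host
            + "  ConnectionLimit: " + sp.ConnectionLimit
            + "  CurrentConnections: " + sp.CurrentConnections);
    }
}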
Thanks
Re: Some requests get "stuck" in the app queue over time resulting in 503s..
Jun 05, 2008 07:30 AM | Rovastar
There are maximum numbers of connections above which things get unstable (I cannot remember the figures off the top of my head), and memory is equally important in this scenario for the number of connections. I presume that is all OK?
How many concurrent connections do you have? Is that when the problem cases occur?
Also look at the firewall. I have seen real problems with transferring very time-sensitive/batched data and connections across one: everything looks fine at the OS and IIS level, but the problem is occurring at the network level.
Compare with the IIS logs around the actual problem cases. Although your requests might not be in there, you can see what else is happening at the time, so you know when the first piece was sent to the server (time minus time-taken in milliseconds) and when the last one was sent out. Was anything taking a long time when the request was sent?
I rip out all the IIS log data for a time period and place it in Excel or something so you can graphically see the patterns of time-taken, error frequency over time, etc. You might be able to notice patterns: maybe many requests grouping on long times like 10 seconds, or on the hour, or problems only occurring a minimum of 60 minutes apart (maybe something recycles and the first request always fails while the rest are fine; then the app pool recycles and the first one fails again).
Compare this data with the times from your firewall.
Maybe use a network monitoring tool to confirm traffic, times, etc.
Obviously there is a lot of speculation there, and I am drawing on my experience of troubleshooting previous problems like this. The most important thing is to keep an open mind and gather as much information as you can; then things will slot into place.
Re: Some requests get "stuck" in the app queue over time resulting in 503s..
Jun 05, 2008 10:37 AM | cbarnard
Thanks Rovastar for your tips.
Here are some more details:
1. I don't see anything abnormal in the httperr logs - mostly Timer_ConnectionIdle and occasional Timer_MinBytesPerSecond messages
2. Clients start receiving 503s when the perfmon counter "ASP.NET\Requests Current" shows a value of 5000 (due to the "stuck" requests)
3. Memory and other system parameters are OK as observed in perfmon
4. Please note that for each incoming request, we make an outgoing request. We have persistent connections open on both ends (to the load balancer on one side and to many servers on the other)
5. We currently have the "maxConnection" param set to 20. However, we can get as many as a few hundred requests for the same client ID, i.e. this results in an attempt to make that many HttpWebRequests to the same host.
6. I've noticed that bumping up "maxConnection" to 48 (we're on a quad-CPU box) makes things a little better. Requests still get stuck in the app queue, but not at the same rate (the settings involved are sketched below).
Looks like there's a direct correlation between the number of requests to the *same* host and the maxConnection param. The question is why some get stuck, since if there are more requests than available connections they should simply be queued up and executed as connections become available.
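For reference, the knob in question is the per-host connection limit that HttpWebRequest uses. A rough sketch of the equivalent ways to set it (the value 48 is just the number mentioned above, not a recommendation):

using System;
using System.Net;

// Rough sketch of the equivalent ways to raise the per-host outbound connection
// limit used by HttpWebRequest. The value 48 is only the number mentioned above.
// In config this is:
//   <system.net>
//     <connectionManagement>
//       <add address="*" maxconnection="48" />
//     </connectionManagement>
//   </system.net>
class ConnectionLimitSetup
{
    static void Configure()
    {
        // Applies to ServicePoints created after this point.
        ServicePointManager.DefaultConnectionLimit = 48;

        // Or raise the limit for one specific downstream host only.
        ServicePoint serverA = ServicePointManager.FindServicePoint(new Uri("http://serverA.com/"));
        serverA.ConnectionLimit = 48;
    }
}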
Thanks
Re: Some requests get "stuck" in the app queue over time resulting in 503s..
Jun 05, 2008 11:36 AM | Rovastar
I will have to give it some thought. I get confused with all the possible options, and the theory behind very high-volume websites versus what the real world needs.
Interesting that there is nothing in the httperr log. That would be too easy, wouldn't it?
So what can we conclude from that? That the http.sys level is all OK and the bottleneck is occurring further down the stack? I think that is a safe bet. Also, since there are no http.sys errors, is there IIS log information?
So what do you have for the 503? What sub-status code? What Win32 error code? (I am still surprised how useful the Win32 error code is for troubleshooting, sometimes giving that extra little piece of information to help you solve a problem.)
I imagine it might be Win32 error 64, a timeout ("The specified network name is no longer available"). Hmm. What does that say about the situation?
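If it helps, a quick and dirty way to pull those columns out of a W3C-format IIS log (sketch only; the log path is an example, and it assumes sc-substatus, sc-win32-status and time-taken are enabled in the site's logging properties):

using System;
using System.IO;

// Sketch: list the sub-status, Win32 status and time-taken for every 503 in a
// W3C-format IIS log, ready to eyeball or paste into Excel. The log path is an
// example; field positions are taken from the #Fields directive in the file.
class Find503s
{
    static void Main()
    {
        string[] fields = null;
        foreach (string line in File.ReadAllLines(@"C:\WINDOWS\system32\LogFiles\W3SVC1\ex080605.log"))
        {
            if (line.StartsWith("#Fields:"))
            {
                fields = line.Substring("#Fields:".Length).Trim().Split(' ');
                continue;
            }
            if (fields == null || line.StartsWith("#")) continue;

            string[] values = line.Split(' ');
            if (values[Array.IndexOf(fields, "sc-status")] != "503") continue;

            Console.WriteLine("time={0} substatus={1} win32={2} time-taken={3}",
                values[Array.IndexOf(fields, "time")],
                values[Array.IndexOf(fields, "sc-substatus")],
                values[Array.IndexOf(fields, "sc-win32-status")],
                values[Array.IndexOf(fields, "time-taken")]);
        }
    }
}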
Is all this on one box? Do you have multi-tier?
e.g.
Web front-end tier
+ Application tier
+ Database tier
Confirm in which tier the requests and/or errors are bottlenecking or erroring. Let's make sure you are concentrating your efforts in the right area.
If I remember correctly this was a useful guide when I referred to it previously, although it is .NET 1.1: http://www.eggheadcafe.com/articles/20050613.asp
Other tuning stuff to think about: are you keeping the connections alive too long, so that many connections build up? Do they need to be kept open that long? If CPU is low and you don't mind it, reduce the keep-alives (or, probably better for real-world environments, reduce them to, say, 30 seconds). I don't know enough about your setup, but consider keeping connections open for an optimum, realistic timeframe. Reducing the connections (and try looking at various caches, etc.) could help your high-volume requests through, which I also presume are small.
If you don't mind the extra CPU/memory cost, then maybe you can change the many parameters in IIS/.NET to get the best outcome. It is all about balance. More thoughts later. I hope these are useful to you.
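For the outbound side, one knob along those lines is how long idle connections to the downstream servers are kept open (the IIS connection timeout for the inbound side is a separate setting). A rough sketch, with an example value only:

using System;
using System.Net;

// Sketch: close outbound connections that have sat idle for 30 seconds instead
// of the default 100 seconds. The value is an example, not a recommendation.
class KeepAliveTuning
{
    static void Configure()
    {
        ServicePointManager.MaxServicePointIdleTime = 30 * 1000; // milliseconds
    }
}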
Re: Some requests get "stuck" in the app queue over time resulting in 503s..
Jun 05, 2008 11:49 AM | Rovastar
I did a little more digging. I knew I had seen your magic number of 5000 somewhere regarding connections but could not remember where:
"MaxUserPort (TCPIP.SYS)
Controls the max port number that TCP can assign. Every unique client making a request to your web server will use up at least one of these ports on the server. Web applications on the server making outbound SQL or SMB connections also use up these ports on the server... so it highly affects the number of concurrent connections. For SMB tuning, read the de-facto IIS6 and UNC Whitepaper.
http://blogs.msdn.com/david.wang/archive/2006/04/12/HOWTO-Maximize-the-Number-of-Concurrent-Connections-to-IIS6.aspx
HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters Type: DWORD Value: Range from 5000 to 65536"
By default, Microsoft Windows Server 2003 sets the MaxUserPort value to 5000, so if you haven't changed it, that is the limit on the total number of TCP/IP connections to your server.
http://technet.microsoft.com/en-us/library/bb397382(EXCHG.80).aspx
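If you want to check quickly whether MaxUserPort has ever been overridden on the box, something along these lines will tell you (sketch only; a missing value means the OS default is in effect):

using System;
using Microsoft.Win32;

// Sketch: read the current MaxUserPort setting from the registry. If the value
// is absent, the Windows Server 2003 default of 5000 applies.
class MaxUserPortCheck
{
    static void Main()
    {
        using (RegistryKey key = Registry.LocalMachine.OpenSubKey(
            @"SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"))
        {
            object value = (key == null) ? null : key.GetValue("MaxUserPort");
            Console.WriteLine("MaxUserPort = " + (value ?? "(not set, default 5000)"));
        }
    }
}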