We are excited to announce that the IIS.NET Forums are moving to the new Microsoft Q&A experience. Learn more >

IIS / ARR stops respondingRSS

17 replies

Last post Jun 17, 2010 06:51 AM by Mark S. Rasmussen

  • IIS / ARR stops responding

    May 26, 2010 04:05 PM|Mark S. Rasmussen|LINK

    We've been having some serious issues with our IIS ARR load balancer machines and I'm currently on the brink of dropping them alltogether, so I really hope someone can give me a clue on how to debug our issues.

     

    We have three content servers. Initially we had a single IIS ARR load balancer in front of them, running at about 40% CPU load, everything was good. Once in a while it'll completely stop responding, all outside requests just hang without receiving a response, despite the load balancer being up and all content servers responding normally. A recycle does not help, neither does an iisreset. A reboot of the machine does seem to help, but sometimes it doesn't start responding immediately upon reboot, might require an iisreset even after reboot before it starts responding again.

     

    To alleviate this we've setup a completely new IIS ARR load balancer machine, as well as reformatted the old one so we now have to completely identical load balancer machines - Dell PE1950's, Windows Web Server 2008 R2 X64, 100% clean install only including IIS + related features and NLB. The load balancers run in an NLB setup with no affinity (using IIS ARR client affinity). This works perfectly, right up untill the point at which both machines stop responding to requests at the same time.

     

    I've installed Wireshark on both machines and by using that I've verified that both load balancers are indeed receiving the incoming HTTP traffic while they're not responding, so it does not seem to be switch / related to the hardware network stack. If I request 127.0.0.1 or one of our domains (mapped to 127.0.0.1 using the hosts file) on the load balancers themselves, I receive a perfect reply - but if I request the same on one of the other machines on the LAN, no reply is received. I can ping the load balancers just perfectly.

     

    It seems like the HTTP traffic is received on the machine, but somehow it's not piped onto the IIS process. FREB does not show any logs at all, neither does the native IIS logs - both show requests if I make a local request.

     

    If only one load balancer had this issue, I might brush it off as a weird occurrence. But these are two brand newly formatted machines, fully patched, newest OS and running ARR 2.0 x64. Both machines stop responding at the exact same time which I find very odd as that suggests some external issue - but Wireshark seems to suggest it's local to the machines. The issue also happened before NLB so I do not suspect NLB has any effect on the issue. To make it even more weird - when both machines stop responding and iisreset's does not help, I'll reboot one of the machines and 5 mins later when it's rebooted, both machines will start responding again (including the one that I did not reboot!).

     

    When in NLB both load balancers are using about 20% CPU (quad core), pushing about 250 requests/sec each. They're in no way under heavy load, but they're not standing idle either.

     

    If anyone has ANY suggestions on how to debug this issue, I'd be thrilled. At this point I'm very close to ditching IIS ARR alltogether and going back to a pure NLB setup on the content servers directly - but I'd really prefer not to.

     

    Thanks,
    Mark S. Rasmussen

  • Re: IIS / ARR stops responding

    May 26, 2010 06:34 PM|richma|LINK

    Hi Mark,

    The symptoms seem to imply that the servers are experience resource issues except for the fact that rebooting one resolves the issues on both. But the fact that the IIS log is not logging anything in the IIS pipeline this certainly does seem like a resource issues. Here are some items I would check .

     --The HttpSys logs for any errors logged during the time frame of the outage: C:\Windows\System32\LogFiles\HTTPERR

    -- Event logs : I would look at these for any events that indicate issues with system memory  such as .  Or anything elese

    Event ID 2019
    Event Type: Error
    Event Source: Srv
    Event Category: None
    Event ID: 2019
    Description: The server was unable to allocate from the system NonPaged pool because the pool was empty.

    Paged Pool Depletion is logged as an Event 2020:

    Event ID 2020
    Event Type: Error
    Event Source: Srv
    Event Category: None
    Event ID: 2020
    Description: The server was unable to allocate from the system paged pool because the pool was empty.

     -- Run Netstat  to see if there are an Port exhaustion issues. The local requests might be bypassing this if it were occurring.

    http://blogs.technet.com/b/askds/archive/2008/10/29/port-exhaustion-and-you-or-why-the-netstat-tool-is-your-friend.aspx 

    -- Start  gethering data with Performance monitor to look for patterns leading up to the issue. I would start with the following counters.

    • Arr:http://learn.iis.net/page.aspx/691/using-performance-counters/
    • Memory  >. Looking for spikes here
    • Web Service > suspect this will validate a drop in Requests
    • <div style="MARGIN: 0in; FONT-FAMILY: Calibri; FONT-SIZE: 11pt">Http Service > Help find if there are issues in HTTP Sys layer</div>
    • <div style="MARGIN: 0in; FONT-FAMILY: Calibri; FONT-SIZE: 11pt">Http Service Request Queues</div>
    • <div style="MARGIN: 0in; FONT-FAMILY: Calibri; FONT-SIZE: 11pt">Process > help find any rougue processes hogging CPU or anything.</div>

     

    Could you also describe how you are using cache both the Kernel Memory cache and ARR disk cache if being used.

    I have seen an issue where the scavenging time was increased for httpsys and this was causing memory issues.

    See  URiScavengingPeriod    http://support.microsoft.com/kb/820129

     

    Let us know if you have further questions

     

    Thanks

     

    Richard

     

  • Re: IIS / ARR stops responding

    May 27, 2010 02:22 AM|Mark S. Rasmussen|LINK

    Hi Richard,

    Thanks for your very valuable input!

    Let me start out with the simple issues

    • <div mce_keep="true">Event log is clean, there's nothing indicating any issues here.</div>
    • <div mce_keep="true">The machines have 8GB's of memory and always 5+GB free.</div>
    • <div mce_keep="true">Caching is disabled in ARR and kernel cache duration is set to 0 seconds (see http://forums.iis.net/p/1165459/1935831.aspx#1935831 for why I've disabled caching).</div>
    • <div mce_keep="true">I will try and run a netstat next time the issue occurs.</div>

    I did find some correlated entries in the HTTPSYS logs during our last outage last night! I've previously missed these as it seems the HTTPSYS logs are not following the server clock, perhaps running in GMT time I assume.

    The HTTPSYS logs have lots of Timer_ConnectionIdle messages which should be perfectly normal. However, at the exact time of our outage, I first get a couple of these:

    Connection_Dropped_List_Full

    5 seconds later I get about 20 Connection_Dropped within the same second.

    For the next 30 seconds I get a series of Timer_ConnectionIdle followed by two Timer_MinBytesPerSecond and then the pattern repeats with two Connection_Dropped_List_Full's and then about 20 Connection_Dropped. Shortly after all hell breaks loose and it seems all requests are failing with either Connection_Dropped_List_Full, Connection_Dropped, Connection_Abandoned_By_ReqQueue or "503 Service unavailable". This continues right untill the point where I reboot the machine, after which it goes back to the usual Timer_ConnectionIdle messages.

    These patterns are exactly the same for both servers around the same time!

    Googling the error codes suggest they may happen due to the worker processes failing, but they're all running perfectly and there's no entries in the event logs. By getting a reply to a local request, it seems to suggest it's not the worker processes that are failing.

    Let me know if you have any further suggestions while I'm googling to find any further info on the error codes & possible causes.

    Thanks,
    Mark S. Rasmussen

  • Re: IIS / ARR stops responding

    May 27, 2010 03:19 AM|Mark S. Rasmussen|LINK

    I definitely think the resource leak theory is likely the cause. This might also explain why both machines apparently go down at the same time - NLB distributes the load evenly and without affinity. Since both machines are 100% identical, they should technically experience the issue at the same time if the load is distributed 50/50.

    I'm going to set the NLB distribution to 75/25 as this should confirm whether it's a resource leak that's affecting both machines at the same time - setting it to 75/25 should mean one machine would go down while the other will stay up - both confirming the issue and increasing my availability in the short term, untill a cause/solution has been found. I'll report back if/when the issue occurs again and whether the NLB distribution change has any affect.

    Thanks,
    Mark

  • Re: IIS / ARR stops responding

    May 27, 2010 11:56 AM|richma|LINK

    Mark,

    I am looking into the  HTTPerr entries, Can you tell me what OS you are on 2008 or 2008R2 ?

     Thanks

    Richard

  • Re: IIS / ARR stops responding

    May 27, 2010 12:09 PM|Mark S. Rasmussen|LINK

    Hi Richard,

    Both machines are running a fully patched Windows Web Server 2008 R2 x64. Both machines were freshly formatted & installed last week.

    If it's any help, I can send you the log that details the outage including all events before, under and after the actual outage.

    Thanks,
    Mark S. Rasmussen

  • Re: IIS / ARR stops responding

    Jun 02, 2010 05:07 AM|Mark S. Rasmussen|LINK

    We just had the same issue again. Both load balancers stopped responding at the same time, same pattern as last time. Changing the NLB distribution did not have any effect - that's one theory down the drain. Looking through perf counters now for clues.

  • Re: IIS / ARR stops responding

    Jun 02, 2010 10:07 AM|richma|LINK

     

    Mark , If you would like to forward any of the data for review please send to webpi at microsoft.com.

    Thanks

    Richard

  • Re: IIS / ARR stops responding

    Jun 04, 2010 04:18 AM|Mark S. Rasmussen|LINK

    I believe I've found the core reason, and to my delight, SELECT does not seem to be broken.

    By wading through our IIS logs, HTTP.sys logs and application logs, I've pinpointed a range of IP adresses that has been abusing our system.

    We have very long execution timeouts on both our content servers and on our ARR servers since we need to serve large PDF files (200MB+) to our visitors. Some of these visitors will have very slow connections and can thus take upwards of an hour to complete a request.

    What I've seen in our application logs is that this certain range of IP adresses have made a large number of requests for PDF files - each consuming about 400KB after which they just hang and don't accept further data - though the IIS keeps the connection open and tries to send data. As a result, these stale connections have quickly filled up our request queue, causing the 503's and HTTP.sys log entries.

    According to my perfmon logs (from one of the ARR machines) the request queue peaked at about 1200 before I restarted the HTTP service last time - similar data values on the secondary ARR machine. I'm not sure why I would get the 503's at this point as my requet queue limit is set to 10.000 in the application pool setting.

    I'm contemplating how to protected ourselves from a similar occurences. For the short term I've blocked the abusing IP range on the ARR machines. Long term I'll be adding some application code to limit the number of concurrent non-complete file downloads from a specific IP, but this attack could be performed on a 5kb gif it would seem. If I am to reduce the execution timeout, I'll have to setup separate sites for file download and application code, limiting the attack to only the file download site. Assuming I don't get 503's before I hit the 10k limit, I could technically increase this limit to 25k as none of the stale connections take up any valuable resources besides a connection - and being on 2008 R2 x64 with plenty of memory, it should be able to handle 25k connections. At the same time I'd have to increase the concurrent execution count so the valid requests in the request queue get handled.

    Say I request a 300MB file from the ARR machine on a 56k connection - will the ARR machine fetch the 300MB file from the content server immediately or will it basically forward the content server stream to the client - potentially clogging up not only the ARR machine but the content server too? I'm assuming it'll forward the stream to avoid excessive memory load - in which case I'll have to take protective measures on both the ARR and the content servers.

    Basically it seems my best bet is to increase logging and make sure to quickly detect attacks like these and then block their IP's. Any other suggestions are of course welcome :)

    Thanks,
    Mark

  • Re: IIS / ARR stops responding

    Jun 04, 2010 11:17 AM|richma|LINK

    Mark,

    When you refer to stale connections are you seeing this at the TCP layer (Netstat output) where alot of connections are in TIME_WAIT.  If so this may be affecting ARRs ability to make outbound connections and contribute to your queuing.

    You can look at tuning  the  MaxUserPort and TcpTimedWaitDelay settings at the TCP layer.

     http://blogs.technet.com/b/tristank/archive/2008/03/11/maxuserport-what-it-is-what-it-does-when-it-s-important.aspx

     Richard

  • Re: IIS / ARR stops responding

    Jun 07, 2010 05:23 AM|Mark S. Rasmussen|LINK

    I don't think it's at the TCP layer, though I will need to check the next time it happens. During normal peak load we have around 3-4000 incoming connections, some being in a TIME_WAIT state while most of them are ESTABLISHED (about 90%, quick guess). Almost all of the TIME_WAIT connections seem to come from proxy servers. At the same time we have about 250 outgoing connections to the content servers, about 99% in the ESTABLISHED state.

    As I understand it, MaxUserPort & TcpTimedWaitDelay are only relevant to the outbound connections and thus a stale incoming connection shouldn't affect the ability to make outgoing connections - please correct me if I'm wrong on this. Either way, it does not seem like we're exhausted of outgoing connections.

    I've included a snapshot of some perfmon data during an outage (raw data at http://improve.dk/IMPERIAL_ARR.blg):

    It clearly shows the event at around 11:01. Things to notice:

    • <div mce_keep="true">Requests/Sec drops</div>
    • <div mce_keep="true">Requests Executing increases</div>
    • <div mce_keep="true">Requests Current increases</div>
    • <div mce_keep="true">Requests Disconnected rises dramatically</div>
    • <div mce_keep="true">Requests Queued increases to about 1200 before an IISreset clears it</div>
    • <div mce_keep="true">Request Execution Time increases from near nothing to much higher values</div>

    It seems my malicious IP theory might not have been the cause after all as we just had another outage and so far my logs don't show anything similar as the last times - red herring it seems. Again both ARR machines went down at the same time so the problem seems to be identical on both servers, even with a 60/40% NLB distribution without affinity.

    On the next outage I'll make sure to log a netstat -n so I've got that data available as well. If there's anything else I can log / detail about our setup, please let me know.

  • Re: IIS / ARR stops responding

    Jun 07, 2010 11:03 AM|richma|LINK

    My theory was that possible issues at theTCP layer were causing problems with ARR's ability to make outbound connections , tying up threads in IIS and thus hurting IIS ability to accept new connections and subsequently proxy.

    Given that  we now know that Requests Executing  + Queueing increases I would be curious to see whats going on in IIS via memory dump analysis of the IIS worker processes . Gathering the dumps can be done from Task Manager  ( Right click on any instance of W3wp and "Create Dump File" ) . We dont have any automated tools for the analysis this so this depends on your level of expertise at debugging.  If you go that route I can try to assist.

    Richard

  • Re: IIS / ARR stops responding

    Jun 08, 2010 05:15 AM|Mark S. Rasmussen|LINK

    Thanks for hanging in there.

    On the next outage I'll get the following data, besides the performance counter logs:
    netstat -n -t > "D:\netstat -n -t.txt"
    appcmd list apppools > "D:\appcmd list apppools.txt"
    appcmd list wps > "D:\appcmd list wps.txt"
    appcmd list requests > "D:\appcmd list requests.txt"

    Furthermore I'll get crash dumps of all worker processes (running 4 wp's) and take a peek at them, I assume a look at the managed thread stack traces would be a good start.

    Thanks,
    Mark

  • Re: IIS / ARR stops responding

    Jun 14, 2010 08:35 AM|Mark S. Rasmussen|LINK

    I'm back again with some more info from our last incident.

    netstat -n -t
    The amount of connections is the same as usual. However, there's a number (5% or so) of them that are in the CLOSE_WAIT and LAST_ACK state - usually there's none in these states during normal load.

    appcmd list apppools
    APPPOOL "DefaultAppPool" (MgdVersion:v2.0,MgdMode:Integrated,state:Started)

    appcmd list wps
    WP "3952" (applicationPool:DefaultAppPool)
    WP "3560" (applicationPool:DefaultAppPool)
    WP "3748" (applicationPool:DefaultAppPool)
    WP "3828" (applicationPool:DefaultAppPool)

    appcmd list requests
    This one is much more interesting. The data looks similar for both load balancer machines (both went down simultaneously as usual).

    First off, of the ~1000 requests currently executing, 25% of those are like the following:

    REQUEST "4800060080002c9d" (http://forums.iis.net/GET /XYZ/, time:94303 msec, client:111.222.333.444, stage:ExecuteRequestHandler, module:ApplicationRequestRouting)
    REQUEST "6200060080007f27" (
    http://forums.iis.net/GET /XYZ/, time:93663 msec, client:111.222.333.444, stage:ExecuteRequestHandler, module:ApplicationRequestRouting)
    REQUEST "f500070080006247" (
    http://forums.iis.net/GET /XYZ/, time:91994 msec, client:111.222.333.444, stage:ExecuteRequestHandler, module:ApplicationRequestRouting)
    REQUEST "8e00000380006247" (
    http://forums.iis.net/GET /XYZ/, time:86784 msec, client:111.222.333.444, stage:ExecuteRequestHandler, module:ApplicationRequestRouting)
    REQUEST "7400000280005353" (
    http://forums.iis.net/GET /XYZ/, time:86612 msec, client:111.222.333.444, stage:ExecuteRequestHandler, module:ApplicationRequestRouting)

    Note that I've masked the URL and IP addresses to protected the potentially innocent. Of the 25% requests stemming from the same IP, they're distributed 50%/50% on two different URLs. The URL's requested are valid and usually return within milliseconds so there's no reason for them to be hanging for 80-90secs like the above requests. Performing an IP lookup reveals that it belongs to a Danish cellphone provider.

    However, that is not the only issue of interest. Of the ~1000 requests, 850 of those are stuck in the ArrCookieRestorer module:
    REQUEST "c000010080009ea9" (http://forums.iis.net/GET /FlashAssets/Fonts/Latin.swf, time:68064 msec, client:999.888.777.666, stage:BeginRequest, module:ArrCookieRestorer)

    Aha! That's my code so naturally I must be the one to blame. The ArrCookieRestorer module code is very simple, pasted in full here (any way to paste formatted code on these boards?):
    http://pastebin.com/cNNPehMH

    I fail to see how that code can hang. I've explicitly used the catchall exception handler as I don't want my load balancers to return an error in case I messed up my code.

    I did uncover an issue in my registration of the module though. While my module is meant to run before the ApplicationRequestRouting module, mine was registered at the end of the list - thus having no effect at all. That leads me to the theory that it might not be my code hanging after all, but simply the IIS hanging in the last module in the pipeline - which my module is, even though it should've been ApplicationRequestRouting, had my module been registered in the correct order).

    Next step
    Currently it seems my issue is either caused by the large number of hanging requests stemming from the single IP, my module hanging, or a combination thereof. As a short-term fix I'll setup centralized session state so I can disable the ARR affinity and thus do without my ArrCookieRestorer module (please see http://improve.dk/blog/2009/12/09/fixing-flash-bugs-by-intercepting-iis-application-request-routing-cookies for a full explanation of why I needed the module in the first place).

    Removing the module should either fix the problem, in case it's solely caused by the module, or at least give me cleaner debug data the next time it happens. Unfortunately I did not get any crash dumps, I'll try and get those the next time.

    Please let me know if any of the above makes sense to you - I'm very interested in why 80% of the requests are hanging in my seemingly harmless module.

    Thanks,
    Mark

  • Re: IIS / ARR stops responding

    Jun 15, 2010 01:33 AM|CarlosAg|LINK

    Out of curiosity, could you remove the use of HttpContext.Current, that actually incurs in a lookup in a HashTable that is going to make things slower and I'm wondering if contention could be an issue.

    I believe you should be able to just do:


    private void context_BeginRequest(object sender, EventArgs e)
    {
        try
        {
            HttpContext context = ((HttpApplication)sender).Context;

     

    The other thing I would recommend is caching a the regular expression that you are using, and potentially get rid of Regex altogether since the code seems pretty simple and I've seen pretty bad issues with RegEx that matched complex input that could lead to significant performance issues and in many cases, specially crafted content can lead to "what could look like almost infinite loops".

     

  • Re: IIS / ARR stops responding

    Jun 15, 2010 03:25 PM|Mark S. Rasmussen|LINK

    Hi Carlos,

    I'll try and setup a synthetic test in an EC2 environment to see if I can reproduce the bug. As I've removed the module from our production systems I'm disinclined to deploy a changed module. If the issue is based on contention I should be able to reproduce it easily in a synthetic environment as I can push a lot more load on that one that we're experiencing on our production systems.

    Both your suggestions are good though none of them would explain why both load balancers stop responding at the same time - with 0% CPU usage.

    I'll try and get the testing up and report back what I find out.

    Thanks,
    Mark

  • Re: IIS / ARR stops responding

    Jun 16, 2010 02:05 PM|richma|LINK

     Hi Mark,

    Answering an earlier question from you.

    Mark S. Rasmussen

    Say I request a 300MB file from the ARR machine on a 56k connection - will the ARR machine fetch the 300MB file from the content server immediately or will it basically forward the content server stream to the client - potentially clogging up not only the ARR machine but the content server too? I'm assuming it'll forward the stream to avoid excessive memory load - in which case I'll have to take protective measures on both the ARR and the content servers.

    Regarding, how much data ARR fetches from the backend, it will fetch as much as what the buffer allows.   You would increase the size of the buffer to download “a lot” of data from the content/backend servers and let the data flow “slowly” to the clients – but now you are moving connection/memory pressure from the content/backend to ARR (which we may not recommend since there may be fewer ARR servers  and many more of content/backend servers.)   

    The settings are   minResponseBuffer and responseBufferLimit.

    The responseBufferLimit is the size of how much buffer to keep for each request (ie. ARR can download as quickly as it can from the content servers up-to this limit.)   The minResponseBuffer is when ARR actually sends it to the client.   ARR doesn’t “stream” the data.  It buffers it and sends it to client in “chunks”.   In order to truly stream, you will set the minresponsebuffer to 0.   The downside of doing this would be that you can not use the  kernel cache benefit (as kernel cache doesn’t do fragmented caching).

     

    Curious on how you are blocking IP's  are you using the IP restriction feature ?

    At this point I think our best understanding of the problem is going to come from the dump of the worker processes which leads to another question.

    It appears you are using Web Gardens for your worker processes . Is there a reason you configured these ?

  • Re: IIS / ARR stops responding

    Jun 17, 2010 06:51 AM|Mark S. Rasmussen|LINK

    Hi Richard,

    Thanks for clearing that out. Right now our ARR servers are using abot 20MB per WP out of 4GB available, so I'd assume I can safely relieve a bit of contention on the content servers by increasing the responseBufferLimit and thus use some of the ARR memory I've got plenty of. Technically we could do with streaming since I've disabled the disk & kernel cache (http://forums.iis.net/p/1165459/1935831.aspx#1935831).

    We're blocking IP's using the IP Address and Domain Restrictions module. We're only blocking about 5 smaller ranges at the moment so it works well enough for us. I'd prefer to do it at a firewall level but unfortunately we don't have direct control of it so it's easier for us to block the IP's at the ARR level.

    Regarding us running as a web garden - we ran the 1.0 RC for a long time, we upgraded to 2.0 RTW near the end of april this year. While running on the RC we found that matching the number of WP's to our number of cores resulted in a more evenly distributed workload on the cores and generally seemed to run more stable. Since switching to 2.0 RTW I haven't experimented with the number of worker processes. Though it's no problem currently, the load seems rather uneven (running 4 WP's). This is a screenshot of taskman of one of our ARR machines (running two in an NLB setup), while it's pushing out about 200Mbit/s of data on average:

     

    CPU times on the cores are 02:34, 06:33, 00:41 and 04:41 since the last reboot.

     - Mark