IIS 7 and Above
Intermittent Server 500 Errors / FastCGI Debugging/Tracing
Last post Mar 12, 2010 11:54 AM by sethmay
Mar 04, 2010 03:27 PM | sethmay
I'm really hoping that folks might have some ideas for how I can gather more detailed information about these problems.
We've been experiencing a number of problems in our IIS 7 environment. We see two different things happening:
I've been working on these issues for several weeks and have done due diligence in trying to fix the problem with most of the resources at my disposal. I already tried opening an incident with Microsoft; they dropped me like a hot potato when they saw I was using PHP.
We ran a previous version of our web app on this same machine from January 2009 to January 2010, and we released the new version on this exact same machine on February 1, 2010. The only differences were that MaxInstances was set to 10, PHP did not have the SQLSRV and WinCache extensions, and we were using PHP 5.2.8. The previous version of the application ran flawlessly in that environment.
I've run the current version of the application on PHP 5.2.8 without WinCache (it cannot run without the SQLSRV driver) and got the same results. I hope this rules out the WinCache extension, or some change in PHP between 5.2.8 and 5.2.12.
I've set up similar environments on other servers with the same results. To me, this is strong evidence that hardware (network cards, memory, etc.) is not the issue. In my isolated environments, I also tested without the SSL certificate and still had the issues, which makes me think the SSL layer is not the problem.
The Windows System log is clean (no related errors or warnings).
PHP's log file has nothing related (its timestamps don't line up with the 500 errors seen in the traces and IIS log files). Verbose logging from the SQLSRV driver doesn't turn up anything interesting either.
IIS's Failed Request Tracing rules and our IIS log files are the only places where I can see any evidence of the Server 500 errors our users are experiencing. These logs show a couple of things:
Our most common trace file warnings are the following (the first is far more common, and thus more concerning, than the second):
I'm happy to pass full trace files on to anyone who wants a closer look; I get several thousand a day. I can find no useful information anywhere about either error (especially the first one).
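With thousands of failures a day, a first step might be tallying the 500s in the IIS W3C log by sub-status and Win32 status, so the two kinds of failure can be counted and lined up by time. The sketch below assumes the log's #Fields directive includes sc-status, sc-substatus and sc-win32-status; the function name tally_500s and the WIN32_NAMES table are illustrative helpers, not part of IIS. Codes 64 and 995 are the standard Win32 values behind the "network name is no longer available" and "I/O operation has been aborted" messages, but whether they appear in this particular log is an assumption.

```python
from collections import Counter
from typing import Iterable

# Win32 codes that IIS may record in sc-win32-status (names from winerror.h).
# Which codes actually show up in a given log is an assumption.
WIN32_NAMES = {
    64: "ERROR_NETNAME_DELETED (network name is no longer available)",
    995: "ERROR_OPERATION_ABORTED (I/O operation has been aborted)",
}


def tally_500s(lines: Iterable[str]) -> Counter:
    """Count HTTP 500 responses in a W3C extended log by (sc-substatus, sc-win32-status)."""
    fields: list[str] = []
    counts: Counter = Counter()
    for raw in lines:
        line = raw.strip()
        if line.startswith("#Fields:"):
            # The #Fields directive names the space-separated columns that follow it.
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#") or not fields:
            continue  # skip blank lines, other directives, and rows before #Fields
        row = dict(zip(fields, line.split()))
        if row.get("sc-status") != "500":
            continue
        win32 = row.get("sc-win32-status", "0")
        # Use -1 when the Win32 status is missing or not numeric.
        key = (row.get("sc-substatus", "-"), int(win32) if win32.isdigit() else -1)
        counts[key] += 1
    return counts
```

Passing an open log file (for example a daily u_ex*.log) to tally_500s and printing counts.most_common() gives a per-code breakdown. Bucketing the same rows by the time column would then show whether the second error arrives in minute-long bursts.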
I really need to find a way to pinpoint the issue. Is there a way to get more information out of IIS or FastCGI? Does anyone have any ideas of routes I have not taken? Should this go in a different forum?
Mar 04, 2010 05:19 PM | ksingla
The error "network name is no longer available" is thrown when a request that is about to be processed by FastCGI has already been disconnected. FastCGI aborts the request with a 500. This was done to protect against the F5 attack, so this error is not a concern. We need to get to the bottom of the second error you are seeing. I think something is causing php-cgi.exe processes to crash in rapid succession, which is causing FastCGI to go into failure protection mode. Once in failure protection mode, FastCGI will return a 500 for a minute before attempting to execute php-cgi again. Do you know whether php-cgi.exe processes are crashing on the server? Would you happen to have a full memory dump of the crash?
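To see why crashing php-cgi.exe processes would surface as bursts of 500s for unrelated users, here is a toy model of failure protection as described above: enough crashes inside a sliding window switch the application into a mode where every request gets a 500 for a cooldown period. This is a sketch for reasoning only, not IIS's implementation; the class RapidFailGuard and its max_fails, window and cooldown parameters are hypothetical, and their defaults are illustrative rather than IIS's actual thresholds.

```python
from collections import deque


class RapidFailGuard:
    """Toy model of FastCGI-style failure protection (hypothetical parameters)."""

    def __init__(self, max_fails: int = 10, window: float = 60.0, cooldown: float = 60.0):
        self.max_fails = max_fails  # crashes tolerated inside the window (assumed value)
        self.window = window        # sliding window length in seconds (assumed value)
        self.cooldown = cooldown    # seconds during which every request gets a 500
        self.crashes = deque()      # timestamps of recent crashes, oldest first
        self.disabled_until = 0.0

    def record_crash(self, now: float) -> None:
        self.crashes.append(now)
        # Forget crashes that have slid out of the window.
        while self.crashes and now - self.crashes[0] > self.window:
            self.crashes.popleft()
        if len(self.crashes) >= self.max_fails:
            # Too many crashes too quickly: enter protection mode.
            self.disabled_until = now + self.cooldown
            self.crashes.clear()

    def status_for_request(self, now: float) -> int:
        # In protection mode, requests fail fast without starting php-cgi.exe.
        return 500 if now < self.disabled_until else 200


guard = RapidFailGuard()
for t in range(10):
    guard.record_crash(float(t))       # ten crashes in ten seconds
print(guard.status_for_request(30.0))  # 500: still inside the cooldown (until t=69)
print(guard.status_for_request(70.0))  # 200: cooldown has elapsed
```

The point of the model is that the 500s seen by users during the cooldown are a symptom; the crashes that trip the guard are the thing a crash dump needs to capture.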
Mar 05, 2010 12:36 PM | sethmay
You've confirmed my suspicions concerning the "network name is no longer available". My load tests gave hints that these messages were related to severed connections.
I'm currently working on getting a memory dump of the crashes. I haven't done this before, so it is taking me a bit of time. I'm trying to use WinDbg to attach to the w3wp.exe process assigned to the relevant app pool, but so far I seem to lock up the process (even in noninvasive mode). I'll check with the other developers here to see if any of them have experience with this.
I hope to have more data/info within a few hours.
Mar 05, 2010 08:30 PM | ksingla
http://support.microsoft.com/kb/286350 might be helpful in creating the crash dump.
Mar 12, 2010 11:54 AM | sethmay
Thanks for pointing me to that article. I set ADPlus to monitor php-cgi.exe in crash mode, as well as the w3wp.exe process assigned to the relevant app pool.
It has generated a number of dump files from php-cgi.exe, one of which corresponds to the relevant "The I/O operation has been aborted because of either a thread exit or an application request. (0x800703e3)" error. Truthfully, I've looked through the log and run Debug Diagnostic on the dump file, but I have no idea how to interpret this information effectively.
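One small step in reading this output is decoding the HRESULT itself. An HRESULT packs a severity bit, an 11-bit facility and a 16-bit code; facility 7 (FACILITY_WIN32) means the low 16 bits are a plain Win32 error code, which can then be matched against sc-win32-status in the IIS logs. The sketch below mirrors the HRESULT_FROM_WIN32 macro from the Windows SDK; the Python function names are illustrative.

```python
FACILITY_WIN32 = 7


def hresult_from_win32(err: int) -> int:
    """Mirror the HRESULT_FROM_WIN32 macro: wrap a positive Win32 error code."""
    if err <= 0:
        return err & 0xFFFFFFFF  # the macro passes zero/negative values through unchanged
    return 0x80000000 | (FACILITY_WIN32 << 16) | (err & 0xFFFF)


def decode_hresult(hr: int) -> tuple[int, int, int]:
    """Split a 32-bit HRESULT into (severity bit, facility, code)."""
    severity = (hr >> 31) & 0x1
    facility = (hr >> 16) & 0x7FF  # the facility field is 11 bits wide
    code = hr & 0xFFFF
    return severity, facility, code


sev, fac, code = decode_hresult(0x800703E3)
print(sev, fac, code)               # 1 7 995 -> FACILITY_WIN32, ERROR_OPERATION_ABORTED
print(hex(hresult_from_win32(64)))  # 0x80070040 -> ERROR_NETNAME_DELETED
```

So 0x800703e3 is Win32 error 995 (ERROR_OPERATION_ABORTED) wrapped as an HRESULT, and the "network name is no longer available" message corresponds to Win32 error 64 (ERROR_NETNAME_DELETED).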