Hi,
I've added a third content server to our current ARR cluster consisting of a dedicated ARR machine as well as two dedicated content servers behind it.
When I enable the new server, rather quickly the number of current requests grows up to the point where it doesn't receive any traffic anymore due to using lowest-current-requests load balancing.
If I run "appcmd list request" on the ARR machine, there's a number of hints of the outstanding requests. A lot of the requests are basically for static files (we have some code running on them as well, but it's very basic and usually runs in no time), but some are for normal app files as well. Usually we have an avg. response time of about 30ms, so the time values are certainly not usual.
REQUEST "190000408000f700" (url:GET /Modules/FlipMag.swf, time:1019061 msec, client:85.200.96.170, stage:ExecuteRequestHandler, module:ApplicationRequestRouting)
REQUEST "a00000c08000c6cb" (url:GET /Modules/FlipMag.swf, time:1016924 msec, client:80.63.228.160, stage:ExecuteRequestHandler, module:ApplicationRequestRouting)
REQUEST "8800004180003a2c" (url:GET /Modules/ClassicShopInterface.swf, time:1015270 msec, client:93.161.100.127, stage:ExecuteRequestHandler, module:ApplicationRequestRouting)
REQUEST "580000c080010d78" (url:GET /Modules/FlipMag.swf, time:983602 msec, client:93.178.137.134, stage:ExecuteRequestHandler, module:ApplicationRequestRouting)
I've enabled FREB on the ARR machine and sure enough I get a number of 502.3 errors. I've looked at one in particular:
<failedRequest url="http://xxx.yyy:80/ViewFile.ashx?FileID=10599"
siteId="1"
appPoolId="DefaultAppPool"
processId="1680"
verb="GET"
remoteUserName=""
userName=""
tokenUserName="NT AUTHORITY\IUSR"
authenticationType="anonymous"
activityId="{00000000-0000-0000-8DDA-008080010097}"
failureReason="STATUS_CODE"
statusCode="502.3"
triggerStatusCode="502.3"
timeTaken="32870"
xmlns:freb="http://schemas.microsoft.com/win/2006/06/iis/freb"
>
Further into the log I can validate that the request is being routed to the newly added machine called CAESARS:
<EventData>
<Data Name="ContextId">{00000000-0000-0000-8DDA-008080010097}</Data>
<Data Name="RoutingReason">2</Data>
<Data Name="Server">CAESARS</Data>
<Data Name="State">0</Data>
<Data Name="TotalRequests">7889</Data>
<Data Name="CurrentRequests">17</Data>
<Data Name="BytesSent">5901518</Data>
<Data Name="BytesReceived">298776461</Data>
<Data Name="ResponseTime">135</Data>
</EventData>
However, if I review the current requests on CAESARS, there are no outstanding ones, and all is perfect in the logs. Also, CAESARS does not have any data on the above request in it's server wide IIS log.
It would seem that the ARR machine is routing the request to the content server but the content server doesn't receive it for some reason. Currently I've got 86 failed requests out of 15k served (it's running at a weight of 5% to avoid too much disturbance), so it's far from all requets that fail, but certainly too many - especially as they interfere with the load balancing algorithm.
I'm eager to hear any suggestions on what I can do to debug the issue. I've got FREB files available as well as IIS logs for the content servers. With a FREB file, I don't suppose an IIS log of the ARR machine makes sense.
EDIT: CAESARS is running Server 2008 R2 Web x64 while the others are Server 2008 Web x64's, but besides IIS7 vs 7.5 I don't think it should matter.
Thanks!