Over the past several months we have begun seeing unexpected crashes with IIS7, specifically in the w3wp.exe process. When this happens it of course takes down the associated application pool and with it an extensive cache that we use to optimize delivery of data to our clients. Upon restart, the application must rebuild this cache, which takes time and prevents our users from using the application until the rebuild completes. Any crash like this is a big deal, but the time it takes to rebuild our application's cache heightened the urgency to determine the cause.
We started with the Windows event log to look for application errors. As plain as day there was an error pointing to the w3wp.exe process dying, with the faulting module being ntdll.dll. The process termination code was 0xc0000374, or STATUS_HEAP_CORRUPTION. The description provided in Microsoft's documentation for the code did not help much, stating simply "A heap has been corrupted". So while we were able to at least determine that there was a culprit, there wasn't a smoking gun pointing to what process, or more importantly what code, was at fault.

Doing some additional research, I found many varying explanations for what might cause the heap to become corrupted. While reading what others had to say, I kept thinking: how is this even possible from within an ASP.NET web application? To better understand it, let's discuss heap corruption in general. Common causes are what you would expect: a buffer overrun (writing past the bounds of an allocated block), a double free (attempting to free the same pointer twice), or attempting to re-use already freed memory. What also appeared a bit odd to me was that in Windows it is quite possible for a block in a process's heap to be corrupted while execution continues; the process does not crash until it later touches the corrupted block, which makes the problem that much trickier to troubleshoot.
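To make that concrete in .NET terms, here is a contrived C# sketch (my own illustration, not anything from our codebase) of those same three patterns expressed through native interop, which is about the only way a managed web application gets to damage the native heap directly:

using System;
using System.Runtime.InteropServices;

// Contrived illustrations of the three classic heap-corruption patterns.
// None of these reliably throw a managed exception; they silently damage the
// native heap, and the process may not crash until the heap manager later
// trips over the damage.
static class HeapCorruptionSketch
{
    // 1. Buffer overrun: write more bytes than were allocated.
    static void BufferOverrun()
    {
        IntPtr buffer = Marshal.AllocHGlobal(16);
        byte[] data = new byte[64];
        Marshal.Copy(data, 0, buffer, data.Length); // writes 64 bytes into a 16-byte block
        Marshal.FreeHGlobal(buffer);
    }

    // 2. Double free: release the same native block twice.
    static void DoubleFree()
    {
        IntPtr buffer = Marshal.AllocHGlobal(16);
        Marshal.FreeHGlobal(buffer);
        Marshal.FreeHGlobal(buffer); // second free corrupts the heap's bookkeeping
    }

    // 3. Use after free: write through a pointer that has already been released.
    static void UseAfterFree()
    {
        IntPtr buffer = Marshal.AllocHGlobal(16);
        Marshal.FreeHGlobal(buffer);
        Marshal.WriteInt32(buffer, 42); // the block may already belong to something else
    }
}

In a purely managed application this kind of damage generally comes from interop or from 3rd party components that wrap native libraries, which is worth keeping in mind for later.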
Okay, but I still have not even addressed how in the world this could be affecting our ASP.NET web application. Maybe I am in the minority, but my initial instinct was that it was something we were doing that was causing the IIS worker process to crash. In an interesting blog entry, Microsoft's David Wang, while describing the origin of their Debug Diagnostic tool (more about that later), tries to address developers' urge to put the blame elsewhere by giving the following statistic from their support of IIS:
PSS statistics show that over 90% of their "IIS is crashing/hanging" cases resolve to 3rd party code doing the crashing/hanging. Just something for you to think about...
From scouring the web, I found two examples of what we might be doing that would cause this sort of corruption. The first scenario more or less aligned with what you might expect: our code might be attempting to dispose of an object that had already been freed, or trying to access an object that has already been disposed of or has not yet been created/allocated. Typically, though, when this occurs you see exceptions such as "Object reference not set to an instance of an object.", so while it was a possibility, it was safe to assume the chance of it crashing the w3wp process was remote.
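For what it's worth, the only way I could see a double dispose corrupting the heap rather than throwing a friendly managed exception is when Dispose wraps a native resource and is not written to be idempotent. The following is a hypothetical sketch of that anti-pattern (and its safer counterpart), not code from our application:

using System;
using System.Runtime.InteropServices;

// Hypothetical wrapper around a native buffer whose Dispose is NOT idempotent.
// Disposing it twice, or letting the finalizer run after an explicit Dispose,
// frees the same native block twice -- heap corruption, not a managed exception.
class LeakyNativeBuffer : IDisposable
{
    private IntPtr _buffer = Marshal.AllocHGlobal(256);

    public void Dispose()
    {
        // Anti-pattern: no guard and no GC.SuppressFinalize, so a second call
        // (or the finalizer below) frees _buffer again.
        Marshal.FreeHGlobal(_buffer);
    }

    ~LeakyNativeBuffer()
    {
        Marshal.FreeHGlobal(_buffer); // runs on the finalizer thread after Dispose already freed it
    }
}

// The safer version guards against the double free and suppresses finalization.
class SafeNativeBuffer : IDisposable
{
    private IntPtr _buffer = Marshal.AllocHGlobal(256);

    public void Dispose()
    {
        if (_buffer != IntPtr.Zero)
        {
            Marshal.FreeHGlobal(_buffer);
            _buffer = IntPtr.Zero;
        }
        GC.SuppressFinalize(this);
    }

    ~SafeNativeBuffer()
    {
        if (_buffer != IntPtr.Zero) Marshal.FreeHGlobal(_buffer);
    }
}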
The other scenario I found described was a runaway process caught in an infinite loop or non-terminating recursion. Again, this seemed a bit of a reach for our particular issue. While it was possible that certain conditions in production might trigger this runaway doomsday scenario, it didn't stand to reason that it would never have manifested itself during our unit testing or with our QA group. Regardless of the cause, the advice for finding a solution was the same: obtain a stack trace. I had to go to many different sources to compile a series of troubleshooting steps that would get me there. The remainder of this post details the step-by-step approach I used to complete my quest to obtain a stack trace and find the offending culprit.
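As a point of reference for the runaway scenario, even pure managed code can take down w3wp.exe this way: since .NET 2.0 a StackOverflowException cannot be caught, so something as small as the hypothetical property typo below will terminate the worker process the moment it is hit.

class OrderViewModel
{
    private decimal _total;

    // Hypothetical typo: the getter returns the property instead of the backing
    // field, so reading Total recurses forever and throws an uncatchable
    // StackOverflowException that tears down w3wp.exe.
    public decimal Total
    {
        get { return Total; }   // should be: return _total;
        set { _total = value; }
    }
}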
Whenever I start out troubleshooting an issue that is only occurring in production, I always try to start with the least intrusive method I can to minimize the impact (e.g., not having to install any other software on our production web servers). One such method I found was using the Windows Error Reporting (WER) Console to look back at past crashes, attempt to retrieve an associated stack trace file, and see if there is a solution (e.g., a hotfix) from Microsoft. The console can be launched from the command line by typing WERCON.
Clicking on "View Problem History" displays a list of problems (i.e., in our case crashes) that Windows has identified. The most recent entry was an issue with an IIS Worker Process and had a date/time stamp that matched the application error in the event log.
Double-clicking on the entry provides even more detail, pointing to a crash of the w3wp.exe process. A few interesting pieces of information can be obtained from the details. This particular server was not set up to capture stack traces (i.e., dump files) when its IIS worker processes crash (otherwise these would have been listed in the "Files that help describe the problem" section). This probably points to a configuration/setup issue with the server. More information on how to configure this can be found
here. For the time being, sticking with my "make no changes" mantra, I opted not to set this up.
Also, there was not a specific DLL or process referenced in the "Fault Module Name" field. I would have expected ntdll.dll to be referenced, but instead it refers to StackHash_029a. An in-depth explanation of the StackHash reference can be found at this
link, and I think their description below is solid:
Therefore, StackHash is not a real module. It is a constructed name because the Instruction Pointer was pointing to no known module at the time of the crash. The number after the StackHash_ is a semi-unique number calculated at the time of the crash such that if the same crash occurred on multiple PCs then they have a reasonable chance to be correlated.
One other interesting aspect of their post is that it described potential causes that might be outside of our code (including the 3rd party DLLs we interact with, such as Oracle's ODP.NET). An entirely new set of culprits could be crashing the IIS worker processes: other COTS software installed on our production web servers (in particular security and virus scanning software), a virus, or faulty hardware, whether a hard drive or memory. A bit unnerving, but as other research had pointed out, the folks at StackHash also recommended starting with a stack trace.
So, still without a stack trace, all advice for the next step seemed to point towards installing Microsoft's Debug Diagnostic (DebugDiag) tool. It can be downloaded from
here. In addition to capturing a stack trace when one of your IIS worker processes crashes, the tool provides a slew of other diagnostic capabilities that let you evaluate slow/hanging processes and potential memory leaks. The sequence to add a rule that tracks when an IIS-related process/thread dies is almost stupidly easy.
While installing and setting up DebugDiag, it occurred to me that it might be more useful to track performance issues related to the responsiveness of our applications and IIS. Again, adding a rule in the tool proved very simple, with the most difficult part being finding a resting place large enough for the dumps that are produced. We selected the default timeout of two minutes to trigger a dump, as none of our requests should take longer than that to complete.
While our IIS worker process crashes were intermittent, it did not take very long for the performance rule to trigger. Once we had captured a set of dump files, the DebugDiag tool provided a way to initiate an analysis of them.
For us, the full dump, when produced (this required an update to the rule, as it was not set by default in the sequence above), was much more interesting. We were in fact able to narrow our search down to Oracle's Data Access Components (ODAC). In particular, we determined that when requests did hang, they tended to do so in the finalize/dispose methods within the ODAC, as the stack below shows. This at least gave us a clue as to where the issues were within our application and initiated a new troubleshooting effort focused on our data layer libraries that integrate with the ODAC (to be detailed in a future post).
00000000`04ebef40 000007ff`002c3039 Oracle_DataAccess!Oracle.DataAccess.Types.OracleRefCursor.Dispose(Boolean)+0x1a3
00000000`04ebf090 000007fe`f85b14a6 Oracle_DataAccess!Oracle.DataAccess.Types.OracleRefCursor.Finalize()+0x19
00000000`04ebf0d0 000007fe`f8448b61 mscorwks!FastCallFinalizeWorker+0x6
00000000`04ebf100 000007fe`f8448dac mscorwks!FastCallFinalize+0xb1
00000000`04ebf160 000007fe`f844d8a4 mscorwks!MethodTable::CallFinalizer+0xfc
00000000`04ebf1a0 000007fe`f84ba58b mscorwks!SVR::CallFinalizer+0x84
00000000`04ebf200 000007fe`f84ba32b mscorwks!SVR::DoOneFinalization+0xdb
00000000`04ebf2d0 000007fe`f856a75b mscorwks!SVR::FinalizeAllObjects+0x9b
00000000`04ebf390 000007fe`f8509374 mscorwks!SVR::FinalizeAllObjects_Wrapper+0x1b
00000000`04ebf3c0 000007fe`f8402045 mscorwks!ManagedThreadBase_DispatchInner+0x2c
00000000`04ebf410 000007fe`f8516139 mscorwks!ManagedThreadBase_DispatchMiddle+0x9d
00000000`04ebf4e0 000007fe`f83cc985 mscorwks!ManagedThreadBase_DispatchOuter+0x31
00000000`04ebf520 000007fe`f850ef1f mscorwks!ManagedThreadBase_DispatchInCorrectAD+0x15
00000000`04ebf550 000007fe`f87f54f1 mscorwks!Thread::DoADCallBack+0x12f
00000000`04ebf6b0 000007fe`f84ba680 mscorwks!ManagedThreadBase_DispatchInner+0x2ec1a9
00000000`04ebf700 000007fe`f84ba32b mscorwks!SVR::DoOneFinalization+0x1d0
00000000`04ebf7d0 000007fe`f84f2ebd mscorwks!SVR::FinalizeAllObjects+0x9b
00000000`04ebf890 000007fe`f8509374 mscorwks!SVR::GCHeap::FinalizerThreadWorker+0x9d
00000000`04ebf8d0 000007fe`f8402045 mscorwks!ManagedThreadBase_DispatchInner+0x2c
00000000`04ebf920 000007fe`f8516139 mscorwks!ManagedThreadBase_DispatchMiddle+0x9d
00000000`04ebf9f0 000007fe`f853cf1a mscorwks!ManagedThreadBase_DispatchOuter+0x31
00000000`04ebfa30 000007fe`f850f884 mscorwks!ManagedThreadBase_NoADTransition+0x42
00000000`04ebfa90 000007fe`f85314fc mscorwks!SVR::GCHeap::FinalizerThreadStart+0x74
00000000`04ebfad0 00000000`779cbe3d mscorwks!Thread::intermediateThreadProc+0x78
00000000`04ebfba0 00000000`77b06a51 kernel32!BaseThreadInitThunk+0xd
00000000`04ebfbd0 00000000`00000000 ntdll!RtlUserThreadStart+0x21
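While the full fix belongs to that future post, the stack above at least suggests a direction: the OracleRefCursor was being cleaned up on the GC's finalizer thread rather than deterministically by our code. Below is a minimal sketch of the pattern we began moving toward, assuming a data layer built on ODP.NET (the stored procedure and parameter names are made up for illustration):

using System.Data;
using Oracle.DataAccess.Client;
using Oracle.DataAccess.Types;

class CustomerRepository
{
    private readonly string _connectionString;

    public CustomerRepository(string connectionString)
    {
        _connectionString = connectionString;
    }

    public DataTable GetCustomers()
    {
        // Dispose everything deterministically so cleanup happens here, on our
        // request thread, instead of later on the GC finalizer thread shown in
        // the stack above.
        using (OracleConnection connection = new OracleConnection(_connectionString))
        using (OracleCommand command = new OracleCommand("PKG_CUSTOMERS.GET_ALL", connection))
        {
            command.CommandType = CommandType.StoredProcedure;
            command.Parameters.Add("p_cursor", OracleDbType.RefCursor, ParameterDirection.Output);

            connection.Open();
            command.ExecuteNonQuery();

            DataTable results = new DataTable();
            using (OracleRefCursor cursor = (OracleRefCursor)command.Parameters["p_cursor"].Value)
            using (OracleDataReader reader = cursor.GetDataReader())
            {
                results.Load(reader);
            }
            return results;
        }
    }
}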