Friday, August 26, 2011

What to Dispose in Oracle ODP.NET


We recently began running into issues with performance on our IIS server.  As described in a prior post worker thread processes would crash or become unresponsive.  Our research found that it was our code that was the cause.  In fact it was due to how we were handling disposing of Oracle connection and especially command objects within the Oracle Data Provider for .NET (ODP.NET).  As an aside, how many times does one assume that it is outside of your code but time and time again it is your stuff that is causing the problem and not something else - certainly a topic for a future post.
It wasn't as if our original code totally ignored disposing of ODP.NET objects.  In fact we had been pretty consistent with how we disposed of the Oracle connection and command objects.  The example below taken from one of our data services (modified slightly to show all calls/etc.) illustrates how we were previously dealing with "cleaning up" after using the ODP.NET.
Public Shared Function GetDevicesByType(ByVal AConnectString as String, ByVal ADeviceType As Integer) As List (Of Device)   Dim lDeviceList As List (Of Device) = New List (of Device)   Dim lReturnOraParam As OracleParameter  Dim lDbConn As New OracleConnection(AConnectString)  Dim lOraCmd As New OracleCommand  Dim lOraDataAdapt As New OracleDataAdapter  Dim lOraDataTab As New DataTable  Try     lDbConn.Open()     lDbConn.Connection = lDbConn     lDbConn.CommandType = CommandType.StoredProcedure     lDbConn.CommandText = "get_device_plsql.get_device"      lReturnOraParam = lDbConn.Parameters.Add("rtn_code", OracleDbType.Int32, 10, ParameterDirection.ReturnValue)      lDbConn.Parameters.Add("device_type", OracleDbType.Int32, 50, ADeviceType, ParameterDirection.Input)     lDbConn.Parameters.Add("device_rc", OracleDbType.RefCursor, ParameterDirection.Output)      ' call plsql     lDbConn.ExecuteNonQuery()      lOraDataAdapt.SelectCommand = lDbConn     lOraDataAdapt.Fill(lOraDataTab)      For Each oRow As DataRow In lOraDataTab.Rows        AddDevice(lDeviceList, oRow)     Next   Catch ex As Exception      Throw New Exception("Get Device Failure: ", ex)   Finally      lDbConn = Nothing      If Not lOraDataTab Is Nothing Then        lOraDataTab.Dispose()     End If     lOraDataTab = Nothing      If Not lOraDataAdapt Is Nothing Then        lOraDataAdapt.Dispose()    End If    lOraDataAdapt = Nothing     If Not lDbConn Is Nothing AndAlso lDbConn.State = ConnectionState.Open Then       lDbConn.Close()       lDbConn.Dispose()   End If   lDbConn = Nothing  End Try   Return lDeviceList End Function
There however was one important set of objects that we did not properly dispose of the parameters and more importantly their values.  We made a decision at the outset of our last rewrite of our front-end to build our data layer retrieving data using a REF CURSOR.
For Each p As OracleParameter In cmd.Parameters  If TypeOf p Is IDisposable Then    If TypeOf p.Value Is IDisposable Then      DirectCast(p.Value, IDisposable).Dispose()    End If    DirectCast(p, IDisposable).Dispose()  End If Next
The importance to making sure all of the parameters and their values are properly disposed is because the ODP uses both managed and unmanaged code.  It is the unamanaged code that requires an explicit call to Dispose so that it is not left up to the mercy of the garbage collector.

Sunday, August 21, 2011

Troubleshooting IIS7 Worker Thread Crashes (w3wp.exe) with a FaultingModule of ntdll.dll


Over the past several months we have begun seeing unexpected crashes with IIS7 and specifically the w3wp.exe process.  Unfortunately for us when this happens it of course takes down the associated application pool and with it an extensive cache that we use to optimize delivery of data to our clients.  Upon restart, the application must rebuild this cache which takes time and prevents our users from using the application until its cache is rebuilt.  Obviously any time something like this happens it is a big deal, but because of the time to rebuild our application's cache it heightened the urgency to try to determine the cause.
We started with the Windows event log to look for application errors.  As plain as day there was an error that in fact pointed to the w3wp.exe process dying with the faulting module being ntdll.dll.  The process termination code was 0xc0000374 or STATUS_HEAP_CORRUPTION.  The description provided by Microsoft in their documentation for the code did not provide much help stating simply "A heap has been corrupted".  And while we were able to at least determine that there was a culprit, there wasn't a smoking gun pointing to what process or more importantly what code was at fault.

Evtviewersmall

Doing some additional research, there are many varying explanations for what might cause the heap to become corrupted.  While reading what others had to say, I kept thinking how is this even possible from within an ASP.NET web application?  To better understand it let's discuss heap corruption in general.  Common causes are what you think they would be - buffer overrun (writing outside of your addressable memory space), a double free (attempting to free a pointer twice) or attempting to re-use already freed objects.  What also appeared a bit odd to me was that in Windows it is quite possible for a block in the heap for a process to be corrupted and its execution to continue.  It is not until the process hits the corrupted block in its heap that it crashes also making it a bit more tricky to troubleshoot.
Okay, but I still have not even addressed how in the world this could be affecting our ASP.NET web application.  Maybe I am in the minority but my initial instincts were that it was something we were doing that was causing the IIS worker process to crash.  In an interesting blog entry Microsoft's David Wang when describing the origin for their Debug Diagnostic tool (more about that later) tries to address developers urge to put the blame elsewhere by giving the following statistic from their support of IIS:
PSS statistics show that over 90% of their "IIS is crashing/hanging" cases resolve to 3rd party code doing the crashing/hanging. Just something for you to think about...
From the scouring the web, I found two examples of what we might be doing that would cause this sort of corruption.  The first type of scenario more or less aligned with what you might expect.  We might in our code be attempting to dispose of an object that had already been freed or trying to access an object that has already been disposed of or has not yet been created/allocated.  Typically though when this occurs you see exceptions such as "Object reference not set to an instance of an object." so while it was a possibility it was still safe to assume its ability to cause the w3wp process to crash would be remote.

The other scenario that I found described was a case where you might have a runaway process that is caught in some infinite loop or non-terminal recursion logic.  Again, this scenario seemed a bit of a reach for our particular issue.  While it was possible that certain conditions in production might cause this runaway doomsday scenario it didn't stand to reason it would have not manifested itself during our unit testing or with our QA group.  Regardless of the cause, the advice to find a solution was  try to obtain a stack trace.  I had to go to many different sources to compile a series of troubleshooting steps that would get me to a stack trace.  The remainder of this post will detail the step-by-step approach I used to complete my quest to obtain a stack trace and find the offending cuprit.

Whenever I start out troubleshooting an issue that is only occuring in production I always try to start with the least intrusive method I can to minimize the impact (e.g., not having to install any other software on our production web servers/etc.).  One such method I found was using the Windows Error Reporting (WER) Console to look back at past crashes to attempt to retrieve an associated stack trace file and to see if there is a solution (e.g., hotfix) from Microsoft.  The console can be launched from the command line by typing WERCON.

Wercon
Clicking on the "View Problem History" will display a list of problems (i.e., in our case crashes) that Windows has identified.  The most recent process was an issue with an IIS Worker Process and had a date/time stamp that matched the application error in the event log.

Wercon-procs
Double-clicking on the process provides even more detail on the issue pointing to a crash of the w3wp.exe process.  A few interesting pieces of information can be obtained from the details.  This particular server was not setup to capture stack traces (i.e., dump files) when its IIS worker processes crashes (otherwise these would have been listed in the "Files that help descript the problem" section).  This probably points to a configuration / setup with the server.  More information on how to configure this can be found here.  For the time being, agreing with my make no changes mantra, I opted not to set this up.

Wercon-proc-detail

Also, there was not a specific DLL or process referenced in the "Fault Module Name" field. I would have expected ntdll.dll to be referenced but instead if refers to StackHash_029a.  An in-depth explaniation for the StackHash reference can be found at this link and I think the following their description below is solid:
Therefore, StackHash is not a real module. It is a constructed name because the Instruction Pointer was pointing to no known module at the time of the crash. The number after the StackHash_ is a semi-unique number calculated at the time of the crash such that if the same crash occurred on multiple PCs then they have a reasonable chance to be correlated.
One other interesting aspect of their post is that it described potential causes that might be outside of our code (to include the 3rd party DLLs that we interact with such as Oracle's ODP.NET).  An entirely new set of culprits such as other COTS software installed on our production web servers (in particular security and virus scanning software), a virus, faulty hardware whether it is a HD or memory could be causing the IIS worker processes to crash.  A bit unerving, but as other reserach had pointed out the folks at StackHash also recommended to start with a stack trace.

So now that we had reached a point of still not having a stack trace, advice for the next step all seemed to point towards installing Microsoft's Debug Diagnostic (DebugDiag) tool.  It can be downloaded from here. In addition to providing the ability to capture a stack trace when one of your IIS worker process crashes, the tool provides a slew of other diagnostic capabilities that in general let you evaluate slow/hanging processes and potential memory leaks.  Below is the almost stupidly easy sequence to add a rule to track when an IIS related process/thread dies.
Crash-rule-all
While installing and setting up the DebugDiag it occurred to me that it might be more useful to track performance issues related to the responsiveness of our applicatins and IIS.  Again, adding a rule from the tool proved very simple with the most difficult issue being able to find a resting place large enough for the dumps that are produced.  We selected the default timeout of two minutes to trigger a dump as none of our requests should take longer than that to complete. 
Perf-rule-all
While our IIS thread crashes were intermittant, it actually did not take very long for the peformance rule to trigger.  Once we had captured a set of dump files, the DebugDiag tool provides a way to initiate an analysis of the dump file. 
Debugdiag-perf-anal
For us, the full dump when produced (required an update to rule as this was not set by default in sequence above) was much more interesting.  We in fact were able to narrow down our search to Oracle's Data Access Components (ODAC).  In particular we determined that when requests did hang they tended to do so in the finalize/dispose methods within the ODAC.  This at least provided us a clue where our issues were within our application and initiated a new troubleshooting effort focused on our data layer libraries that integrate w/ the ODAC (to be detailed in a future post).
00000000`04ebef40 000007ff`002c3039 Oracle_DataAccess!Oracle.DataAccess.Types.OracleRefCursor.Dispose(Boolean)+0x1a3 00000000`04ebf090 000007fe`f85b14a6 Oracle_DataAccess!Oracle.DataAccess.Types.OracleRefCursor.Finalize()+0x19 00000000`04ebf0d0 000007fe`f8448b61 mscorwks!FastCallFinalizeWorker+0x6 00000000`04ebf100 000007fe`f8448dac mscorwks!FastCallFinalize+0xb1 00000000`04ebf160 000007fe`f844d8a4 mscorwks!MethodTable::CallFinalizer+0xfc 00000000`04ebf1a0 000007fe`f84ba58b mscorwks!SVR::CallFinalizer+0x84 00000000`04ebf200 000007fe`f84ba32b mscorwks!SVR::DoOneFinalization+0xdb 00000000`04ebf2d0 000007fe`f856a75b mscorwks!SVR::FinalizeAllObjects+0x9b 00000000`04ebf390 000007fe`f8509374 mscorwks!SVR::FinalizeAllObjects_Wrapper+0x1b 00000000`04ebf3c0 000007fe`f8402045 mscorwks!ManagedThreadBase_DispatchInner+0x2c 00000000`04ebf410 000007fe`f8516139 mscorwks!ManagedThreadBase_DispatchMiddle+0x9d 00000000`04ebf4e0 000007fe`f83cc985 mscorwks!ManagedThreadBase_DispatchOuter+0x31 00000000`04ebf520 000007fe`f850ef1f mscorwks!ManagedThreadBase_DispatchInCorrectAD+0x15 00000000`04ebf550 000007fe`f87f54f1 mscorwks!Thread::DoADCallBack+0x12f 00000000`04ebf6b0 000007fe`f84ba680 mscorwks!ManagedThreadBase_DispatchInner+0x2ec1a9 00000000`04ebf700 000007fe`f84ba32b mscorwks!SVR::DoOneFinalization+0x1d0 00000000`04ebf7d0 000007fe`f84f2ebd mscorwks!SVR::FinalizeAllObjects+0x9b 00000000`04ebf890 000007fe`f8509374 mscorwks!SVR::GCHeap::FinalizerThreadWorker+0x9d 00000000`04ebf8d0 000007fe`f8402045 mscorwks!ManagedThreadBase_DispatchInner+0x2c 00000000`04ebf920 000007fe`f8516139 mscorwks!ManagedThreadBase_DispatchMiddle+0x9d 00000000`04ebf9f0 000007fe`f853cf1a mscorwks!ManagedThreadBase_DispatchOuter+0x31 00000000`04ebfa30 000007fe`f850f884 mscorwks!ManagedThreadBase_NoADTransition+0x42 00000000`04ebfa90 000007fe`f85314fc mscorwks!SVR::GCHeap::FinalizerThreadStart+0x74 00000000`04ebfad0 00000000`779cbe3d mscorwks!Thread::intermediateThreadProc+0x78 00000000`04ebfba0 00000000`77b06a51 kernel32!BaseThreadInitThunk+0xd 00000000`04ebfbd0 00000000`00000000 ntdll!RtlUserThreadStart+0x21


Sunday, August 7, 2011

Mogrify Me

WebsiteIn my copious amounts of free time I maintain a website for our twin girls.  Although it is now dedicated to my children, it was created years ago before the proliferation of photo sharing sites as a place where I could post pictures of my wife and myself travelling.  I use a PHP based photoblogging package called PixelPost (http://www.pixelpost.org/) and in general like having the freedom to control how the website is locked down (I opted for minimalist approach and use basic authentication in Apache) ensuring photos of the kids are not readily accessible to the all powerful Google and others (naieve thinking I suppose but maybe being a very small fish in a large pond has its advantages).  In retrospect if I had to do it over again perhaps I would have become an early adopter of an online service like Flickr or others but it serves its purpose and my relatives are hooked on using it.
Fast forward to this past Christmas where I jumped headfirst into the wonderful world of digital SLRs.  After quite a bit of research I purchased a Canon EOS REBEL T2i because at the time it was the only model in its class that had full HD video.  Mainly though I just wanted it for its rediculously fast FPS ensuring I wouldn't miss "the perfect photo" with the twins running around at full speed.  At the same time I purchased the SLR, I also bought an Eye-Fi SD Card (http://www.eye.fi/) for the camera.
The capabilities in the Eye-Fi card presented a unique opportunity to play towards my laziness with the twins website.  The greatest chore I found with the site was taking the SD Card from my camera out and copying up whatever photos I wanted to publish.  This card would remove that step and with a bit of extra work I could automate the whole process (as a side note, the Eye-Fi also has an online sharing feature that allows you to integrate with Flickr, Facebook, Picassa and MobileMe).  Using the wireless ftp feature of the card I was able to setup a transfer of images I protected on the camera to my Mini-Mac.  I then created a few cron scripts that take the uploaded images and push them to the twins website.

This is where things ended up getting interesting.  Along with storing the full image for backup/recovery purposes, I wanted to be able to script an automatic resize of the image to 800x600 upon upload as well as rotate it if needed (i.e., if it was taken as a portrait) and create a thumbnail image (for the PixelPost Admin interface).   Using the ImageMagick package (and specifically the mogrify command) this in fact was pretty painless.  Again, my whole goal in this was to make a completely automated process from the point I protect the image on my camera to when the images are published on the twins website.
Resizing was as easy as issuing the following command:
mogrify -resize 800x600 $destimg
Rotating was just a bit harder as I used the following PHP script I created based on source in PixelPost to extract out the EXIF data for the image:
<!--?php date_default_timezone_set('UTC');$img = "$argv[1]";$myorient = copy_folder_get_exif_orient($img);function copy_folder_get_exif_orient($image_name){ if (file_exists("/var/www/html/twinswebsite/includes/exifer1_5/exif.php")) require_once("/var/www/html/twinswebsite/includes/exifer1_5/exif.php"); $curr_image = $image_name; if (function_exists('read_exif_data_raw')) { $exif_result = read_exif_data_raw($curr_image,"0"); $orient = $exif_result['IFD0']['Orientation']; // Date and Time if ($orient=='') $orient = 'bad'; } else $orient = "bad"; return $orient;}?>
From the extracted EXIF data, it was possible to parse the orientation of the image:
theData=`php /var/www/html/twinswebsite/admin/ext_exif_orient.php $destimg`theDegrees=`echo "$theData" | awk '{print $1}'`
If the orientation of the image is portrait, the script is able to rotate the image as appropriate:
if [ $theDegrees == "Normal" ]then echo "$fullimg is landscape no rotation necessary"else theDirection=`echo "$theData" | awk '{print $3}'` if [ $theDirection == "CW" ] then mogrify -rotate 270 $destimg mogrify -rotate 270 $thumbimg else mogrify -rotate 90 $destimg mogrify -rotate 90 $thumbimg fifi
It is amazing sometimes how sheer lazyness pays off in the end.  Now instead of images sitting on my camera, I really do not have any excuse not to find images I want to publish, protect them, and have them be published automatically.  The entire process end-to-end takes only a few minutes from the time I protect an image to when it is published on my PixelPost photoblog.
-->