From the Fiefdom: 2011

Wednesday, September 28, 2011

VB.NET Stupidisms - IsNot and AndAlso

Let's take the IsNot operator that will let you write:
If Not lMyVar is Nothingthen ...do something... end if
as
If lMyVar IsNot Nothingthen ...do something... end if
In this particular case there hasn't been any signfiicant performance or recommendation to use either form except that the IsNot operator makes the code more readable (IMHO). Wikipedia had a funny reference to a patent application that Microsoft had filed on the operator. While the example code above depicts using a comparison against Nothing the comparison can be done against any reference.
A more important stupidism that I thought worth posting has to do with how Visual Basic implements short-circuiting in boolean comparisons through the use of the AndAlso operator. This allows for bypassing of future operands in the expression if a prior one determines the outcome of the evaluation. For example, if all of the operands in an IF/THEN are evaluating to see whether the result is true, if any of hte operands return false, there is no need to continue evaluating the rest.
For example, in the evaluation below, both the first and second operand are evaluated in order to determine the result. This results in two operatins/comparisons being made even if the first operand returns false (e.g., lValue = 10).
If lValue > 50 and lValue < 100 Then
Using a short-circuited evaluation with the AndAlso operator avoids this as follows:
If lValue > 50 AndAlso lValue < 100 Then
A friend reminded me that Basic has been this way forever and in fact Microsoft introduced the AndAlso operator to address this shortcoming. He also scolded me for doing too much C++ development as of late. Wikipedia has a great reference that includes a table (included in this posting) outlining how various languages provide for short-circuited boolean evaluations.

Language	Eager operators	Short-circuit operators	Result type
ABAP	none	`and` , `or`	Boolean¹
Ada, Eiffel	`and` , `or`	`and then` , `or else`	Boolean
ALGOL 68	and , & , ∧ ; or , ∨	andf , orf (both user defined)	Boolean
C²	none	`&&` , `\|\|`, `?`^[1]	Numeric (`&&`,`\|\|`), opnd-dependent (`?`)
C++	`&`, `\|`	`&&` , `\|\|`, `?`^[2]	Boolean (`&&`,`\|\|`), opnd-dependent (`?`)
Go, Objective Caml, Haskell	none	`&&` , `\|\|`	Boolean
C#, Java, MATLAB,^[3] R	`&` , `\|`	`&&` , `\|\|`	Boolean
ColdFusion	none	`AND` , `OR` , `&&` , `\|\|`	Boolean
Erlang	`and`, `or`	`andalso` , `orelse`	Boolean
Fortran	`.and.` , `.or.`		Boolean
JavaScript	none	`&&` , `\|\|`	Last value
Lisp, Lua, Scheme	none	`and` , `or`	Last value
Modula-2	none	`AND` , `OR`	Boolean
Oberon	none	`&` , `OR`	Boolean
Pascal	`and`, `or`³	`and_then` , `or_else`⁴	Boolean
Perl, Ruby	`&` , `\|`	`&&` , `and` , `\|\|` , `or`	Last value
PHP	none	`&&` , `and` , `\|\|` , `or`	Boolean
Python	none	`and` , `or`	Last value
Smalltalk	`&` , `\|`	`and:` , `or:`	Boolean
Standard ML	Unknown	`andalso` , `orelse`	Boolean
Visual Basic .NET	`And` , `Or`	`AndAlso` , `OrElse`	Boolean
VB Script, VB Classic, VBA	`And` , `Or`	`Select Case`	Numeric

Thursday, September 1, 2011

VB.NET Stupidisms - DirectCast over CType

I've been diving a bit more into vb.net development recently and decided for my own sanity it might be worthwhile to post some stupid simple notes on what I am learning as I go along.
The first has to do with using DirectCast instead of CType. The easy answer is that DirectCast performance is marketably better. The restriction is that you need to ensure you are essentially doing a cast as opposed to a conversion.
The better and longer answer that I'm going to quote directly from can be found in the repsonse to a question posted here.

Here's a quick outline of the differences and recommendations of where you
should ctype and directcast (IHMO anyway)
1) CType is capable of a *cast* or a *conversion*. DirectCast can only
*cast*
By "conversion" I mean converting one datatype to another (e.g. string to
integer, decimal to integer, object to string etc).
By "cast" I mean changing one type of object into another type that is
related to it by one of the following rules:
a) The type you're converting the object to must be the same type
e.g.
--------------
Dim a as String = "hello"
Dim b as Object = a
Dim c as string = directcast(b, String)
--------------
Variable b is an object that holds a string, so you can cast it to a string.
b) If converting to an interface, the type you're converting must implement
the interface
e.g.
---------------------
Dim a as New MyInterfaceObejct
Dim b as IInterface = directcast(a, IInterface)
----------------------
c) If converting to a derived type, the runtime type of the object must be
the derived type or one of it's own derived types :S
e.g.
----------------------
Dim a as Base = New Derived
Dim b as Derived = directcast(a, Derived)
----------------------
2) Use directcast whenever a "type relationship" exists - it is slightly
faster (in some cases anyway), but forces you to be more aware of
conversions that are going on.
3) Use Ctype when a type relationship doesn't exist, but a value
relationship does (e.g. converting the string "123" to the integer 123)

Hopefully this particular description was is as helpful to others as it was for me.

Friday, August 26, 2011

What to Dispose in Oracle ODP.NET

We recently began running into issues with performance on our IIS server. As described in a prior post worker thread processes would crash or become unresponsive. Our research found that it was our code that was the cause. In fact it was due to how we were handling disposing of Oracle connection and especially command objects within the Oracle Data Provider for .NET (ODP.NET). As an aside, how many times does one assume that it is outside of your code but time and time again it is your stuff that is causing the problem and not something else - certainly a topic for a future post.
It wasn't as if our original code totally ignored disposing of ODP.NET objects. In fact we had been pretty consistent with how we disposed of the Oracle connection and command objects. The example below taken from one of our data services (modified slightly to show all calls/etc.) illustrates how we were previously dealing with "cleaning up" after using the ODP.NET.

Public Shared Function GetDevicesByType(ByVal AConnectString as String, ByVal ADeviceType As Integer) As List (Of Device)   Dim lDeviceList As List (Of Device) = New List (of Device)   Dim lReturnOraParam As OracleParameter  Dim lDbConn As New OracleConnection(AConnectString)  Dim lOraCmd As New OracleCommand  Dim lOraDataAdapt As New OracleDataAdapter  Dim lOraDataTab As New DataTable  Try     lDbConn.Open()     lDbConn.Connection = lDbConn     lDbConn.CommandType = CommandType.StoredProcedure     lDbConn.CommandText = "get_device_plsql.get_device"      lReturnOraParam = lDbConn.Parameters.Add("rtn_code", OracleDbType.Int32, 10, ParameterDirection.ReturnValue)      lDbConn.Parameters.Add("device_type", OracleDbType.Int32, 50, ADeviceType, ParameterDirection.Input)     lDbConn.Parameters.Add("device_rc", OracleDbType.RefCursor, ParameterDirection.Output)      ' call plsql     lDbConn.ExecuteNonQuery()      lOraDataAdapt.SelectCommand = lDbConn     lOraDataAdapt.Fill(lOraDataTab)      For Each oRow As DataRow In lOraDataTab.Rows        AddDevice(lDeviceList, oRow)     Next   Catch ex As Exception      Throw New Exception("Get Device Failure: ", ex)   Finally      lDbConn = Nothing      If Not lOraDataTab Is Nothing Then        lOraDataTab.Dispose()     End If     lOraDataTab = Nothing      If Not lOraDataAdapt Is Nothing Then        lOraDataAdapt.Dispose()    End If    lOraDataAdapt = Nothing     If Not lDbConn Is Nothing AndAlso lDbConn.State = ConnectionState.Open Then       lDbConn.Close()       lDbConn.Dispose()   End If   lDbConn = Nothing  End Try   Return lDeviceList  End Function

There however was one important set of objects that we did not properly dispose of the parameters and more importantly their values. We made a decision at the outset of our last rewrite of our front-end to build our data layer retrieving data using a REF CURSOR.

 For Each p As OracleParameter In cmd.Parameters  If TypeOf p Is IDisposable Then    If TypeOf p.Value Is IDisposable Then      DirectCast(p.Value, IDisposable).Dispose()    End If    DirectCast(p, IDisposable).Dispose()  End If Next

The importance to making sure all of the parameters and their values are properly disposed is because the ODP uses both managed and unmanaged code. It is the unamanaged code that requires an explicit call to Dispose so that it is not left up to the mercy of the garbage collector.

Sunday, August 21, 2011

Troubleshooting IIS7 Worker Thread Crashes (w3wp.exe) with a FaultingModule of ntdll.dll

Over the past several months we have begun seeing unexpected crashes with IIS7 and specifically the w3wp.exe process. Unfortunately for us when this happens it of course takes down the associated application pool and with it an extensive cache that we use to optimize delivery of data to our clients. Upon restart, the application must rebuild this cache which takes time and prevents our users from using the application until its cache is rebuilt. Obviously any time something like this happens it is a big deal, but because of the time to rebuild our application's cache it heightened the urgency to try to determine the cause.
We started with the Windows event log to look for application errors. As plain as day there was an error that in fact pointed to the w3wp.exe process dying with the faulting module being ntdll.dll. The process termination code was 0xc0000374 or STATUS_HEAP_CORRUPTION. The description provided by Microsoft in their documentation for the code did not provide much help stating simply "A heap has been corrupted". And while we were able to at least determine that there was a culprit, there wasn't a smoking gun pointing to what process or more importantly what code was at fault.

Doing some additional research, there are many varying explanations for what might cause the heap to become corrupted. While reading what others had to say, I kept thinking how is this even possible from within an ASP.NET web application? To better understand it let's discuss heap corruption in general. Common causes are what you think they would be - buffer overrun (writing outside of your addressable memory space), a double free (attempting to free a pointer twice) or attempting to re-use already freed objects. What also appeared a bit odd to me was that in Windows it is quite possible for a block in the heap for a process to be corrupted and its execution to continue. It is not until the process hits the corrupted block in its heap that it crashes also making it a bit more tricky to troubleshoot.

Okay, but I still have not even addressed how in the world this could be affecting our ASP.NET web application. Maybe I am in the minority but my initial instincts were that it was something we were doing that was causing the IIS worker process to crash. In an interesting blog entry Microsoft's David Wang when describing the origin for their Debug Diagnostic tool (more about that later) tries to address developers urge to put the blame elsewhere by giving the following statistic from their support of IIS:

PSS statistics show that over 90% of their "IIS is crashing/hanging" cases resolve to 3rd party code doing the crashing/hanging. Just something for you to think about...

From the scouring the web, I found two examples of what we might be doing that would cause this sort of corruption. The first type of scenario more or less aligned with what you might expect. We might in our code be attempting to dispose of an object that had already been freed or trying to access an object that has already been disposed of or has not yet been created/allocated. Typically though when this occurs you see exceptions such as "Object reference not set to an instance of an object." so while it was a possibility it was still safe to assume its ability to cause the w3wp process to crash would be remote.

The other scenario that I found described was a case where you might have a runaway process that is caught in some infinite loop or non-terminal recursion logic. Again, this scenario seemed a bit of a reach for our particular issue. While it was possible that certain conditions in production might cause this runaway doomsday scenario it didn't stand to reason it would have not manifested itself during our unit testing or with our QA group. Regardless of the cause, the advice to find a solution was try to obtain a stack trace. I had to go to many different sources to compile a series of troubleshooting steps that would get me to a stack trace. The remainder of this post will detail the step-by-step approach I used to complete my quest to obtain a stack trace and find the offending cuprit.

Whenever I start out troubleshooting an issue that is only occuring in production I always try to start with the least intrusive method I can to minimize the impact (e.g., not having to install any other software on our production web servers/etc.). One such method I found was using the Windows Error Reporting (WER) Console to look back at past crashes to attempt to retrieve an associated stack trace file and to see if there is a solution (e.g., hotfix) from Microsoft. The console can be launched from the command line by typing WERCON.

Clicking on the "View Problem History" will display a list of problems (i.e., in our case crashes) that Windows has identified. The most recent process was an issue with an IIS Worker Process and had a date/time stamp that matched the application error in the event log.

Double-clicking on the process provides even more detail on the issue pointing to a crash of the w3wp.exe process. A few interesting pieces of information can be obtained from the details. This particular server was not setup to capture stack traces (i.e., dump files) when its IIS worker processes crashes (otherwise these would have been listed in the "Files that help descript the problem" section). This probably points to a configuration / setup with the server. More information on how to configure this can be found here. For the time being, agreing with my make no changes mantra, I opted not to set this up.

Also, there was not a specific DLL or process referenced in the "Fault Module Name" field. I would have expected ntdll.dll to be referenced but instead if refers to StackHash_029a. An in-depth explaniation for the StackHash reference can be found at this link and I think the following their description below is solid:

Therefore, StackHash is not a real module. It is a constructed name because the Instruction Pointer was pointing to no known module at the time of the crash. The number after the StackHash_ is a semi-unique number calculated at the time of the crash such that if the same crash occurred on multiple PCs then they have a reasonable chance to be correlated.

One other interesting aspect of their post is that it described potential causes that might be outside of our code (to include the 3rd party DLLs that we interact with such as Oracle's ODP.NET). An entirely new set of culprits such as other COTS software installed on our production web servers (in particular security and virus scanning software), a virus, faulty hardware whether it is a HD or memory could be causing the IIS worker processes to crash. A bit unerving, but as other reserach had pointed out the folks at StackHash also recommended to start with a stack trace.

So now that we had reached a point of still not having a stack trace, advice for the next step all seemed to point towards installing Microsoft's Debug Diagnostic (DebugDiag) tool. It can be downloaded from here. In addition to providing the ability to capture a stack trace when one of your IIS worker process crashes, the tool provides a slew of other diagnostic capabilities that in general let you evaluate slow/hanging processes and potential memory leaks. Below is the almost stupidly easy sequence to add a rule to track when an IIS related process/thread dies.

While installing and setting up the DebugDiag it occurred to me that it might be more useful to track performance issues related to the responsiveness of our applicatins and IIS. Again, adding a rule from the tool proved very simple with the most difficult issue being able to find a resting place large enough for the dumps that are produced. We selected the default timeout of two minutes to trigger a dump as none of our requests should take longer than that to complete.

While our IIS thread crashes were intermittant, it actually did not take very long for the peformance rule to trigger. Once we had captured a set of dump files, the DebugDiag tool provides a way to initiate an analysis of the dump file.

For us, the full dump when produced (required an update to rule as this was not set by default in sequence above) was much more interesting. We in fact were able to narrow down our search to Oracle's Data Access Components (ODAC). In particular we determined that when requests did hang they tended to do so in the finalize/dispose methods within the ODAC. This at least provided us a clue where our issues were within our application and initiated a new troubleshooting effort focused on our data layer libraries that integrate w/ the ODAC (to be detailed in a future post).

00000000`04ebef40 000007ff`002c3039 Oracle_DataAccess!Oracle.DataAccess.Types.OracleRefCursor.Dispose(Boolean)+0x1a3 00000000`04ebf090 000007fe`f85b14a6 Oracle_DataAccess!Oracle.DataAccess.Types.OracleRefCursor.Finalize()+0x19 00000000`04ebf0d0 000007fe`f8448b61 mscorwks!FastCallFinalizeWorker+0x6 00000000`04ebf100 000007fe`f8448dac mscorwks!FastCallFinalize+0xb1 00000000`04ebf160 000007fe`f844d8a4 mscorwks!MethodTable::CallFinalizer+0xfc 00000000`04ebf1a0 000007fe`f84ba58b mscorwks!SVR::CallFinalizer+0x84 00000000`04ebf200 000007fe`f84ba32b mscorwks!SVR::DoOneFinalization+0xdb 00000000`04ebf2d0 000007fe`f856a75b mscorwks!SVR::FinalizeAllObjects+0x9b 00000000`04ebf390 000007fe`f8509374 mscorwks!SVR::FinalizeAllObjects_Wrapper+0x1b 00000000`04ebf3c0 000007fe`f8402045 mscorwks!ManagedThreadBase_DispatchInner+0x2c 00000000`04ebf410 000007fe`f8516139 mscorwks!ManagedThreadBase_DispatchMiddle+0x9d 00000000`04ebf4e0 000007fe`f83cc985 mscorwks!ManagedThreadBase_DispatchOuter+0x31 00000000`04ebf520 000007fe`f850ef1f mscorwks!ManagedThreadBase_DispatchInCorrectAD+0x15 00000000`04ebf550 000007fe`f87f54f1 mscorwks!Thread::DoADCallBack+0x12f 00000000`04ebf6b0 000007fe`f84ba680 mscorwks!ManagedThreadBase_DispatchInner+0x2ec1a9 00000000`04ebf700 000007fe`f84ba32b mscorwks!SVR::DoOneFinalization+0x1d0 00000000`04ebf7d0 000007fe`f84f2ebd mscorwks!SVR::FinalizeAllObjects+0x9b 00000000`04ebf890 000007fe`f8509374 mscorwks!SVR::GCHeap::FinalizerThreadWorker+0x9d 00000000`04ebf8d0 000007fe`f8402045 mscorwks!ManagedThreadBase_DispatchInner+0x2c 00000000`04ebf920 000007fe`f8516139 mscorwks!ManagedThreadBase_DispatchMiddle+0x9d 00000000`04ebf9f0 000007fe`f853cf1a mscorwks!ManagedThreadBase_DispatchOuter+0x31 00000000`04ebfa30 000007fe`f850f884 mscorwks!ManagedThreadBase_NoADTransition+0x42 00000000`04ebfa90 000007fe`f85314fc mscorwks!SVR::GCHeap::FinalizerThreadStart+0x74 00000000`04ebfad0 00000000`779cbe3d mscorwks!Thread::intermediateThreadProc+0x78 00000000`04ebfba0 00000000`77b06a51 kernel32!BaseThreadInitThunk+0xd 00000000`04ebfbd0 00000000`00000000 ntdll!RtlUserThreadStart+0x21

Sunday, August 7, 2011

Mogrify Me

In my copious amounts of free time I maintain a website for our twin girls. Although it is now dedicated to my children, it was created years ago before the proliferation of photo sharing sites as a place where I could post pictures of my wife and myself travelling. I use a PHP based photoblogging package called PixelPost (http://www.pixelpost.org/) and in general like having the freedom to control how the website is locked down (I opted for minimalist approach and use basic authentication in Apache) ensuring photos of the kids are not readily accessible to the all powerful Google and others (naieve thinking I suppose but maybe being a very small fish in a large pond has its advantages). In retrospect if I had to do it over again perhaps I would have become an early adopter of an online service like Flickr or others but it serves its purpose and my relatives are hooked on using it.
Fast forward to this past Christmas where I jumped headfirst into the wonderful world of digital SLRs. After quite a bit of research I purchased a Canon EOS REBEL T2i because at the time it was the only model in its class that had full HD video. Mainly though I just wanted it for its rediculously fast FPS ensuring I wouldn't miss "the perfect photo" with the twins running around at full speed. At the same time I purchased the SLR, I also bought an Eye-Fi SD Card (http://www.eye.fi/) for the camera.
The capabilities in the Eye-Fi card presented a unique opportunity to play towards my laziness with the twins website. The greatest chore I found with the site was taking the SD Card from my camera out and copying up whatever photos I wanted to publish. This card would remove that step and with a bit of extra work I could automate the whole process (as a side note, the Eye-Fi also has an online sharing feature that allows you to integrate with Flickr, Facebook, Picassa and MobileMe). Using the wireless ftp feature of the card I was able to setup a transfer of images I protected on the camera to my Mini-Mac. I then created a few cron scripts that take the uploaded images and push them to the twins website.

This is where things ended up getting interesting. Along with storing the full image for backup/recovery purposes, I wanted to be able to script an automatic resize of the image to 800x600 upon upload as well as rotate it if needed (i.e., if it was taken as a portrait) and create a thumbnail image (for the PixelPost Admin interface). Using the ImageMagick package (and specifically the mogrify command) this in fact was pretty painless. Again, my whole goal in this was to make a completely automated process from the point I protect the image on my camera to when the images are published on the twins website.
Resizing was as easy as issuing the following command:
mogrify -resize 800x600 $destimg
Rotating was just a bit harder as I used the following PHP script I created based on source in PixelPost to extract out the EXIF data for the image:

<!--?php date_default_timezone_set('UTC');$img = "$argv[1]";$myorient = copy_folder_get_exif_orient($img);function copy_folder_get_exif_orient($image_name){ if (file_exists("/var/www/html/twinswebsite/includes/exifer1_5/exif.php")) require_once("/var/www/html/twinswebsite/includes/exifer1_5/exif.php"); $curr_image = $image_name; if (function_exists('read_exif_data_raw')) { $exif_result = read_exif_data_raw($curr_image,"0"); $orient = $exif_result['IFD0']['Orientation']; // Date and Time if ($orient=='')  $orient = 'bad'; } else $orient = "bad"; return $orient;}?>

From the extracted EXIF data, it was possible to parse the orientation of the image:

theData=`php /var/www/html/twinswebsite/admin/ext_exif_orient.php $destimg`theDegrees=`echo "$theData" | awk '{print $1}'`

If the orientation of the image is portrait, the script is able to rotate the image as appropriate:

if [ $theDegrees == "Normal" ]then echo "$fullimg is landscape no rotation necessary"else theDirection=`echo "$theData" | awk '{print $3}'` if [ $theDirection == "CW" ] then mogrify -rotate 270 $destimg mogrify -rotate 270 $thumbimg else mogrify -rotate 90 $destimg mogrify -rotate 90 $thumbimg fifi

It is amazing sometimes how sheer lazyness pays off in the end. Now instead of images sitting on my camera, I really do not have any excuse not to find images I want to publish, protect them, and have them be published automatically. The entire process end-to-end takes only a few minutes from the time I protect an image to when it is published on my PixelPost photoblog.
-->

Friday, July 29, 2011

Embrace the Black Box

In the past I have sat in meetings where other groups within my company are discussing a project or deliverable from my team and someone (whether it is someone in QA or our customer support team) will say "it works as you said it would, but it all is just a black box to me...". While many times this is said with a sense of frustration (especially by those who are a bit more technical and want to understand what is going on), when I hear the statement it always puts a smile on my face and makes me feel as though what we have delivered is something of quality and was designed and implemented the right way.
In general I think it is incumbent on developers to deliver code that is a black box. It should be part of our DNA. Using a proper level of abstraction is crucial to producing solid code. In some sense the concept of abstraction is comprised of so many other concepts that contribute to solid and reliable software development. To illustrate this point, friend of mine recently reminded me how inseparable the concepts of Abstraction and Separation of Concerns (SoC) are and of course how disastrous it one does not follow this practice.
As Wikipedia's definition so perfectly describes SoC, I will just quote their definition:

Separation of concerns (SoC) is the process of separating a computer program into distinct features that overlap in functionality as little as possible. A concern is any piece of interest or focus in a program. Typically, concerns are synonymous with features or behaviors. Progress towards SoC is traditionally achieved through modularity of programming and encapsulation (or "transparency" of operation), with the help of information hiding.

What is fascinating about the whole concept of a black box in software development is that it can exist on so many levels. As a developer I can provide abstraction in a class by using encapsulation. For example, if I am responsible for a class in our data access layer that writes our user object, anyone that is interacting with that class should really care less what the underlying data structures are that represent our user in the database. If we make changes to how we store that data, proper encapsulation will help to ensure all of the code that interfaces with my class should not have to change if we make a change to the database such as finding a better way to store a user's contact information.
This concept of abstraction should in fact extend further out into an application's implementation. At my current company we currently have custom development that is done by our customer support group. This development may be as simple as writing a report or as complex as working on an internal application that is used by our support analysts to provide customers feedback on issues that are specific to their data. For the same reason that abstraction worked with encapsulation, ensuring we provide a "black box" interface into our application and its data is critical. We usually implement this abstraction using web services. Just as a developer calling my class should not need to know how a user is stored in the database, neither should any external application.
So just remember, the next time someone tells you that something you delivered is a "black box", smile and reply with a simple "thank you".

Thursday, July 28, 2011

The Power of the Oracle SIGN Function and Dates

Although I know this has been written about before, I thought it would still be worthwhile posting how the SIGN PL/SQL function can be inlined within a query to produce some really cool results when dealing with dates. We have a table that tracks a start date and end date for enrollments in our application.

The following table structure is more or less what we use:
SQL> desc APPLICATION_ENROLLMENT;
Name Null? Type
----------------------------------------- -------- ----------------
ENROLLMENT_ID NOT NULL NUMBER
USER_ID NOT NULL NUMBER
REC_START_DT DATE
REC_END_DT DATE

When someone is currently enrolled we set the end date to 12/31/9999. So a very simple query might want to determine which enrollments are currently active. This is easy enough as you can simply query out all of the records whose enrollment end dates are sufficiently in the future (as an aside, we could have also opted to use NULL values in the end date but felt using an actual value would be more efficient for indexing purposes for queries that operate off of both dates when searching for ranges of records that match (e.g., show me all enrollments between these two dates).

e.g., SELECT * FROM APPLICATION_ENROLLMENT WHERE REC_END_DT > sysdate + 365

Yet there are times when we have found that we want in the query to select out a value that is based on determining whether the enrollment is active. Our most recent example came up while trying to satisfy a question regarding how many of our enrollments started or ended in the month of December. We came up with the following very simple query that used the SIGN function to both filter out end dates that were in the future (because you might get a false positive for the month of December) and that allowed us to count the total number of matching enrollments.

select sum( decode( to_number(to_char(rec_start_dt, 'MM')) , 12, 1, 0,  decode( SIGN( rec_end_dt - (sysdate + 365) ), 1, 0,  decode( to_number(to_char(rec_end_dt, 'MM')),  12, 1, 0) ) ) ) december_match from tracked_offender_device

By using the SIGN function we were able to translate the end date into a value (indicating whether it was in the future or not) that that we could then feed to a DECODE such that we could filter those end dates out. The rest of the query simply evaluates whether either the start date or remaining end dates occur in the month of December and if either do it returns a '1' that can then be summed to produce the result.

Tuesday, July 26, 2011

Diagnosing a MySQL CPU Bottleneck on Linux

We currently run our application's ETL stack on Fedora 12 (yes I know we are a bit behind release wise) and encountered what we assumed to be a CPU bottlneck. Specifically we were experiencing a situation where where a process that reads requests out of a queue in a MySQL database, sends these to an external server, and then deletes the requests was backing up. We also observed that when a second unrelated process called he IntegrationProcessController is running (it does not even access the same database), the first process will experience the back up. When the IPC is shut down, the back ups clear.
Our first inclination whenever we run into a bottlneck is to get a mile high picture of the server by looking at the output from top. Sure enough we were able to observe periods where the IPC spikes CPU wise. Certainly not desirable.

top - 15:06:52 up 125 days, 15:53,  3 users,  load average: 0.49, 0.52, 0.46 Tasks: 307 total,   2 running, 305 sleeping,   0 stopped,   0 zombie Mem:  16452452k total,  2931216k used, 13521236k free,   441672k buffers Swap:  8388600k total,        0k used,  8388600k free,  1067508k cached  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND 25133 etl      20   0  358m  22m  16m R 124.6  0.1  20:15.88 IntegrationProc 25826 etl      20   0 1714m 144m  16m S 34.5  0.9   2401:06 vtstage1process 2149 mysql     20   0 8397m  90m 7124 S 22.5  0.6  51483:34 mysqld 25822 etl      20   0 9389m  24m 6156 S  7.3  0.2 489:47.40 tdccquemgr

So the next question someone might ask, what exactly does the 124% CPU figure represent? Most servers today of coure have multiple processors and each of those processors is undoubtedly multi-core. There is a file on the operating system (/proc/cpuinfo) that provides all of the information you would need to determine the number of physical CPUs as well as cores layout. The following command will determine the number of physical CPUs.
[fiefdom@srvr]$ cat /proc/cpuinfo | grep “physical id” | sort | uniq | wc -l2
And the following command will determine how many cores per CPU.
[fiefdom@srvr]$ cat /proc/cpuinfo | grep "cpu cores" | uniq cpu cores : 4
So we have a server that has 2 physical CPUs with 4 cores each and with hyperthreading we get to 16 virtual CPUs. The next step would be to try to determine how those 16 virtual CPUs are being used. When one is viewing the output from top pressing a '1' will break out usage by individual CPU. When I did that for the output above, below is the CPU utilization picture it returned.

top - 15:06:52 up 125 days, 15:53,  3 users,  load average: 0.49, 0.52, 0.46 Tasks: 307 total,   2 running, 305 sleeping,   0 stopped,   0 zombie Cpu0  : 11.2%us,  3.2%sy,  0.0%ni, 84.6%id,  0.0%wa,  0.0%hi,  1.0%si,  0.0%st Cpu1  : 20.8%us,  2.9%sy,  0.0%ni, 75.2%id,  0.0%wa,  0.0%hi,  1.0%si,  0.0%st Cpu2  :  9.5%us,  1.0%sy,  0.0%ni, 89.2%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st Cpu3  : 11.1%us,  2.3%sy,  0.0%ni, 85.9%id,  0.0%wa,  0.0%hi,  0.7%si,  0.0%st Cpu4  :  1.6%us,  1.3%sy,  0.0%ni, 96.4%id,  0.0%wa,  0.3%hi,  0.3%si,  0.0%st Cpu5  :  3.3%us,  1.3%sy,  0.0%ni, 95.1%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st Cpu6  :  3.9%us,  0.6%sy,  0.0%ni, 95.1%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st Cpu7  :  2.0%us,  1.3%sy,  0.0%ni, 96.4%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st Cpu8  :  1.6%us,  1.0%sy,  0.0%ni, 96.8%id,  0.0%wa,  0.0%hi,  0.6%si,  0.0%st Cpu9  :  4.9%us,  1.0%sy,  0.0%ni, 93.8%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st Cpu10 : 67.5%us,  0.3%sy,  0.0%ni, 32.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st Cpu11 :  6.6%us,  1.3%sy,  0.0%ni, 91.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st Cpu12 :  1.6%us,  1.6%sy,  0.0%ni, 96.4%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st Cpu13 :  3.6%us,  1.6%sy,  0.0%ni, 94.4%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st Cpu14 : 23.4%us,  0.3%sy,  0.0%ni, 75.9%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st Cpu15 :  2.3%us,  0.7%sy,  0.0%ni, 97.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st Mem:  16452452k total,  2931216k used, 13521236k free,   441672k buffers Swap:  8388600k total,        0k used,  8388600k free,  1067508k cached  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND 25133 etl      20   0  358m  22m  16m R 124.6  0.1  20:15.88 IntegrationProc 25826 etl      20   0 1714m 144m  16m S 34.5  0.9   2401:06 vtstage1process 2149 mysql     20   0 8397m  90m 7124 S 22.5  0.6  51483:34 mysqld 25822 etl      20   0 9389m  24m 6156 S  7.3  0.2 489:47.40 tdccquemgr

Although the Linux man page provides a very bland description of "Report virtual memory statistics", vmstat is in fact a very useful tool that builds upon the output from top to get a better picture of system utilization. During our performance bottlneck, the following was the output we saw from a vmstat run every 2 seconds (keeping in mind the first output represents a summary):

[fiefdom@srvr]$ vmstat 2procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu----- r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st11  5      0 1293384 505192 45393768    0    0     0    11    1    0 15  3 76  6  0 2  0      0 1299508 505192 45396968    0    0     0     0 35527 17776 27  5 62  6  0 3  2      0 1077540 505192 45401120    0    0     0     8 39603 19231 27  5 59  8  0 4  0      0 736428 505192 45402868    0    0     0    38 24274 9095 25  5 67  3  0 3  0      0 971116 505192 45404452    0    0     0     0 17711 7613 19  4 72  5  0 6  5      0 1396360 505192 45406652    0    0     0    52 19798 8004 26  3 66  5  0 9  9      0 1512024 505192 45410908    0    0     0     0 23491 12507 23  3 66  7  0

There are couple of interesting clues that can be gleaned from the output above. First, there certainly is a spike in CPU for user processes. Much more interesting than that however are the "r" and "b" values. Values in the "r" column indicate the number of processes that were waiting to be run during the polling period. A general rule is that as load increases on the server, more processes get queued and have to wait their turn to get CPU cycles. Even more important are values in the "b" column that represent the number of processes that were "blocked" waiting for I/O during the last polling period. When we ran vmstat against all of our other servers (not exhibiting the bottlneck) the second column virtually always displayed a '0' as it should.

http://ronaldbradford.com/blog/identifying-resource-bottlenecks-cpu-2009-03-31/

http://www.dba-oracle.com/application_server/t_app_server_vmstat.htm
http://gurkulindia.com/main/2011/05/performance-monitoring-vmstat-to-find-performance-bottleneck/
http://khaidoan.wikidot.com/vmstat
http://www.linuxprogrammingblog.com/io-profiling

Monday, July 18, 2011

Requirements Versus Design

I thought it might be worth a quick post to describe a pet peeve of mine. On many of my past projects there is a person (let's call them a functional) whose role it is to gather requirements. In some cases this proves to be invaluable as it is that person who maintains a direct relationship with the customer so that I do not have to. Let's face it, developers in general are not always suited for interacting directly with a customer on a day-to-day basis. We don't like to wear ties, we prefer cans of Red Bull to caffe latte's and in general many of us prefer to be typing at eighty words a minute as opposed to having in depth conversations with a user.
My issue with functionals is that more often than not there is a misunderstanding on that person's part about what a requirement truly is and what it is not. As an example, let's say we are attempting to build a logging subsystem for my company's operations group to monitor the guts of our application. An okay requirement might be "users should have the ability to view 1 month of logs online and older data should be easilty restorable". Yet many times something I see something like the following provided: "we should create a consolidated table in our database for logging that is partitioned by month and also provide a way for our operatons group to easily export/import the table's data".
While I have of course overstated the prior example to a certain degree, it does amaze me how many times a "functional" person will interject implementation or design considerations into a requirement. At first my inclination was to believe this is done because there is a degree of mistrust on that person's part that development will meet the users need or perhaps they thought that the development team is "too busy" or even just that they do not think along the same lines and so to fix that "I'll just put the requirement in terms they will understand". This of course is not the case and if the development team is too busy (they aren't) or dim-witted to handle requirements as fast as they are fed to them then it is probably safe to say that they should be replaced by others who will or you have way too many functionals on your hands.
In reality it should be the development team's job to ask questions based on the requirements to arrive at a design. In the prior example, perhaps the proposed design by development doesn't even involve a database to implement a logging subsystem. Multiple applications may submit to the same logging subsystem and this while to the user they may seem like a "consolidated data store" what goes on behind the sences might be anything but a single table that stores the data. A better approach would be to start with a good set of requirements and continue the analysis by classifying actors (of which the system administrator ueser is one) and use cases. In fact a great place for development and functionals to colaborate would be on the actors and/or use cases - again all of this takes place before the design phase begins.
Having said all of this, a more important and dangerous ramification of this misunderstanding by functionals is that by jumping right to the deisgn, requirements are missed. In some ways it is analgous to a developer who just starts coding w/o working out a design. While a full tracabililty matrix is overkill on most projects today, if you don't set forth a basic set of requirements at the outset how does one know when a project is actually finished? Just like it is too easy for a coder to pull up his IDE of choice and start punching keys, it all too often is much easier to start talking about a design before a solid set of requirements are defined.

Friday, July 15, 2011

Appreciation for One's Technical Contributions

Someone once coined the phrase (more or less paraphrasing) I certainly agree with:

Great leaders learn from being great followers

In many ways knowing what has made me just "click" with my prior managers helps me be a better manager of others. One of the greatest feelings as a follower is knowing your manager that truly appreciates your contributions. This appreciation also I think has to go beyond how it impacts the business or customer base.

For gifted developers, it is important they have managers that have fought themselves in the trenches and have garnered respect from their employees based on their own technical abilities. Ideally a great technical manager can pass still on expertise to even a senior developer.

Even today as I serve as a very senior member of my company's technical arm, it is still important for those above me to be able to appreciate the technical quality of what is produced by my group. Given where I am in my career this does seem to become more and more difficult to find but still is a driving factor determining whether a job is the right one for me.

Monday, July 11, 2011

Mutexes or Locking in MySQL

I ran across this feature in MySQL for a project where we needed the ability to ensure that only a single process was updating an entire MySQL database server at one point in time. We also had to provide this assurance across servers and possibly across servers at multiple locations.
MySQL in fact offers a function called GET_LOCK( str lock_name, num timeout) as well as RELEASE_LOCK(str lock_name). If no other process curently have a lock identified by the name provided, MySQL returns a '1' indicating the lock attempt was successful.

mysql> select get_lock('abc', 10);
+---------------------+
| get_lock('abc', 10) |
+---------------------+
| 1 |
+---------------------+
1 row in set (0.04 sec)

Once locked, no other process can obtain a lock using that lock name until the process holding the lock releases it (using RELEASE_LOCK) or exits and disconnects from MySQL (whether on its own or abnormally). When executing the GET_LOCK SQL where the lock is not free, the SQL enters a continuous re-try loop attempting to obtain the lock for a duration equal to the timeout specified. If not sucessful, the SQL returns '0' which is of course very easy to check as a result.

mysql> select get_lock('abc', 10);
+---------------------+
| get_lock('abc', 10) |
+---------------------+
| 0 |
+---------------------+
1 row in set (10.00 sec)

This capability provded to be all that we needed to build a simple yet elegant way to ensure only a single process could be operating on the database at any given point in time. In addition, the unancipated exception that might cause our process to die would not hold the lock thus allowing subsequent processes the ability to continue on.

Liftoff!

Well this is it. My foray into the world of maintaining a blog for all to see. I really created this blog mainly to have an area where I could post ideas and thoughts that pop into my head based on my day-to-day experience in the world of software development. While that is a very large and ambiguous definition for my blog content I also did not want to set boundaries as far what I could write about.