Friday, July 29, 2011

Embrace the Black Box


In the past I have sat in meetings where other groups within my company discuss a project or deliverable from my team, and someone (whether from QA or our customer support team) will say, "it works as you said it would, but it's all just a black box to me...".  While this is often said with a sense of frustration (especially by those who are a bit more technical and want to understand what is going on), the statement always puts a smile on my face and makes me feel as though what we delivered is something of quality, designed and implemented the right way.
In general I think it is incumbent on developers to deliver code that is a black box.  It should be part of our DNA.  Using a proper level of abstraction is crucial to producing solid code.  In some sense the concept of abstraction comprises so many other concepts that contribute to solid and reliable software development.  To illustrate this point, a friend of mine recently reminded me how inseparable the concepts of Abstraction and Separation of Concerns (SoC) are, and of course how disastrous it is when one does not follow this practice.
As Wikipedia's definition so perfectly describes SoC, I will just quote their definition:
Separation of concerns (SoC) is the process of separating a computer program into distinct features that overlap in functionality as little as possible. A concern is any piece of interest or focus in a program. Typically, concerns are synonymous with features or behaviors. Progress towards SoC is traditionally achieved through modularity of programming and encapsulation (or "transparency" of operation), with the help of information hiding.
What is fascinating about the whole concept of a black box in software development is that it can exist on so many levels.  As a developer I can provide abstraction in a class by using encapsulation.  For example, if I am responsible for a class in our data access layer that writes our user object, anyone interacting with that class should not have to care what the underlying data structures are that represent our user in the database.  If we make a change to how we store that data, such as finding a better way to store a user's contact information, proper encapsulation helps ensure that none of the code interfacing with my class has to change.
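To make that concrete, here is a minimal sketch (in Python, with hypothetical class and method names of my choosing) of how encapsulation keeps callers insulated from the storage details:

```python
# Hypothetical sketch: callers work only with save_user()/find_user(),
# so the underlying storage of contact info can change freely.

class User:
    def __init__(self, user_id, name, email):
        self.user_id = user_id
        self.name = name
        self.email = email

class UserRepository:
    """Black-box data access: callers never see the storage layout."""
    def __init__(self):
        # Today this is a flat dict; tomorrow it could be two normalized
        # tables or a document store.  Callers are unaffected either way.
        self._rows = {}

    def save_user(self, user):
        self._rows[user.user_id] = {"name": user.name, "email": user.email}

    def find_user(self, user_id):
        row = self._rows.get(user_id)
        if row is None:
            return None
        return User(user_id, row["name"], row["email"])
```

The point is not the dict, of course, but that `_rows` is private: reshaping it (say, splitting contact information into its own structure) touches only this class.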
This concept of abstraction should in fact extend further out into an application's implementation.  At my current company we have custom development that is done by our customer support group. This development may be as simple as writing a report or as complex as working on an internal application that our support analysts use to give customers feedback on issues specific to their data. For the same reason that encapsulation works within a class, ensuring we provide a "black box" interface into our application and its data is critical.  We usually implement this abstraction using web services.  Just as a developer calling my class should not need to know how a user is stored in the database, neither should any external application.
So just remember, the next time someone tells you that something you delivered is a "black box", smile and reply with a simple "thank you".

Thursday, July 28, 2011

The Power of the Oracle SIGN Function and Dates


Although I know this has been written about before, I thought it would still be worthwhile to post how Oracle's SIGN function can be inlined within a query to produce some really cool results when dealing with dates. We have a table that tracks a start date and an end date for enrollments in our application.

The following table structure is more or less what we use:
SQL> desc APPLICATION_ENROLLMENT; 
Name                                      Null?    Type 
----------------------------------------- -------- ---------------- 
ENROLLMENT_ID                             NOT NULL NUMBER 
USER_ID                                   NOT NULL NUMBER 
REC_START_DT                                       DATE     
REC_END_DT                                         DATE

When someone is currently enrolled we set the end date to 12/31/9999.  So a very simple query might need to determine which enrollments are currently active.  This is easy enough: simply query out all of the records whose enrollment end dates are sufficiently far in the future.  (As an aside, we could also have used NULL values in the end date, but felt using an actual value would be more efficient for indexing purposes in queries that operate off both dates when searching for ranges of records, e.g., show me all enrollments between two dates.)

e.g., SELECT * FROM APPLICATION_ENROLLMENT WHERE REC_END_DT > sysdate + 365

Yet there are times when we want the query itself to compute a value based on whether the enrollment is active.  Our most recent example came up while trying to answer a question about how many of our enrollments started or ended in the month of December.  We came up with the following very simple query that uses the SIGN function both to filter out end dates that are in the future (which would otherwise produce a false positive for the month of December) and to let us count the total number of matching enrollments.

select sum( decode( to_number(to_char(rec_start_dt, 'MM')), 12, 1,
            decode( sign( rec_end_dt - (sysdate + 365) ), 1, 0,
                    decode( to_number(to_char(rec_end_dt, 'MM')), 12, 1, 0) ) ) ) december_match
  from application_enrollment

By using the SIGN function we were able to translate the end date into a value (indicating whether or not it was in the future) that we could then feed to a DECODE to filter those end dates out. The rest of the query simply evaluates whether either the start date or the remaining end dates fall in the month of December, and if either does it returns a '1' that can then be summed to produce the result.
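For readers who find the nested DECODEs hard to follow, the same logic can be written out procedurally.  This is a Python translation of the query's intent (function and parameter names are mine, purely for illustration), not something we run in production:

```python
from datetime import date

def december_match(rec_start_dt, rec_end_dt, today=None):
    """Mirror of the nested DECODE logic: count an enrollment if it
    started in December, or ended in December with an end date that is
    not the far-future 12/31/9999 sentinel."""
    today = today or date.today()
    if rec_start_dt.month == 12:
        return 1
    # SIGN(rec_end_dt - (sysdate + 365)) = 1 means the end date is more
    # than a year out, i.e. the sentinel value -> filter it out.
    if (rec_end_dt - today).days > 365:
        return 0
    return 1 if rec_end_dt.month == 12 else 0

# The SUM(...) in the query corresponds to summing this over all rows:
# total = sum(december_match(s, e) for s, e in enrollment_rows)
```

Walking through the branches in this order (start month, sentinel check, end month) reproduces exactly what each level of the DECODE does.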

Tuesday, July 26, 2011

Diagnosing a MySQL CPU Bottleneck on Linux


We currently run our application's ETL stack on Fedora 12 (yes, I know we are a bit behind release-wise) and encountered what we assumed to be a CPU bottleneck. Specifically, a process that reads requests out of a queue in a MySQL database, sends them to an external server, and then deletes the requests was backing up.  We also observed that when a second, unrelated process called the IntegrationProcessController (IPC) is running (it does not even access the same database), the first process experiences the backup.  When the IPC is shut down, the backups clear.
Our first inclination whenever we run into a bottleneck is to get a mile-high picture of the server by looking at the output from top.  Sure enough, we were able to observe periods where the IPC spikes CPU-wise.  Certainly not desirable.
top - 15:06:52 up 125 days, 15:53,  3 users,  load average: 0.49, 0.52, 0.46
Tasks: 307 total,   2 running, 305 sleeping,   0 stopped,   0 zombie
Mem:  16452452k total,  2931216k used, 13521236k free,   441672k buffers
Swap:  8388600k total,        0k used,  8388600k free,  1067508k cached

  PID USER  PR NI  VIRT  RES  SHR S  %CPU %MEM     TIME+ COMMAND
25133 etl   20  0  358m  22m  16m R 124.6  0.1  20:15.88 IntegrationProc
25826 etl   20  0 1714m 144m  16m S  34.5  0.9   2401:06 vtstage1process
 2149 mysql 20  0 8397m  90m 7124 S  22.5  0.6  51483:34 mysqld
25822 etl   20  0 9389m  24m 6156 S   7.3  0.2 489:47.40 tdccquemgr
So the next question someone might ask: what exactly does the 124% CPU figure represent?  Most servers today of course have multiple processors, and each of those processors is undoubtedly multi-core.  There is a file on the operating system (/proc/cpuinfo) that provides all of the information you need to determine the number of physical CPUs as well as the core layout.  The following command will determine the number of physical CPUs.
[fiefdom@srvr]$ cat /proc/cpuinfo | grep "physical id" | sort | uniq | wc -l
2
And the following command will determine how many cores per CPU.
[fiefdom@srvr]$ cat /proc/cpuinfo | grep "cpu cores" | uniq
cpu cores : 4
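The two commands above can be rolled into one small script.  Here is a sketch that parses the same fields out of /proc/cpuinfo (the field names are the ones shown on our Fedora boxes; other kernels may format this file slightly differently):

```python
def cpu_topology(cpuinfo_text):
    """Return (physical_cpus, cores_per_cpu, logical_cpus) parsed from
    the contents of /proc/cpuinfo."""
    physical_ids = set()
    cores_per_cpu = 0
    logical = 0
    for line in cpuinfo_text.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "processor":          # one entry per logical CPU
            logical += 1
        elif key == "physical id":      # one distinct id per socket
            physical_ids.add(value)
        elif key == "cpu cores":        # cores per physical package
            cores_per_cpu = int(value)
    return len(physical_ids), cores_per_cpu, logical

# On a real box:
# with open("/proc/cpuinfo") as f:
#     print(cpu_topology(f.read()))
```

On the server discussed here this would report 2 sockets, 4 cores per socket, and 16 logical CPUs.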
So we have a server that has 2 physical CPUs with 4 cores each, and with hyperthreading we get to 16 virtual CPUs.  The next step is to determine how those 16 virtual CPUs are being used.  While viewing the output from top, pressing '1' will break out usage by individual CPU.  When I did that for the output above, below is the CPU utilization picture it returned.
top - 15:06:52 up 125 days, 15:53,  3 users,  load average: 0.49, 0.52, 0.46
Tasks: 307 total,   2 running, 305 sleeping,   0 stopped,   0 zombie
Cpu0  : 11.2%us,  3.2%sy,  0.0%ni, 84.6%id,  0.0%wa,  0.0%hi,  1.0%si,  0.0%st
Cpu1  : 20.8%us,  2.9%sy,  0.0%ni, 75.2%id,  0.0%wa,  0.0%hi,  1.0%si,  0.0%st
Cpu2  :  9.5%us,  1.0%sy,  0.0%ni, 89.2%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu3  : 11.1%us,  2.3%sy,  0.0%ni, 85.9%id,  0.0%wa,  0.0%hi,  0.7%si,  0.0%st
Cpu4  :  1.6%us,  1.3%sy,  0.0%ni, 96.4%id,  0.0%wa,  0.3%hi,  0.3%si,  0.0%st
Cpu5  :  3.3%us,  1.3%sy,  0.0%ni, 95.1%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu6  :  3.9%us,  0.6%sy,  0.0%ni, 95.1%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu7  :  2.0%us,  1.3%sy,  0.0%ni, 96.4%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu8  :  1.6%us,  1.0%sy,  0.0%ni, 96.8%id,  0.0%wa,  0.0%hi,  0.6%si,  0.0%st
Cpu9  :  4.9%us,  1.0%sy,  0.0%ni, 93.8%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu10 : 67.5%us,  0.3%sy,  0.0%ni, 32.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu11 :  6.6%us,  1.3%sy,  0.0%ni, 91.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu12 :  1.6%us,  1.6%sy,  0.0%ni, 96.4%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu13 :  3.6%us,  1.6%sy,  0.0%ni, 94.4%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu14 : 23.4%us,  0.3%sy,  0.0%ni, 75.9%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu15 :  2.3%us,  0.7%sy,  0.0%ni, 97.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  16452452k total,  2931216k used, 13521236k free,   441672k buffers
Swap:  8388600k total,        0k used,  8388600k free,  1067508k cached

  PID USER  PR NI  VIRT  RES  SHR S  %CPU %MEM     TIME+ COMMAND
25133 etl   20  0  358m  22m  16m R 124.6  0.1  20:15.88 IntegrationProc
25826 etl   20  0 1714m 144m  16m S  34.5  0.9   2401:06 vtstage1process
 2149 mysql 20  0 8397m  90m 7124 S  22.5  0.6  51483:34 mysqld
25822 etl   20  0 9389m  24m 6156 S   7.3  0.2 489:47.40 tdccquemgr
Although the Linux man page provides the very bland description "Report virtual memory statistics", vmstat is in fact a very useful tool that builds upon the output from top to give a better picture of system utilization.  During our performance bottleneck, the following was the output we saw from vmstat run every 2 seconds (keeping in mind that the first row of output represents a since-boot summary):
[fiefdom@srvr]$ vmstat 2
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd    free   buff    cache   si   so    bi    bo    in    cs us sy id wa st
11  5      0 1293384 505192 45393768    0    0     0    11     1     0 15  3 76  6  0
 2  0      0 1299508 505192 45396968    0    0     0     0 35527 17776 27  5 62  6  0
 3  2      0 1077540 505192 45401120    0    0     0     8 39603 19231 27  5 59  8  0
 4  0      0  736428 505192 45402868    0    0     0    38 24274  9095 25  5 67  3  0
 3  0      0  971116 505192 45404452    0    0     0     0 17711  7613 19  4 72  5  0
 6  5      0 1396360 505192 45406652    0    0     0    52 19798  8004 26  3 66  5  0
 9  9      0 1512024 505192 45410908    0    0     0     0 23491 12507 23  3 66  7  0
There are a couple of interesting clues that can be gleaned from the output above.  First, there certainly is a spike in CPU for user processes.  Much more interesting than that, however, are the "r" and "b" values.  Values in the "r" column indicate the number of processes that were waiting to be run during the polling period.  A general rule is that as load increases on the server, more processes get queued and have to wait their turn to get CPU cycles.  Even more important are values in the "b" column, which represent the number of processes that were blocked waiting for I/O during the last polling period.  When we ran vmstat against all of our other servers (those not exhibiting the bottleneck), the second column virtually always displayed a '0', as it should.
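A quick way to keep an eye on this condition across several servers is to script the check.  Here is a sketch that pulls the "b" column out of captured vmstat output (the field positions assume the standard layout shown above; the function name and threshold parameter are mine):

```python
def blocked_samples(vmstat_output, threshold=0):
    """Return the 'b' (blocked-on-I/O) values from vmstat output that
    exceed `threshold`, skipping the two header lines and the first
    sample row (which is a since-boot summary, not a live reading)."""
    flagged = []
    lines = vmstat_output.strip().splitlines()
    data = lines[2:]           # drop 'procs ...' banner and column headers
    for line in data[1:]:      # drop the since-boot summary row
        fields = line.split()
        b = int(fields[1])     # second column is blocked processes
        if b > threshold:
            flagged.append(b)
    return flagged
```

A healthy server should return an empty list; on the box above, it would flag every polling interval where processes sat blocked on I/O.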

Further reading:

http://ronaldbradford.com/blog/identifying-resource-bottlenecks-cpu-2009-03-31/
http://www.dba-oracle.com/application_server/t_app_server_vmstat.htm
http://gurkulindia.com/main/2011/05/performance-monitoring-vmstat-to-find-performance-bottleneck/
http://khaidoan.wikidot.com/vmstat
http://www.linuxprogrammingblog.com/io-profiling

Monday, July 18, 2011

Requirements Versus Design


I thought it might be worth a quick post to describe a pet peeve of mine.  On many of my past projects there is a person (let's call them a "functional") whose role is to gather requirements.  In some cases this proves invaluable, as that person maintains a direct relationship with the customer so that I do not have to.  Let's face it, developers in general are not always suited to interacting directly with a customer on a day-to-day basis.  We don't like to wear ties, we prefer cans of Red Bull to caffè lattes, and many of us would rather be typing at eighty words a minute than having in-depth conversations with a user.
My issue with functionals is that more often than not there is a misunderstanding on that person's part about what a requirement truly is and what it is not.  As an example, let's say we are attempting to build a logging subsystem for my company's operations group to monitor the guts of our application.  A reasonable requirement might be "users should have the ability to view 1 month of logs online, and older data should be easily restorable".  Yet many times I see something like the following provided instead: "we should create a consolidated table in our database for logging that is partitioned by month and also provide a way for our operations group to easily export/import the table's data".
While I have of course overstated the prior example to a certain degree, it does amaze me how many times a "functional" will interject implementation or design considerations into a requirement.  At first my inclination was to believe this is done out of a degree of mistrust that development will meet the user's needs, or a belief that the development team is "too busy", or even just that developers do not think along the same lines, so "I'll just put the requirement in terms they will understand".  This of course is not the case, and if the development team really is too busy (they aren't) or too dim-witted to handle requirements as fast as they are fed to them, then it is probably safe to say they should be replaced by others who can, or you have way too many functionals on your hands.
In reality it should be the development team's job to ask questions based on the requirements to arrive at a design.  In the prior example, perhaps the design proposed by development doesn't even involve a database to implement the logging subsystem.  Multiple applications may submit to the same logging subsystem, and while to the user it may seem like a "consolidated data store", what goes on behind the scenes might be anything but a single table that stores the data.  A better approach would be to start with a good set of requirements and continue the analysis by classifying actors (of which the system administrator user is one) and use cases.  In fact, a great place for development and functionals to collaborate is on the actors and/or use cases - and again, all of this takes place before the design phase begins.
Having said all of this, a more important and dangerous ramification of this misunderstanding by functionals is that by jumping right to the design, requirements are missed.  In some ways it is analogous to a developer who just starts coding without working out a design.  While a full traceability matrix is overkill on most projects today, if you don't set forth a basic set of requirements at the outset, how does one know when a project is actually finished?  Just as it is too easy for a coder to pull up his IDE of choice and start punching keys, it is all too often much easier to start talking about a design before a solid set of requirements is defined.

Friday, July 15, 2011

Appreciation for One's Technical Contributions

Someone once coined a phrase (I am more or less paraphrasing) that I certainly agree with:
Great leaders learn from being great followers
In many ways, knowing what made me just "click" with my prior managers helps me be a better manager of others.  One of the greatest feelings as a follower is knowing that your manager truly appreciates your contributions.  I also think this appreciation has to go beyond how your work impacts the business or customer base.

For gifted developers, it is important to have managers who have themselves fought in the trenches and have earned respect from their employees based on their own technical abilities.  Ideally, a great technical manager can still pass on expertise even to a senior developer.

Even today, as a very senior member of my company's technical arm, it is still important to me that those above me can appreciate the technical quality of what my group produces.  Given where I am in my career this does seem to become more and more difficult to find, but it is still a driving factor in determining whether a job is the right one for me.

Monday, July 11, 2011

Mutexes or Locking in MySQL


I ran across this feature in MySQL on a project where we needed to ensure that only a single process was updating an entire MySQL database server at any one point in time.  We also had to provide this assurance across servers, and possibly across servers at multiple locations.
MySQL in fact offers a function called GET_LOCK(str lock_name, num timeout) as well as RELEASE_LOCK(str lock_name).  If no other process currently holds a lock identified by the name provided, MySQL returns a '1' indicating the lock attempt was successful.


mysql> select get_lock('abc', 10); 
+---------------------+ 
| get_lock('abc', 10) | 
+---------------------+ 
| 1 | 
+---------------------+
1 row in set (0.04 sec)

Once locked, no other process can obtain a lock using that lock name until the process holding the lock releases it (using RELEASE_LOCK) or exits and disconnects from MySQL (whether on its own or abnormally).  When GET_LOCK is executed while the lock is not free, the statement blocks, retrying to obtain the lock for a duration equal to the timeout specified.  If not successful, it returns '0', which is of course very easy to check as a result.

mysql> select get_lock('abc', 10); 
+---------------------+ 
| get_lock('abc', 10) | 
+---------------------+ 
| 0 | 
+---------------------+ 
1 row in set (10.00 sec)

This capability proved to be all that we needed to build a simple yet elegant way to ensure that only a single process could be operating on the database at any given point in time.  In addition, an unanticipated exception that caused our process to die would not leave the lock held, allowing subsequent processes to continue on.
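To illustrate how this can be used from application code, here is a hedged Python sketch of a context manager wrapping GET_LOCK/RELEASE_LOCK.  It works against any DB-API 2.0 cursor connected to MySQL; the class name and timeout default are my own, not a standard API:

```python
class MySQLNamedLock:
    """Acquire a MySQL advisory lock on enter, release it on exit.
    `cursor` is any DB-API 2.0 cursor connected to a MySQL server."""

    def __init__(self, cursor, name, timeout_seconds=10):
        self.cursor = cursor
        self.name = name
        self.timeout = timeout_seconds

    def __enter__(self):
        # GET_LOCK returns 1 on success, 0 on timeout.
        self.cursor.execute(
            "SELECT GET_LOCK(%s, %s)", (self.name, self.timeout))
        (acquired,) = self.cursor.fetchone()
        if acquired != 1:
            raise TimeoutError(f"could not acquire lock {self.name!r}")
        return self

    def __exit__(self, exc_type, exc, tb):
        # Released explicitly here; MySQL also frees the lock when the
        # session dies, which is what protects us from crashed processes.
        self.cursor.execute("SELECT RELEASE_LOCK(%s)", (self.name,))
        return False

# Usage sketch:
# with MySQLNamedLock(cursor, "abc", 10):
#     ...  # only one process across all servers runs this at a time
```

The `with` block gives the same guarantee the post describes: the body runs only while the named lock is held, and the lock is freed whether the body succeeds, raises, or the process disconnects.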

Liftoff!


Well, this is it.  My foray into the world of maintaining a blog for all to see.  I created this blog mainly to have a place where I could post ideas and thoughts that pop into my head based on my day-to-day experience in the world of software development.  While that is a very large and ambiguous definition of my blog's content, I did not want to set boundaries as far as what I could write about.