We currently run our application's ETL stack on Fedora 12 (yes I know we are a bit behind release wise) and encountered what we assumed to be a CPU bottlneck. Specifically we were experiencing a situation where where a process that reads requests out of a queue in a MySQL database, sends these to an external server, and then deletes the requests was backing up. We also observed that when a second unrelated process called he IntegrationProcessController is running (it does not even access the same database), the first process will experience the back up. When the IPC is shut down, the back ups clear.
Our first inclination whenever we run into a bottlneck is to get a mile high picture of the server by looking at the output from top. Sure enough we were able to observe periods where the IPC spikes CPU wise. Certainly not desirable.
top - 15:06:52 up 125 days, 15:53, 3 users, load average: 0.49, 0.52, 0.46 Tasks: 307 total, 2 running, 305 sleeping, 0 stopped, 0 zombie Mem: 16452452k total, 2931216k used, 13521236k free, 441672k buffers Swap: 8388600k total, 0k used, 8388600k free, 1067508k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 25133 etl 20 0 358m 22m 16m R 124.6 0.1 20:15.88 IntegrationProc 25826 etl 20 0 1714m 144m 16m S 34.5 0.9 2401:06 vtstage1process 2149 mysql 20 0 8397m 90m 7124 S 22.5 0.6 51483:34 mysqld 25822 etl 20 0 9389m 24m 6156 S 7.3 0.2 489:47.40 tdccquemgr
So the next question someone might ask, what exactly does the 124% CPU figure represent? Most servers today of coure have multiple processors and each of those processors is undoubtedly multi-core. There is a file on the operating system (/proc/cpuinfo) that provides all of the information you would need to determine the number of physical CPUs as well as cores layout. The following command will determine the number of physical CPUs.
[fiefdom@srvr]$ cat /proc/cpuinfo | grep “physical id” | sort | uniq | wc -l2
And the following command will determine how many cores per CPU.
[fiefdom@srvr]$ cat /proc/cpuinfo | grep "cpu cores" | uniq cpu cores : 4
So we have a server that has 2 physical CPUs with 4 cores each and with hyperthreading we get to 16 virtual CPUs. The next step would be to try to determine how those 16 virtual CPUs are being used. When one is viewing the output from top pressing a '1' will break out usage by individual CPU. When I did that for the output above, below is the CPU utilization picture it returned.
top - 15:06:52 up 125 days, 15:53, 3 users, load average: 0.49, 0.52, 0.46 Tasks: 307 total, 2 running, 305 sleeping, 0 stopped, 0 zombie Cpu0 : 11.2%us, 3.2%sy, 0.0%ni, 84.6%id, 0.0%wa, 0.0%hi, 1.0%si, 0.0%st Cpu1 : 20.8%us, 2.9%sy, 0.0%ni, 75.2%id, 0.0%wa, 0.0%hi, 1.0%si, 0.0%st Cpu2 : 9.5%us, 1.0%sy, 0.0%ni, 89.2%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st Cpu3 : 11.1%us, 2.3%sy, 0.0%ni, 85.9%id, 0.0%wa, 0.0%hi, 0.7%si, 0.0%st Cpu4 : 1.6%us, 1.3%sy, 0.0%ni, 96.4%id, 0.0%wa, 0.3%hi, 0.3%si, 0.0%st Cpu5 : 3.3%us, 1.3%sy, 0.0%ni, 95.1%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st Cpu6 : 3.9%us, 0.6%sy, 0.0%ni, 95.1%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st Cpu7 : 2.0%us, 1.3%sy, 0.0%ni, 96.4%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st Cpu8 : 1.6%us, 1.0%sy, 0.0%ni, 96.8%id, 0.0%wa, 0.0%hi, 0.6%si, 0.0%st Cpu9 : 4.9%us, 1.0%sy, 0.0%ni, 93.8%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st Cpu10 : 67.5%us, 0.3%sy, 0.0%ni, 32.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu11 : 6.6%us, 1.3%sy, 0.0%ni, 91.7%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st Cpu12 : 1.6%us, 1.6%sy, 0.0%ni, 96.4%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st Cpu13 : 3.6%us, 1.6%sy, 0.0%ni, 94.4%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st Cpu14 : 23.4%us, 0.3%sy, 0.0%ni, 75.9%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st Cpu15 : 2.3%us, 0.7%sy, 0.0%ni, 97.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 16452452k total, 2931216k used, 13521236k free, 441672k buffers Swap: 8388600k total, 0k used, 8388600k free, 1067508k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 25133 etl 20 0 358m 22m 16m R 124.6 0.1 20:15.88 IntegrationProc 25826 etl 20 0 1714m 144m 16m S 34.5 0.9 2401:06 vtstage1process 2149 mysql 20 0 8397m 90m 7124 S 22.5 0.6 51483:34 mysqld 25822 etl 20 0 9389m 24m 6156 S 7.3 0.2 489:47.40 tdccquemgr
Although the Linux man page provides a very bland description of "Report virtual memory statistics", vmstat is in fact a very useful tool that builds upon the output from top to get a better picture of system utilization. During our performance bottlneck, the following was the output we saw from a vmstat run every 2 seconds (keeping in mind the first output represents a summary):
[fiefdom@srvr]$ vmstat 2procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu----- r b swpd free buff cache si so bi bo in cs us sy id wa st11 5 0 1293384 505192 45393768 0 0 0 11 1 0 15 3 76 6 0 2 0 0 1299508 505192 45396968 0 0 0 0 35527 17776 27 5 62 6 0 3 2 0 1077540 505192 45401120 0 0 0 8 39603 19231 27 5 59 8 0 4 0 0 736428 505192 45402868 0 0 0 38 24274 9095 25 5 67 3 0 3 0 0 971116 505192 45404452 0 0 0 0 17711 7613 19 4 72 5 0 6 5 0 1396360 505192 45406652 0 0 0 52 19798 8004 26 3 66 5 0 9 9 0 1512024 505192 45410908 0 0 0 0 23491 12507 23 3 66 7 0
There are couple of interesting clues that can be gleaned from the output above. First, there certainly is a spike in CPU for user processes. Much more interesting than that however are the "r" and "b" values. Values in the "r" column indicate the number of processes that were waiting to be run during the polling period. A general rule is that as load increases on the server, more processes get queued and have to wait their turn to get CPU cycles. Even more important are values in the "b" column that represent the number of processes that were "blocked" waiting for I/O during the last polling period. When we ran vmstat against all of our other servers (not exhibiting the bottlneck) the second column virtually always displayed a '0' as it should.
http://ronaldbradford.com/blog/identifying-resource-bottlenecks-cpu-2009-03-31/
http://www.dba-oracle.com/application_server/t_app_server_vmstat.htm
http://gurkulindia.com/main/2011/05/performance-monitoring-vmstat-to-find-performance-bottleneck/
http://khaidoan.wikidot.com/vmstat
http://www.linuxprogrammingblog.com/io-profiling
No comments:
Post a Comment