How to determine what's causing my server's load average to jump to 90
Alrighty, I'm at a complete loss here. I've had this Ubuntu server running for about three years now. In the last couple of months it started behaving oddly, and it's only getting worse. It's a pretty busy server running around 15 websites and a number of other tools. Its typical 15-minute load average is 0.3. However, it's been spiking to around 90 about every 12 hours or so.
I'm certain that it has something to do with MySQL: the server somehow gets locked up and Apache just piles up waiting for things to open. Here is a top from when things are going crazy:
Tasks: 143 total, 20 running, 123 sleeping, 0 stopped, 0 zombie
Cpu(s): 34.3%us, 62.9%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.2%hi, 2.6%si, 0.0%st
Mem: 2061444k total, 911460k used, 1149984k free, 11156k buffers
Swap: 1421712k total, 0k used, 1421712k free, 126728k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1080 mysql 20 0 397m 59m 5892 S 18 3.0 0:37.37 mysqld
1602 www-data 20 0 198m 26m 4948 R 7 1.3 0:08.17 apache2
1725 www-data 20 0 189m 24m 11m R 7 1.2 0:04.33 apache2
1719 www-data 20 0 189m 25m 12m R 7 1.2 0:03.88 apache2
1802 www-data 20 0 192m 20m 4808 S 7 1.0 0:03.15 apache2
1521 www-data 20 0 199m 28m 6912 R 6 1.4 0:10.15 apache2
1530 www-data 20 0 193m 22m 5104 S 5 1.1 0:06.53 apache2
1536 www-data 20 0 196m 25m 4936 R 5 1.2 0:07.93 apache2
1583 www-data 20 0 186m 21m 11m R 5 1.0 0:03.46 apache2
1722 www-data 20 0 193m 21m 4956 R 5 1.1 0:04.91 apache2
1906 www-data 20 0 182m 12m 6724 S 5 0.6 0:00.61 apache2
1439 root 20 0 92040 3672 2280 S 5 0.2 0:08.04 ezproxy
1539 www-data 20 0 194m 27m 9548 R 4 1.3 0:08.08 apache2
1716 www-data 20 0 187m 22m 11m R 4 1.1 0:03.36 apache2
1891 www-data 20 0 183m 18m 11m S 4 0.9 0:00.61 apache2
1498 www-data 20 0 194m 23m 6264 S 4 1.2 0:11.47 apache2
1517 www-data 20 0 193m 22m 5212 R 4 1.1 0:06.56 apache2
1523 www-data 20 0 190m 26m 12m S 3 1.3 0:07.61 apache2
1761 www-data 20 0 186m 20m 10m R 2 1.0 0:02.66 apache2
1779 www-data 20 0 184m 19m 10m R 2 0.9 0:02.69 apache2
1711 www-data 20 0 185m 20m 11m R 2 1.0 0:03.32 apache2
1728 www-data 20 0 182m 11m 5028 R 2 0.6 0:01.14 apache2
1819 www-data 20 0 181m 8120 3332 S 2 0.4 0:00.49 apache2
1886 www-data 20 0 182m 11m 6364 S 2 0.6 0:01.18 apache2
1899 www-data 20 0 184m 18m 10m S 2 0.9 0:01.38 apache2
1497 www-data 20 0 191m 27m 12m S 1 1.4 0:07.84 apache2
1766 www-data 20 0 181m 10m 5016 R 1 0.5 0:01.39 apache2
1871 www-data 20 0 184m 19m 11m R 1 1.0 0:00.98 apache2
1563 www-data 20 0 186m 23m 13m S 1 1.2 0:07.37 apache2
1865 www-data 20 0 184m 18m 10m S 1 0.9 0:01.56 apache2
1494 www-data 20 0 193m 25m 8352 S 1 1.3 0:12.07 apache2
1512 www-data 20 0 186m 23m 13m R 1 1.1 0:06.10 apache2
1526 www-data 20 0 186m 24m 13m R 1 1.2 0:06.30 apache2
1816 www-data 20 0 184m 18m 10m S 1 0.9 0:01.60 apache2
1516 www-data 20 0 184m 19m 11m S 1 1.0 0:04.12 apache2
Right now things are running calmly. Here's the current MySQL status:
Uptime: 241264 Threads: 1 Questions: 1870412 Slow queries: 1354 Opens: 13818 Flush tables: 1 Open tables: 256 Queries per second avg: 7.752
Here are all of my DB sizes, in MB:
name1 14.78335094
name2 11.08541870
name3 31.01449203
name4 6.24377346
name5 0.36655807
name6 10.95312500
information_schema 0.00781250
mysql 0.60296535
name7 2.19595051
name8 1.82343006
name9 20.51372623
name0 59.42693043
I checked the slow query log, but when the lockup happens every query gets dumped into the slow query log. I haven't been on the server when it happens to run a processlist. Is there anything else I can do besides that?
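The best idea I've had so far is a little cron watchdog to grab the processlist for me, something along these lines (untested sketch; the threshold and log path are placeholders, and credentials are assumed to come from ~/.my.cnf):

#!/bin/sh
# Sketch: run from cron every minute. When the 1-minute load average
# crosses the threshold, append a timestamped processlist to a log
# so there's something to look at after the fact.
THRESHOLD=20
LOG=/var/log/load-spike.log
LOAD=$(cut -d ' ' -f 1 /proc/loadavg | cut -d '.' -f 1)
if [ "$LOAD" -ge "$THRESHOLD" ]; then
    date >> "$LOG"
    mysqladmin processlist >> "$LOG"
fi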
Update: Here is the output from the tuning-primer.sh script: https://gist.github.com/913565
Update: Here is an iostat during a freakout:
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 5.25 6.05 106.35 3090763 54314928
And a vmstat 3: https://gist.github.com/913565#file_vmstat%203
Now with more SAR! https://gist.github.com/913565#file_sar
Thanks for the help.
Try installing sar and running it in the background. You may have disk load which is spiking. sar will let you see which resources have the heaviest loads when things go wrong like this.
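A minimal sketch of that setup on Ubuntu (assuming the sysstat package and its stock config path):

# sar ships in the sysstat package
sudo apt-get install sysstat
# turn on the periodic collector
sudo sed -i 's/ENABLED="false"/ENABLED="true"/' /etc/default/sysstat
sudo /etc/init.d/sysstat restart
# once data has accumulated: -q shows load/run queue, -b overall I/O, -d per-device I/O
sar -q
sar -b
sar -d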
Your high sys load may indicate that you have a lot of I/O happening. This may be a result of natural growth of the database. Do you have an archiving process in place to remove old data from the databases? If not, you will reach a point where the data required for table scans no longer fits in memory. When that happens, performance will tank suddenly and significantly. The slow query log may include some queries which can be improved by adding an index.
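As a sketch of how you might chase that down (the log path is the Debian/Ubuntu default; the table and column names are hypothetical):

# summarize the slow log, worst total time first
mysqldumpslow -s t /var/log/mysql/mysql-slow.log | head -20

-- then, in the mysql client, EXPLAIN the worst offenders; "ALL" in the
-- type column means a full table scan, which an index may fix
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
CREATE INDEX idx_customer_id ON orders (customer_id);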
If you have another system that you can run munin on, you may want to install munin-node on this server. This will give you graphical output of some of the data available from sar. Check the graphs every so often to see if things are changing.
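A sketch of the node side on Ubuntu (the master's address, 192.0.2.10, is a placeholder):

sudo apt-get install munin-node
# let the munin master poll this node (append an allow rule)
echo 'allow ^192\.0\.2\.10$' | sudo tee -a /etc/munin/munin-node.conf
sudo /etc/init.d/munin-node restart
# on the master, list this host in /etc/munin/munin.conf:
# [myserver.example.com]
#     address 192.0.2.203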
EDIT: It looks like you may have a memory leak in some code running under Apache. Try setting MaxRequestsPerChild to around 100 and restarting Apache. If that fixes your problem, try to find your memory leak.
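On a stock Ubuntu Apache with the prefork MPM, that would look roughly like this in apache2.conf (only MaxRequestsPerChild is the actual suggestion; the other values are illustrative, though MaxClients is worth capping anyway with 2 GB of RAM and ~20 MB resident per child):

# prefork MPM section of /etc/apache2/apache2.conf
<IfModule mpm_prefork_module>
    StartServers          5
    MinSpareServers       5
    MaxSpareServers      10
    MaxClients           40
    MaxRequestsPerChild 100
</IfModule>
# then: sudo /etc/init.d/apache2 restart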
Your database sizes are in MB, right? That is fairly small and should fit almost entirely within the configured amount of memory, so I don't think MySQL is the problem here. Could you please post the output of MySQL Tuning Primer anyway? In addition, you should definitely use something like munin/cacti/... to graph and collect data about your system. What kind of software is running on your machine? PHP stuff? Are you already using an opcode cache like APC?
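If it is PHP, a sketch of enabling APC on an Ubuntu box of that vintage (assuming the php-apc package is in your repos):

sudo apt-get install php-apc
sudo /etc/init.d/apache2 restart
# confirm the extension is loaded
php -r 'var_dump(extension_loaded("apc"));'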