wa (waiting for I/O) from the top command is high

I have a forum with a lot of visitors. Some days the load increases to around 40 without any increase in the number of visitors. As you can see from the output below, the I/O wait time is high (57%). How do I find the reason for that?
The server software is Apache, MySQL and PHP.

root@server:~# top
top - 13:22:08 up 283 days, 22:06,  1 user,  load average: 13.84, 24.75, 22.79
Tasks: 333 total,   1 running, 331 sleeping,   0 stopped,   1 zombie
Cpu(s): 20.6%us,  7.9%sy,  0.0%ni, 13.4%id, 57.1%wa,  0.1%hi,  0.9%si,  0.0%st
Mem:   4053180k total,  3868680k used,   184500k free,   136380k buffers
Swap:  9936160k total,    12144k used,  9924016k free,  2166552k cached

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
23930 mysql     20   0  549m 122m 6580 S   90  3.1   4449:04 mysqld
17422 www-data  20   0  223m  20m  10m S    2  0.5   0:00.21 apache2
17555 www-data  20   0  222m  19m 9968 S    2  0.5   0:00.13 apache2
17264 www-data  20   0  225m  19m 8972 S    1  0.5   0:00.17 apache2
17251 www-data  20   0  220m  12m 4912 S    1  0.3   0:00.12 apache2


root@server:~# top
top - 13:39:59 up 283 days, 22:24,  1 user,  load average: 6.66, 10.39, 13.95
Tasks: 318 total,   1 running, 317 sleeping,   0 stopped,   0 zombie
Cpu(s): 13.6%us,  4.2%sy,  0.0%ni, 40.5%id, 40.6%wa,  0.2%hi,  0.8%si,  0.0%st
Mem:   4053180k total,  4010992k used,    42188k free,   119544k buffers
Swap:  9936160k total,    12160k used,  9924000k free,  2290716k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
23930 mysql     20   0  549m 122m 6580 S   44  3.1   4457:30 mysqld
19946 www-data  20   0  223m  21m  10m S    5  0.6   0:00.77 apache2
17316 www-data  20   0  226m  23m  11m S    1  0.6   0:01.76 apache2
17333 www-data  20   0  222m  21m  11m S    1  0.5   0:01.55 apache2
18212 www-data  20   0  225m  22m  11m S    1  0.6   0:01.58 apache2
19528 www-data  20   0  220m  13m 5480 S    1  0.3   0:00.63 apache2
19600 www-data  20   0  224m  20m  11m S    1  0.5   0:00.73 apache2
19942 www-data  20   0  225m  21m  10m S    1  0.5   0:00.82 apache2
20232 www-data  20   0  222m  16m 8760 S    1  0.4   0:00.65 apache2
20243 www-data  20   0  223m  21m  11m S    1  0.5   0:00.57 apache2
20299 www-data  20   0  225m  20m   9m S    1  0.5   0:00.67 apache2
20441 www-data  20   0  225m  21m  10m S    1  0.5   0:00.57 apache2
21201 www-data  20   0  220m  12m 5148 S    1  0.3   0:00.19 apache2
21362 www-data  20   0  220m  12m 5032 S    1  0.3   0:00.17 apache2
21364 www-data  20   0  220m  12m 4916 S    1  0.3   0:00.14 apache2
21366 www-data  20   0  220m  12m 5124 S    1  0.3   0:00.22 apache2
21373 www-data  20   0  222m  14m 7060 S    1  0.4   0:00.26 apache2

Solution 1:

Here are a few tools to find disk activity:

  • iotop
  • vmstat 1
  • iostat 1
  • lsof
  • strace -e trace=open <application>
  • strace -e trace=open -p <pid>
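For example, running vmstat at a one-second interval shows at a glance whether the machine is blocked on I/O; the columns to watch are b (processes blocked waiting for I/O) and wa (CPU time spent in I/O wait):

```shell
# Print system-wide statistics once a second, five times.
# High numbers in the "b" (blocked) and "wa" (I/O wait) columns
# point at a disk bottleneck rather than a CPU one.
vmstat 1 5
```

Similarly, iostat -x 1 adds per-device await and %util figures, which tell you which disk is saturated.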

In ps auxf you'll also see which processes are in uninterruptible disk sleep (state D) because they are waiting for I/O.
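A quick way to catch those, as a sketch (the field names are standard ps format specifiers):

```shell
# List processes currently in uninterruptible sleep (state "D").
# Run it a few times in a row: PIDs that show up repeatedly are the
# ones stuck waiting for disk I/O.
ps -eo state,pid,user,comm | awk '$1 == "D"'
```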

Some days the load increases to 40 without any increase in the number of visitors.

You may also want to create a backup and check whether the hard drive is slowly failing. A hard drive generally starts to slow down before it dies completely, and that could also explain the high load.
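If you have smartmontools installed, the drive's SMART data gives an early warning. A minimal check, assuming the disk is /dev/sda (substitute your actual device):

```shell
# Overall drive health self-assessment ("PASSED" or "FAILED")
smartctl -H /dev/sda

# Attributes that typically creep up on a dying drive:
# reallocated, pending and uncorrectable sectors
smartctl -A /dev/sda | grep -Ei 'reallocat|pending|uncorrect'
```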

Solution 2:

The output from top suggests that the DBMS is experiencing most of the I/O waits, so database tuning issues are obvious candidates to investigate.

I/O waiting on a database server - particularly on load spikes - is a clue that your DBMS might be either disk bound (i.e. you need a faster disk subsystem) or it might have a tuning issue. You should probably also look into profiling your database server - i.e. get a trace of what it's doing and what queries are taking the time.
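With MySQL the usual first step is the slow query log. A sketch, assuming MySQL 5.1 or later (on older versions these settings go in my.cnf under [mysqld] rather than SET GLOBAL) and credentials with the SUPER privilege; the log path shown is a common default, so check SHOW VARIABLES LIKE 'slow_query_log_file' for yours:

```shell
# Log every query that takes longer than one second
mysql -e "SET GLOBAL slow_query_log = 1;"
mysql -e "SET GLOBAL long_query_time = 1;"

# Once some traffic has been captured, summarise the log by query
# pattern, sorted by total time
mysqldumpslow -s t /var/lib/mysql/*-slow.log
```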

Some starting points for diagnosing database tuning issues:

  • Find the queries that take up the most time, and look at the query plans. See if any have odd query plans such as a table scan where it shouldn't be. Maybe the database needs an index added.

  • Long resource wait times may mean that some key resource pool needs to be expanded.

  • Long I/O wait times may mean that you need a faster disk subsystem.

  • Are your log and data volumes on separate drives? Database logs have a lot of small sequential writes (essentially they behave like a ring buffer). If you have a busy random access workload sharing the same disks as your logs, this will disproportionately affect the throughput of the logging. For a database transaction to commit, the log entries must be written out to disk, so this places a bottleneck on the whole system.

    Note that some MySQL storage engines don't use logs so this may not be an issue in your case.
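If logging does apply to your setup, a quick way to check the log/data separation point above (the paths are Debian-style defaults -- substitute your own):

```shell
# df shows which filesystem/device backs each path.
# If both lines report the same device, random data I/O and the
# sequential log writes are competing for the same disk.
df -h /var/lib/mysql /var/log/mysql
```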

Footnote: Queuing systems

Queuing systems (a statistical model for throughput) get hyperbolically slower as the system approaches saturation. As a high-level approximation (a simple M/M/1 queue), the time a request spends in the system stretches by a factor of 1/(1 − ρ), where ρ is the utilisation: at 50% saturation a request takes on average twice as long as on an idle system, at 90% saturation ten times as long, and at 99% saturation a hundred times as long.

Thus, on a system that is close to saturation, small changes in load can result in large changes to wait times, in this case manifesting as time spent waiting on I/O. If the I/O capacity of your disk subsystem is nearly saturated then small changes in load can result in significant changes in response times.
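These figures follow directly from the 1/(1 − ρ) stretch factor of a simple M/M/1 queue; a quick sanity check:

```shell
# Response-time stretch factor 1/(1 - rho) for an M/M/1 queue:
# how much longer a request takes compared with an idle system.
for rho in 0.50 0.90 0.99; do
    awk -v r="$rho" 'BEGIN { printf "%2.0f%% saturated -> factor %3.0f\n", r * 100, 1 / (1 - r) }'
done
```

Note how the factor roughly doubles between 50% and 90%, but explodes in the last few percent; that is why load can jump from comfortable to 40 with only a small change in traffic.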