Evaluating the CPU I/O wait on Linux
Running top to check the I/O wait, I get these figures:
Cpu(s): 6.7%us, 1.4%sy, 1.2%ni, 85.5%id, 5.0%wa, 0.0%hi, 0.3%si, 0.0%st
Looking at these figures (%us ~= %wa), do they mean that:
- there are almost as many processes waiting as there are working? (=> bad)
- the working processes are waiting for 5.0% of their execution time? (=> OK in this case)
- something else
Solution 1:
You need to be careful when evaluating these figures.
- IOWait is related to disk activity, but not necessarily linearly correlated with it.
- The number of CPUs you have affects your percentage.
- A high IOWait (depending on your application) does not necessarily indicate a problem for you; conversely, a small IOWait may translate into a problem. It basically boils down to which task is doing the waiting.
IOWait in this context is the measure of time, over a given period, that a CPU (or all CPUs) spent idle because all runnable tasks were waiting for an IO operation to be fulfilled.
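For reference, the %wa figure that top shows is derived from the iowait counter in /proc/stat. As a minimal sketch (field order assumes a reasonably recent kernel), roughly the same calculation looks like this:

    # Sample the aggregate "cpu" line of /proc/stat twice and compute the
    # iowait percentage the way top's summary line does:
    # delta(iowait) / delta(total jiffies).
    read -r cpu u1 n1 s1 i1 w1 q1 sq1 st1 rest < /proc/stat
    sleep 5
    read -r cpu u2 n2 s2 i2 w2 q2 sq2 st2 rest < /proc/stat
    t1=$((u1 + n1 + s1 + i1 + w1 + q1 + sq1 + st1))
    t2=$((u2 + n2 + s2 + i2 + w2 + q2 + sq2 + st2))
    awk -v w=$((w2 - w1)) -v t=$((t2 - t1)) 'BEGIN { printf "iowait: %.1f%%\n", 100 * w / t }'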
In your example, if you have 20 CPUs and one task is really hammering the disk, that task is (in effect) spending 100% of its time in IOWait, and consequently the CPU it runs on spends almost 100% of its time in IOWait. However, if the 19 other CPUs are effectively idle and not using this disk, they report 0% IOWait. This averages out to an IOWait of 5%, when in fact, if you were to peek at your disk utilization, it could report 100%. If the application waiting on disk is critical to you, this 5% is misleading: the task at the bottleneck is likely suffering far worse than just running 5% slower.
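One quick way to tell whether an innocuous-looking average is really a single saturated CPU is to look at the per-CPU breakdown rather than the summary line, for example with mpstat from the sysstat package:

    # Print %iowait (among the other states) for every CPU over one
    # 5-second sample; a core stuck near 100% wait stands out even when
    # the all-CPU average is only ~5%.
    mpstat -P ALL 5 1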
there are almost as many processes waiting as there are working? (=> bad)
Probably. Remember that, for the most part, CPUs run tasks and tasks are what request IO. If two separate tasks are busy querying the same disk on two separate CPUs, this will put both CPUs at 100% IOWait (and, in the 20-CPU example, at a 10% overall average IOWait).
Basically, if you have a lot of tasks requesting IO, especially from the same disk, and that disk is 100% utilized (see iostat -mtx), then this is bad.
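As a sketch of that check, the extended iostat output reports per-device utilization and request wait times:

    # -m: megabytes, -t: timestamps, -x: extended per-device statistics.
    # Watch %util (how busy the device is) and the await columns (how long
    # requests queue) over repeated 5-second samples.
    iostat -mtx 5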
the working processes are waiting for 5.0% of their execution time? (=> OK in this case)
No. The working processes are almost certainly waiting for IO full-time. The percentage only looks small because of how the average is reported: either the other CPUs are not busy, or the CPU has many tasks to run of which most don't need to do IO.
As a general rule, on a multi-CPU system, an IOWait percentage roughly equal to 100 divided by the number of CPUs you have (one CPU's worth of full-time waiting; with 20 CPUs that is 5%) is probably something to investigate.
something else
See above. But note that applications which do very heavy writing get throttled (they stop using writeback and start writing directly to disk). This causes those tasks to produce high IOWait whilst other tasks on the same CPU writing to the same disk would not. So exceptions do exist.
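As a rough pointer, that throttling behaviour is governed by the kernel's dirty-page limits, which you can inspect with:

    # dirty_background_ratio: % of memory at which background writeback kicks in.
    # dirty_ratio: % of memory at which a heavy writer is made to block and
    # write out pages itself, which is when its IOWait climbs.
    sysctl vm.dirty_background_ratio vm.dirty_ratio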
Also note that if you have one CPU dedicated to running two tasks, one a heavy IO reader/writer and the other a heavy CPU user, then that CPU will report 50% IOWait. If you have 10 tasks like this, it would be 10% IOWait (and a horrific load), so the number can be reported much lower than what might actually be a problem.
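A hypothetical way to reproduce that kind of mix on a single core (the file name is made up, and this assumes the working directory sits on a real disk rather than tmpfs):

    # Pin a heavy direct-IO writer and a pure CPU burner to CPU 0, then watch
    # how that core's time is split between %usr and %iowait.
    taskset -c 0 dd if=/dev/zero of=./iowait-demo.tmp bs=1M count=2048 oflag=direct &
    IO_PID=$!
    taskset -c 0 sh -c 'while :; do :; done' &
    BURN_PID=$!
    mpstat -P 0 5 3                     # three 5-second samples of CPU 0 only
    kill "$IO_PID" "$BURN_PID" 2>/dev/null
    rm -f ./iowait-demo.tmp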
I think you really need to take a look at iostat -mtx to get some disk utilization metrics, and at pidstat -d to get some per-process metrics, then consider whether the applications hitting those disks in that way are likely to cause a problem, or whether other applications hitting those disks are likely to cause one.
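For the per-process side, a minimal invocation looks like this:

    # Per-process disk IO every 5 seconds: read/write throughput per task,
    # which makes it easy to trace a high %wa back to the process causing it.
    pidstat -d 5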
CPU metrics really act as indicators of underlying issues; they are general by nature, so understanding where they may be too general is a good thing.