Interpreting read, write and total IO time in /proc/diskstats

I've noticed that when I look at the output of /proc/diskstats there is a discrepancy: the total time spent reading plus the total time spent writing does not match the total time spent performing IO. For example, I saw an entry in /proc/diskstats that was:

$ cat /proc/diskstats
...
8       0 sda 944150584 590524 235547588959 780672196 833280352 534699043 322507689696 3472000824 1 812190100 4246357772
...

According to the documentation at https://www.kernel.org/doc/Documentation/iostats.txt,

Field 4 -- # of milliseconds spent reading This is the total number of milliseconds spent by all reads (as measured from __make_request() to end_that_request_last()).

Field 8 -- # of milliseconds spent writing This is the total number of milliseconds spent by all writes (as measured from __make_request() to end_that_request_last()).

Field 10 -- # of milliseconds spent doing I/Os This field increases so long as field 9 is nonzero.

Consequently, I would expect the tenth field to be the sum of the fourth and eighth fields, since I would expect the total IO time to be the sum of the time spent reading and the time spent writing. However, I've never observed this to be the case; what's more, the sum of the fourth and eighth fields has always been greater than the tenth (for example, in the line above: 780672196 + 3472000824 - 812190100 = 3440482920). I was wondering if anyone could explain why these numbers are different — it seems the tenth field is trying to capture something other than just the sum of the fourth and eighth fields.
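For reference, the arithmetic above can be checked with a quick Python sketch that parses the quoted line (the field indices follow the 1-based numbering of iostats.txt, after the major/minor/device-name columns; the helper name is mine, not a standard API):

```python
# Parse one /proc/diskstats line into its 11 numeric fields.
line = ("8 0 sda 944150584 590524 235547588959 780672196 "
        "833280352 534699043 322507689696 3472000824 1 812190100 4246357772")

def parse_diskstats_line(line):
    parts = line.split()
    name = parts[2]                      # device name, e.g. "sda"
    fields = [int(x) for x in parts[3:]]  # fields[i] is field i+1 in iostats.txt
    return name, fields

name, f = parse_diskstats_line(line)
read_ms, write_ms, io_ms = f[3], f[7], f[9]  # fields #4, #8, #10
print(read_ms + write_ms - io_ms)  # 3440482920: the "excess" over field #10
```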


I've not looked at the source code, but it seems the difference stems from two different accounting modes.

Fields #4 and #8 sum the time each request takes to complete. This means that requests issued in parallel each still contribute to the count.

Field #10 counts only the actual time the queue and disk were busy, so requests issued in parallel are counted only once.
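The two accounting modes can be illustrated with a toy model (this is not kernel code — just overlapping time intervals, one per in-flight request):

```python
# Each request is an (issue_ms, completion_ms) interval; these three overlap.
requests = [(0, 100), (10, 100), (20, 100)]

# Fields #4/#8 style: sum each request's own duration; overlap counts repeatedly.
per_request_sum = sum(end - start for start, end in requests)

# Field #10 style: time during which at least one request was in flight,
# i.e. the measure of the union of the intervals.
intervals = sorted(requests)
busy = 0
cur_start, cur_end = intervals[0]
for start, end in intervals[1:]:
    if start > cur_end:              # gap: the queue went idle
        busy += cur_end - cur_start
        cur_start, cur_end = start, end
    else:                            # overlap: extend the current busy period
        cur_end = max(cur_end, end)
busy += cur_end - cur_start

print(per_request_sum)  # 270 ms  (100 + 90 + 80, overlap triple-counted)
print(busy)             # 100 ms  (wall-clock busy time)
```

Three requests that overlap almost completely yield 270 ms of summed per-request time but only 100 ms of busy time — exactly the kind of gap seen between fields #4/#8 and field #10.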

Let's try a practical example. On my /boot partition, I dd a ~43 MB file. Have a look at the stats:

[root@localhost boot]# cat /proc/diskstats | grep sda1
   8       1 sda1 46256 0 255703 19332 2063 0 4162 538 0 11207 19862
[root@localhost boot]# dd if=initramfs-0-rescue-7dc32e3935ba4ce1ae50a0a8170e4480.img of=/dev/null
84099+1 records in
84099+1 records out
43058701 bytes (43 MB) copied, 0.347783 s, 124 MB/s
[root@localhost boot]# cat /proc/diskstats | grep sda1
   8       1 sda1 46342 0 339807 23011 2063 0 4162 538 0 11551 23540
[root@localhost boot]#

Reading the file required ~0.35s, or ~350ms. However, counters #4 and #10 react in very different ways: the former increases by about 4000, while the latter by only about 350. It's easy to see which has the "correct" value: it's field #10, as we know from dd that the entire operation took about 350ms, and field #10 increased by that very same amount.
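The deltas can be verified by diffing the two snapshots above with a small Python sketch (the parsing follows the iostats.txt field order; sector size assumed to be the standard 512 bytes):

```python
# The two sda1 lines captured before and after the dd run.
before = "8 1 sda1 46256 0 255703 19332 2063 0 4162 538 0 11207 19862"
after  = "8 1 sda1 46342 0 339807 23011 2063 0 4162 538 0 11551 23540"

def fields(line):
    # Skip major, minor, device name; keep the 11 counters.
    return [int(x) for x in line.split()[3:]]

delta = [y - x for x, y in zip(fields(before), fields(after))]

print("field #4  (ms reading):",   delta[3])  # 3679 -> the ~4000ms jump
print("field #10 (ms doing I/O):", delta[9])  # 344  -> matches dd's ~350ms
print("sectors read:", delta[2])              # 84104 * 512B = 41 MiB, i.e.
                                              # the ~43 MB that dd reported
```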

So why did field #4 increase so much, and what does it really measure?

First, let's understand what's happening at the request level. dd uses, by default, 512B reads, but the Linux pagecache works at a granularity of 4KB, so for our ~43 MB file we should expect about 10000 4KB requests. These requests are put in a queue and issued one by one (for simplicity, let's pretend NCQ does not exist) and dispatched to the disk. As mechanical disks are quite good at sequential reads, they generally use a read-ahead policy — i.e. they read more data than required. This means that, after the first 4K request completes, all subsequent requests are served in a very short time.

Let's do some math, with a wild simplification:

  • number of requests: 10000 (all requests enter the queue at the same time)
  • time for the first request to complete: 0.4ms
  • time for all subsequent requests to complete: 0ms (yeah, I told you this is a wild simplification)

Total request time as measured by field #4: 10000 in-queue requests * 0.4ms each == 4000ms. And this is roughly the value by which field #4 increased...
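The simplified model is just a multiplication, but it can be sketched as a tiny calculation. The request count and per-request time are illustrative assumptions; only their product is pinned down by the observed ~4000ms growth of field #4:

```python
# Toy model: n requests all enter the queue at t=0 and, thanks to read-ahead,
# all complete ~0.4ms later. Times kept in microseconds to stay in integers.
n_requests = 10_000
per_request_us = 400                          # 0.4ms from issue to completion

queued_at = [0] * n_requests                  # every request queued at t=0
completed_at = [per_request_us] * n_requests  # all done almost together

# Field #4 style: sum the lifetime of every individual request.
field4_ms = sum(c - q for c, q in zip(completed_at, queued_at)) / 1000
# Field #10 style: wall-clock time from first issue to last completion.
field10_ms = (max(completed_at) - min(queued_at)) / 1000

print(field4_ms)   # 4000.0 -> per-request times summed
print(field10_ms)  # 0.4    -> wall-clock busy time
```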

Bottom line:

  • fields #4 and #8 measure the time taken to serve I/O requests from the point of view of each single request, summed over all requests — so requests in flight at the same time each contribute their full duration
  • field #10 tells you the real elapsed ("wall clock") time spent in the I/O path

To draw an approximate parallel: think of a multi-core CPU. Two processes can concurrently hammer the CPU, each executing for 60s. The total CPU time used is 120s (60s * 2), but the real elapsed clock time remains 60s in total, as the two processes run concurrently.