Why are cgroups (blkio serviced bytes) and iotop producing diverging results?

The very end of the kernel documentation on the blkio controller includes this note:

What works

  • Currently only sync IO queues are support. All the buffered writes are still system wide and not per group. Hence we will not see service differentiation between buffered writes between groups.

Practically, this means that write operations will appear in blkio.throttle.io_service_bytes only if they bypass kernel buffering.
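To check what the counter actually holds, the "Write" rows of the file can be summed. This is a minimal sketch, assuming the cgroup v1 file layout in which each row reads "major:minor Operation bytes" and the file ends with a single "Total bytes" row; the function name blkio_write_bytes is made up for illustration.

```shell
# Sum the "Write" rows of a cgroup v1 blkio.throttle.io_service_bytes file.
# Assumed row layout: "8:0 Write 4096". The trailing "Total N" row has only
# two fields, so it never matches the $2 == "Write" filter.
blkio_write_bytes() {
    # printf "%.0f" avoids scientific notation for large byte counts.
    awk '$2 == "Write" { sum += $3 } END { printf "%.0f\n", sum }' "$1"
}
```

A call might look like blkio_write_bytes /sys/fs/cgroup/blkio/mygroup/blkio.throttle.io_service_bytes, where the mount point is the common cgroup v1 location and mygroup is a hypothetical group name.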

The fio tool illustrates this easily. Direct, unbuffered writes should be reported in blkio.throttle.io_service_bytes:

fio --name wxyz --direct=1 --buffered=0 --size=1g --time_based --runtime=120s --bs=4k --rw=write --ioengine=sync --numjobs=1 

With the opposite direct and buffered options, nothing is reported in blkio.throttle.io_service_bytes, because the writes pass through the kernel buffer cache and are scheduled later:

fio --name wxyz --direct=0 --buffered=1 --size=1g --time_based --runtime=120s --bs=4k --rw=write --ioengine=sync --numjobs=1
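To compare the two fio runs above, one option is to read the counter before and after each run and print the difference. The following is a rough sketch, not a definitive method: blkio_write_delta is a made-up helper name, it assumes the cgroup v1 row layout "major:minor Operation bytes", and it assumes the calling shell is already a member of the cgroup being measured.

```shell
# Run a command and print how many bytes the cgroup's "Write" counters grew.
# $1 is the path to blkio.throttle.io_service_bytes; the rest is the command.
# Assumption: the calling shell already belongs to the cgroup being measured.
blkio_write_delta() {
    stats_file=$1
    shift
    before=$(awk '$2 == "Write" { s += $3 } END { printf "%.0f\n", s }' "$stats_file")
    # Discard the command's own output so only the delta is printed.
    "$@" >/dev/null
    after=$(awk '$2 == "Write" { s += $3 } END { printf "%.0f\n", s }' "$stats_file")
    echo $((after - before))
}
```

Passing the stats file followed by the full fio command line measures that run. Going by the explanation above, the --direct=1 run should show a delta close to the amount written, while the --direct=0 run should show little or nothing.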

Additionally, this thread with a Red Hat engineer who works on cgroups reiterates the point: once a write has passed into the write cache inside the kernel, "Due to this extra layer of cache, we lose the context information by the time IO reaches the device." As a result, blkio cannot account for it.