Linux IO monitoring per file?

I am interested in a utility or process for monitoring disk IO per file on CentOS.

On Win2008, the resmon utility allows this type of drilldown, but none of the Linux utilities I have found do this (iostat, iotop, dstat, nmon).

My interest in monitoring IO bottlenecks on database servers. With MSSQL, I have found it an informative diagnostic to know which files / filespaces are getting hit the hardest.


SystemTap is probably your best option.

Here is how the output from the iotime.stp example looks like:

825946 3364 (NetworkManager) access /sys/class/net/eth0/carrier read: 8190 write: 0
825955 3364 (NetworkManager) iotime /sys/class/net/eth0/carrier time: 9
[...]
117061 2460 (pcscd) access /dev/bus/usb/003/001 read: 43 write: 0
117065 2460 (pcscd) iotime /dev/bus/usb/003/001 time: 7
[...]
3973737 2886 (sendmail) access /proc/loadavg read: 4096 write: 0
3973744 2886 (sendmail) iotime /proc/loadavg time: 11

The disadvantage (aside from the learning curve) is that you will need to install kernel-debug, which may not be possible on a production system. However, you can resort to cross-instrumentation where you compile a module on a development system, and run that .ko on the production system.

Or if you are impatient, look at Chapter 4. Useful SystemTap Scripts from the beginners guide.


SystemTap* script:

#!/usr/bin/env stap
#
# Monitor the I/O of a program.
#
# Usage:
#   ./monitor-io.stp name-of-the-program

global program_name = @1

probe begin {
  printf("%5s %1s %6s %7s %s\n",
         "PID", "D", "BYTES", "us", "FILE")
}

probe vfs.read.return, vfs.write.return {
  # skip other programs
  if (execname() != program_name)
    next

  if (devname=="N/A")
    next

  time_delta = gettimeofday_us() - @entry(gettimeofday_us())
  direction = name == "vfs.read" ? "R" : "W"

  # If you want only the filename, use
  // filename = kernel_string($file->f_path->dentry->d_name->name)
  # If you want only the path from the mountpoint, use
  // filename = devname . "," . reverse_path_walk($file->f_path->dentry)
  # If you want the full path, use
  filename = task_dentry_path(task_current(),
                              $file->f_path->dentry,
                              $file->f_path->mnt)

  printf("%5d %1s %6d %7d %s\n",
         pid(), direction, $return, time_delta, filename)
}

The output looks like this:

[root@sl6 ~]# ./monitor-io.stp cat
PID D  BYTES      us FILE
3117 R    224       2 /lib/ld-2.12.so
3117 R    512       3 /lib/libc-2.12.so
3117 R  17989     700 /usr/share/doc/grub-0.97/COPYING
3117 R      0       3 /usr/share/doc/grub-0.97/COPYING

Or if you choose to display only the path from the mountpoint:

[root@f19 ~]# ./monitor-io.stp cat
  PID D  BYTES      us FILE
26207 R    392       4 vda3,usr/lib64/ld-2.17.so
26207 R    832       3 vda3,usr/lib64/libc-2.17.so
26207 R   1758       4 vda3,etc/passwd
26207 R      0       1 vda3,etc/passwd
26208 R    392       3 vda3,usr/lib64/ld-2.17.so
26208 R    832       3 vda3,usr/lib64/libc-2.17.so
26208 R  35147      16 vdb7,ciupicri/doc/grub2-2.00/COPYING
26208 R      0       2 vdb7,ciupicri/doc/grub2-2.00/COPYING

[root@f19 ~]# mount | grep -E '(vda3|vdb7)'
/dev/vda3 on / type ext4 (rw,relatime,seclabel,data=ordered)
/dev/vdb7 on /mnt/mnt1/mnt11/data type xfs (rw,relatime,seclabel,attr2,inode64,noquota)

Limitations/bugs:

  • mmap based I/O doesn't show up because devname is "N/A"
  • files on tmpfs don't show up because devname is "N/A"
  • it doesn't matter if the reads are from the cache or the writes are to the buffer

The results for Matthew Ife's programs:

  • for mmaptest private:

     PID D  BYTES      us FILE
    3140 R    392       5 vda3,usr/lib64/ld-2.17.so
    3140 R    832       5 vda3,usr/lib64/libc-2.17.so
    3140 W     23       9 N/A,3
    3140 W     23       4 N/A,3
    3140 W     17       3 N/A,3
    3140 W     17     118 N/A,3
    3140 W     17     125 N/A,3
    
  • for mmaptest shared:

     PID D  BYTES      us FILE
    3168 R    392       3 vda3,usr/lib64/ld-2.17.so
    3168 R    832       3 vda3,usr/lib64/libc-2.17.so
    3168 W     23       7 N/A,3
    3168 W     23       2 N/A,3
    3168 W     17       2 N/A,3
    3168 W     17      98 N/A,3
    
  • for diotest (direct I/O):

     PID D  BYTES      us FILE
    3178 R    392       2 vda3,usr/lib64/ld-2.17.so
    3178 R    832       3 vda3,usr/lib64/libc-2.17.so
    3178 W     16       6 N/A,3
    3178 W 1048576     509 vda3,var/tmp/test_dio.dat
    3178 R 1048576     244 vda3,var/tmp/test_dio.dat
    3178 W     16      25 N/A,3
    

*Quick setup instructions for RHEL 6 or equivalent: yum install systemtap and debuginfo-install kernel


You'd actually want to use blktrace for this. See Visualizing Linux IO with Seekwatcher and blktrace.

I'll see if I can post one of my examples soon.


Edit:

You don't mention distribution of Linux, but maybe this is a good case for a dtrace script on Linux or even System Tap, if you're using a RHEL-like system.