High I/O latency with software RAID, LUKS-encrypted and LVM-partitioned KVM setup

I found out that the performance problems with a Mumble server, which I described in a previous question, are caused by an I/O latency problem of unknown origin. As I have no idea what is causing this or how to debug it further, I'm asking for your ideas on the topic.

I'm running a Hetzner EX4S root server as a KVM hypervisor. The server runs Debian Wheezy Beta 4, and KVM virtualisation is managed through libvirt.

The server has two different 3 TB hard drives, as one of the drives was replaced after S.M.A.R.T. errors were reported. The first hard disk is a Seagate Barracuda XT ST33000651AS (512-byte logical, 4096-byte physical sector size), the other a Seagate Barracuda 7200.14 (AF) ST3000DM001-9YN166 (512-byte logical and physical sector size). There are two Linux software RAID1 devices, both spanning the two drives: one for the unencrypted boot partition and one as a container for the encrypted remainder.

Inside the latter RAID device lies an AES-encrypted LUKS container. Inside the LUKS container there is an LVM physical volume. The hypervisor's file system is split across three logical volumes on that physical volume: one for /, one for /home and one for swap.

Here is a diagram of the block device configuration stack:

sda (Physical HDD)
- md0 (RAID1)
- md1 (RAID1)

sdb (Physical HDD)
- md0 (RAID1)
- md1 (RAID1)

md0 (Boot RAID)
- ext4 (/boot)

md1 (Data RAID)
- LUKS container
  - LVM Physical volume
    - LVM volume hypervisor-root
    - LVM volume hypervisor-home
    - LVM volume hypervisor-swap
    - … (Virtual machine volumes)

The guest systems (virtual machines) are mostly running Debian Wheezy Beta 4 as well; there is one additional Ubuntu Precise instance. Their block devices also come from the LVM physical volume. The volumes are accessed through Virtio drivers in native writethrough mode. The I/O scheduler (elevator) on both the hypervisor and the guest systems is set to deadline instead of the default cfq, as that proved to be the best-performing setup in our bonnie++ test series.
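For reference, this is roughly how the elevator can be checked and changed at runtime (sda is just an example device name; adjust to your setup):

```shell
# Show the active I/O scheduler for a device; the bracketed
# entry is the one currently in use.
cat /sys/block/sda/queue/scheduler
# Example output: noop [deadline] cfq

# Switch to the deadline elevator at runtime (not persistent
# across reboots).
echo deadline > /sys/block/sda/queue/scheduler
```

To make the choice persistent, the usual approach on Debian is adding "elevator=deadline" to the kernel command line in /etc/default/grub and running update-grub.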

The I/O latency problem is experienced not only inside the guest systems but also affects services running on the hypervisor itself. The setup seems complex, but I'm sure the basic structure is not what causes the latency problems, as my previous server ran for four years with almost the same basic setup without any of these performance problems.

On the old setup the following things were different:

  • Debian Lenny was the OS for both hypervisor and almost all guests
  • Xen virtualisation (and therefore no Virtio)
  • no LibVirt management
  • Different hard drives, each 1.5TB in size (one of them was a Seagate Barracuda 7200.11 ST31500341AS, the other one I can't tell anymore)
  • We had no IPv6 connectivity
  • Neither the hypervisor nor the guests had noticeable I/O latency problems

According to the datasheets, both the current hard drives and the one in the old machine have an average latency of 4.12 ms.


Solution 1:

A 7200 RPM SATA drive can't do 4.12 ms latency; that would enable it to do 1/4.12 ms (roughly 240) I/Os per second, which is not realistic. The 4.12 ms figure in the datasheet is only the average rotational latency (half a revolution at 7200 RPM), not the full per-I/O service time.

The proper formula to calculate IOPS for a single disk is 1/(avg_seek_time + avg_rotational_latency), which for 7200 RPM drives works out to roughly 75 IOPS. If you have a spec sheet for the disk you will find two latencies, as drives absorb writes and reads with different latencies, but they are within ±10% of each other.
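As a back-of-the-envelope check, the formula can be evaluated directly. Note that the 9 ms average seek time below is an assumed typical value for a 7200 RPM desktop drive, not taken from your datasheet:

```shell
# IOPS = 1 / (avg_seek_time + avg_rotational_latency)
# Rotational latency at 7200 RPM: half a revolution = (60000 ms / 7200) / 2
awk 'BEGIN {
    seek = 9.0                 # assumed average seek time in ms
    rot  = 60000 / 7200 / 2    # average rotational latency in ms (~4.17)
    printf "%.0f IOPS\n", 1000 / (seek + rot)
}'
# Prints roughly 76 IOPS
```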

You can expect a latency of 13-15 ms per I/O from a SATA disk as long as your queue depth is not too high. Everything between 10 and 15 ms would be considered OK; 20 ms would hint at latency issues caused by deep queues (or very large I/O request sizes), and 30 ms or higher would point to something pathological. Theoretically speaking, your 95th percentile should be below 15 ms for the system to behave "normally".

Can you provide a measurement of the average service time on the host and in a guest while running your production workload? You can get this value from the "await" column of iostat's extended output.
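One way to collect that, assuming the sysstat package is installed (the device names are just examples taken from your stack):

```shell
# Extended per-device statistics, refreshed every 5 seconds.
# The "await" column is the average time (in ms) a request spends
# queued plus being serviced; run this on the hypervisor and
# inside a guest while the production workload is active.
iostat -dxk 5 sda sdb md1
```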

Apart from that, I'd say your setup has about the maximum possible abstraction latency, since you layer quite a lot of stuff between the virtual file system and the physical blocks of the device.

Additionally, can you verify that your HBA has a battery-backed write cache (BBWC), or that the disk write caches are enabled instead, and that the file systems on the hypervisor and inside the guests are not using barriers?
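A sketch of how those two things can be checked on the hypervisor (sda is an example device; adjust to your setup):

```shell
# Is the drive's volatile write cache enabled? hdparm -W with no
# value queries the write-caching flag instead of setting it.
hdparm -W /dev/sda

# Do the mounted ext4 file systems use write barriers? Barriers are
# on by default on kernels of this era; look for "nobarrier" or
# "barrier=0" in the mount options to see whether they were disabled.
grep ext4 /proc/mounts
```

Keep in mind that disabling barriers (e.g. mount -o remount,nobarrier) is only safe when a battery-backed cache guarantees queued writes survive a power loss.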