KVM guest I/O is much slower than host I/O: is that normal?
I have a QEMU-KVM host set up on CentOS 6.3, with four 1 TB SATA HDDs in software RAID 10. The CentOS 6.3 guest is installed on a separate LVM logical volume. People say they see guest performance almost equal to host performance, but I don't see that. My I/O tests show 30-70% slower performance on the guest than on the host. I tried changing the scheduler (elevator=deadline on the host and elevator=noop on the guest), setting blkio.weight to 1000 in the cgroup, and switching the disk bus to virtio, but none of these changes gave any significant improvement.
This is the relevant part of the guest's XML config:
<disk type='file' device='disk'>
<driver name='qemu' type='raw'/>
<source file='/dev/vgkvmnode/lv2'/>
<target dev='vda' bus='virtio'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
</disk>
These are my tests:
Host system:
iozone test
# iozone -a -i0 -i1 -i2 -s8G -r64k
      KB  reclen   write  rewrite    read  reread  random read  random write
 8388608      64  189930   197436  266786  267254        28644         66642
dd read test: one process and then four simultaneous processes
# dd if=/dev/vgkvmnode/lv2 of=/dev/null bs=1M count=1024 iflag=direct
1073741824 bytes (1.1 GB) copied, 4.23044 s, 254 MB/s
# dd if=/dev/vgkvmnode/lv2 of=/dev/null bs=1M count=1024 iflag=direct skip=1024 & dd if=/dev/vgkvmnode/lv2 of=/dev/null bs=1M count=1024 iflag=direct skip=2048 & dd if=/dev/vgkvmnode/lv2 of=/dev/null bs=1M count=1024 iflag=direct skip=3072 & dd if=/dev/vgkvmnode/lv2 of=/dev/null bs=1M count=1024 iflag=direct skip=4096
1073741824 bytes (1.1 GB) copied, 14.4528 s, 74.3 MB/s
1073741824 bytes (1.1 GB) copied, 14.562 s, 73.7 MB/s
1073741824 bytes (1.1 GB) copied, 14.6341 s, 73.4 MB/s
1073741824 bytes (1.1 GB) copied, 14.7006 s, 73.0 MB/s
dd write test: one process and then four simultaneous processes
# dd if=/dev/zero of=test bs=1M count=1024 oflag=direct
1073741824 bytes (1.1 GB) copied, 6.2039 s, 173 MB/s
# dd if=/dev/zero of=test bs=1M count=1024 oflag=direct & dd if=/dev/zero of=test2 bs=1M count=1024 oflag=direct & dd if=/dev/zero of=test3 bs=1M count=1024 oflag=direct & dd if=/dev/zero of=test4 bs=1M count=1024 oflag=direct
1073741824 bytes (1.1 GB) copied, 32.7173 s, 32.8 MB/s
1073741824 bytes (1.1 GB) copied, 32.8868 s, 32.6 MB/s
1073741824 bytes (1.1 GB) copied, 32.9097 s, 32.6 MB/s
1073741824 bytes (1.1 GB) copied, 32.9688 s, 32.6 MB/s
Guest system:
iozone test
# iozone -a -i0 -i1 -i2 -s512M -r64k
      KB  reclen   write  rewrite    read  reread  random read  random write
  524288      64   93374   154596  141193  149865        21394         46264
dd read test: one process and then four simultaneous processes
# dd if=/dev/mapper/VolGroup-lv_home of=/dev/null bs=1M count=1024 iflag=direct skip=1024
1073741824 bytes (1.1 GB) copied, 5.04356 s, 213 MB/s
# dd if=/dev/mapper/VolGroup-lv_home of=/dev/null bs=1M count=1024 iflag=direct skip=1024 & dd if=/dev/mapper/VolGroup-lv_home of=/dev/null bs=1M count=1024 iflag=direct skip=2048 & dd if=/dev/mapper/VolGroup-lv_home of=/dev/null bs=1M count=1024 iflag=direct skip=3072 & dd if=/dev/mapper/VolGroup-lv_home of=/dev/null bs=1M count=1024 iflag=direct skip=4096
1073741824 bytes (1.1 GB) copied, 24.7348 s, 43.4 MB/s
1073741824 bytes (1.1 GB) copied, 24.7378 s, 43.4 MB/s
1073741824 bytes (1.1 GB) copied, 24.7408 s, 43.4 MB/s
1073741824 bytes (1.1 GB) copied, 24.744 s, 43.4 MB/s
dd write test: one process and then four simultaneous processes
# dd if=/dev/zero of=test bs=1M count=1024 oflag=direct
1073741824 bytes (1.1 GB) copied, 10.415 s, 103 MB/s
# dd if=/dev/zero of=test bs=1M count=1024 oflag=direct & dd if=/dev/zero of=test2 bs=1M count=1024 oflag=direct & dd if=/dev/zero of=test3 bs=1M count=1024 oflag=direct & dd if=/dev/zero of=test4 bs=1M count=1024 oflag=direct
1073741824 bytes (1.1 GB) copied, 49.8874 s, 21.5 MB/s
1073741824 bytes (1.1 GB) copied, 49.8608 s, 21.5 MB/s
1073741824 bytes (1.1 GB) copied, 49.8693 s, 21.5 MB/s
1073741824 bytes (1.1 GB) copied, 49.9427 s, 21.5 MB/s
Is this normal, or did I miss something?
You're not done with performance tuning yet.
<driver name='qemu' type='raw' cache='writethrough' io='native'/>
First is which I/O mechanism to use.
QEMU has two asynchronous I/O mechanisms: POSIX AIO emulation using a pool of worker threads, and native Linux AIO. Set either io='native' or io='threads' in your XML to benchmark each of these.
Second is which caching mechanism to use. You can set cache='writeback' or cache='writethrough', or you can turn it off with cache='none', which you may actually find works best.
If you're using raw volumes or partitions, it is best to avoid the cache completely, which reduces data copies and bus traffic.
Don't use writeback unless your RAID array is battery-backed, or you risk losing data. (Of course, if losing data is OK, then feel free.)
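To benchmark the I/O and cache settings systematically, it can help to list every combination up front. The following is only a sketch: the helper name driver_variants is made up for illustration, and the attribute values are the ones mentioned above. As far as I know, some newer QEMU versions refuse io='native' unless the cache mode is none or directsync, so not every combination will necessarily start on every host.

```shell
#!/bin/sh
# Print one <driver> line per io/cache combination.
# Paste one at a time into the guest's <disk> element (for example with
# `virsh edit`), restart the guest, and rerun the same benchmark each time.
driver_variants() {
    for io in native threads; do
        for cache in none writethrough writeback; do
            printf "<driver name='qemu' type='raw' cache='%s' io='%s'/>\n" "$cache" "$io"
        done
    done
}

driver_variants
```

Changing one attribute at a time and keeping the benchmark identical makes it much easier to see which setting actually moves the numbers.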
Third, some other things that may help include turning off barriers, and using the deadline scheduler in the guest.
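A rough sketch of how to check and change the scheduler inside the guest. The device name vda comes from the XML in the question; the helper names and the /home mount point are assumptions for illustration, and the barrier option shown in the comment assumes an ext3/ext4 filesystem.

```shell
#!/bin/sh
# Extract the active scheduler (the name in square brackets) from the
# contents of /sys/block/<dev>/queue/scheduler, read on stdin.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Print the active scheduler for a block device (default: vda).
show_scheduler() {
    dev="${1:-vda}"
    active_scheduler < "/sys/block/$dev/queue/scheduler"
}

# Usage inside the guest, as root:
#   show_scheduler vda
#   echo deadline > /sys/block/vda/queue/scheduler
#
# Turning off barriers on an ext3/ext4 mount (only if losing recent
# writes on power failure is acceptable):
#   mount -o remount,barrier=0 /home
```

The echo into /sys takes effect immediately but does not survive a reboot; the elevator= kernel parameter mentioned in the question is the persistent equivalent.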
Finally, do some research. IBM gave a very interesting presentation on KVM I/O performance at the 2010 Linux Plumbers Conference. In addition, they have an extensive set of best practices on using KVM which will certainly be of interest.
P.S. Lengthy sequential reads and writes are rarely representative of a real-world workload. Try doing benchmarks with other types of workloads, ideally the actual application(s) you intend to run in production.
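As one possible starting point for a less sequential benchmark, here is a sketch that writes a job file for fio, a tool not used in the question and swapped in here purely as an example. The helper name, file paths, block size, read/write mix and queue depth are all assumptions to adjust toward your real workload.

```shell
#!/bin/sh
# Write a fio job file describing a mixed random read/write workload.
# $1 = job file to create, $2 = test file fio should use (both hypothetical).
write_fio_job() {
    cat > "$1" <<EOF
[randrw-test]
filename=$2
rw=randrw
rwmixread=70
bs=4k
size=1g
direct=1
ioengine=libaio
iodepth=16
runtime=60
time_based
group_reporting
EOF
}

# Usage, run identically on host and guest:
#   write_fio_job ./randrw.fio ./fio-testfile
#   fio ./randrw.fio
```

Comparing IOPS and latency from the same job on host and guest usually says more about virtualization overhead than a single large dd.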