Poor network performance with KVM (virtio drivers) - Update: with vhost_net

I've set up several KVM-based networks before and never encountered this issue; I can't for the life of me think what I'd have set up differently previously.

Setup

Basically, I've got an entirely Dell stack:

  • 2x Dell N2024s (stacked gigabit switches)
  • Several Dell R720s as KVM hypervisors
  • 2x Dell R320s as gateways/firewalls

All machines run CentOS 6.5; the hypervisors are basically a standard install with a few sysctl tweaks.

At the moment, I've got a few test VMs set up, with a similar setup to their masters (CentOS 6.x, base install with basic Puppet-driven configuration). All VMs are:

  • Bridged to one of two physically separated networks (i.e. each hypervisor has two Ethernet connections: one for a public/DMZ bridged LAN, the other for a private one) -- a sketch of the host-side bridge config is below, after this list
  • All VMs use virtio for network and block devices (basically the bog-standard result of running the virt-install command) -- e.g. (example libvirt config):

    <interface type='bridge'>
          <mac address='52:54:00:11:a7:f0'/>
          <source bridge='dmzbr0'/>
          <model type='virtio'/>
          <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
    
  • Given access to between 2 and 8 vCPUs and 8 to 64 GB of RAM, with their drives as LVM volumes on the host machine
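
For reference, the host-side bridging is just the standard CentOS 6 ifcfg setup, roughly along these lines (the interface name and addressing here are illustrative, not my exact config):

    # /etc/sysconfig/network-scripts/ifcfg-em1 -- physical NIC enslaved to the bridge
    DEVICE=em1
    ONBOOT=yes
    BRIDGE=dmzbr0

    # /etc/sysconfig/network-scripts/ifcfg-dmzbr0 -- bridge the VMs attach to
    DEVICE=dmzbr0
    TYPE=Bridge
    ONBOOT=yes
    BOOTPROTO=static
    IPADDR=192.0.2.10       # example address
    NETMASK=255.255.255.0
    DELAY=0                 # skip the STP forwarding delay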

Some simple file copies within the VMs and dd tests yield perfectly acceptable results (300 MB/s to 800 MB/s in these small-scale synthetic tests).
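
(For reference, the dd tests were simple synthetic runs along these lines -- sizes and paths are just illustrative:)

    # inside a VM: sequential write, bypassing the page cache so the figure is honest
    dd if=/dev/zero of=/tmp/ddtest bs=1M count=2048 oflag=direct

    # sequential read of the same file back
    dd if=/tmp/ddtest of=/dev/null bs=1M iflag=direct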

Network Performance between Physical Machines

I've left jumbo frame/MTU configuration alone for now, and server-to-server transfers quite happily max out the gigabit connection, or thereabouts (a flat 100 MB/s to 118 MB/s over several large-file tests to/from each machine).

Network Performance between a Physical Machine and VM (and VM to VM)

Rsync/SSH transfer rates fluctuate constantly (unstable), but always sit between 24 MB/s and a maximum of about 38 MB/s.
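
(These were simple large-file copies of this sort -- hostnames and paths are placeholders:)

    # from a physical host to a VM (and the reverse), watching the sustained rate
    rsync -av --progress /srv/test/bigfile.iso root@test-vm:/tmp/

    # plain scp of the same file as a sanity check
    scp /srv/test/bigfile.iso root@test-vm:/tmp/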

I've performed several other tests:

  • Between a physical machine's IP on one bridge and a VM on another bridge
  • Between a physical machine's IP on one bridge and a VM on the same bridge
  • Starting the VMs with the e1000 device driver instead of virtio

Nothing seems to have worked. Has anyone encountered this much of a performance degradation before? I've just checked my older network (hosted at another DC), and apart from the fact that it uses a different switch (a much cheaper, older PowerConnect 2824), the VM network performance there is closer to 80-90% of raw network performance -- not less than half.

If I can provide any setup/configs or extra information, I'm more than happy to!

Update (14/08/2014)

Tried a few things (the commands are roughly sketched after this list):

  • Enabled jumbo frames/MTU 9000 on the host bridge, adapter and VMs (marginal performance improvement; average now above 30 MB/s)
  • Tested GSO, LRO and TSO off/on on the host (no noticeable effect)
  • Tested further sysctl optimisations (tweaking rmem/wmem, giving a sustained 1-2% performance increase)
  • Tested the vhost_net driver (small increase in performance)
  • Enabled the vhost_net driver (as above) together with the same sysctl optimisations (at least a 10-20% performance jump over before)
  • Enabled multiqueue, as mentioned in Red Hat's performance optimisation guide, though I noticed no difference.
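
Roughly, the host-side changes looked like this (interface names and exact buffer values are illustrative, and the multiqueue part needs a recent enough libvirt/qemu):

    # jumbo frames on the physical NIC, the bridge, and (inside each guest) eth0
    ip link set dev em1 mtu 9000
    ip link set dev dmzbr0 mtu 9000

    # toggle offloads on the host NIC to test their effect (ethtool -k em1 to inspect)
    ethtool -K em1 gso off tso off lro off

    # sysctl buffer tweaks (persisted in /etc/sysctl.conf)
    sysctl -w net.core.rmem_max=16777216
    sysctl -w net.core.wmem_max=16777216
    sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
    sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"

    # make sure vhost_net is actually loaded and in use
    modprobe vhost_net
    lsmod | grep vhost_net

    # multiqueue virtio-net is set in the interface XML, e.g.:
    #   <model type='virtio'/>
    #   <driver name='vhost' queues='4'/>
    # and then enabled inside the guest with:
    #   ethtool -L eth0 combined 4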

The host process seems to sit at 125% CPU; could this have something to do with assigning too many vCPUs to the guest, or with CPU/NUMA affinity?
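
If pinning is worth a try, this is the sort of thing I'd test (the domain name, core numbers and NUMA node below are made up; numatune via virsh needs a reasonably recent libvirt):

    # check the host's NUMA layout first
    numactl --hardware

    # pin the guest's vCPUs to physical cores on a single node
    virsh vcpupin testvm 0 2
    virsh vcpupin testvm 1 3

    # keep the guest's memory on the same NUMA node
    virsh numatune testvm --nodeset 0 --mode strict --live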

However, after all that, I seem to have increased the average sustained rate from 25-30 MB/s to 40-45 MB/s. It's a decent improvement, but I'm sure I can get closer to bare-metal performance (it's still a fair way under half at the moment).

Any other ideas?


Solution 1:

Your KVM instances should be able to saturate your host's network connection with no issues.

My first recommendation here is to upgrade both the host and guest's kernel. The stock CentOS 6.5 kernel does not have great performance for KVM. I'd suggest kernel-lt from ELRepo (or kernel-ml if you're feeling brave). This should give you a decent boost in performance right off the bat.
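
Something along these lines on both host and guest should do it (check elrepo.org for the current release RPM -- the URL below may have moved on):

    # add the ELRepo repository
    rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
    rpm -Uvh http://www.elrepo.org/elrepo-release-6-6.el6.elrepo.noarch.rpm

    # install the long-term kernel and reboot into it
    yum --enablerepo=elrepo-kernel install kernel-lt
    reboot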

Next up, try testing with iperf3 (or even the older iperf). This will give you as close to a pure network test as possible. Your rsync/SSH tests are not really valid, because they're also hitting the disk. rsync especially may not be doing sequential I/O like your dd test (try using fio instead).
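
For example, something like this isolates the network path from the disk (on CentOS 6, iperf3 may need to come from EPEL or similar; sizes and durations are arbitrary):

    # on the VM (or target host)
    iperf3 -s

    # on the physical machine (or the other VM): TCP throughput for 30 seconds
    iperf3 -c <vm-ip> -t 30

    # and for the disk side, a sequential write test with fio
    fio --name=seqwrite --rw=write --bs=1M --size=2G --direct=1 --numjobs=1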

The interesting thing here is that VM-to-VM traffic never actually hits the network controller. It's handled purely within the host, so the rest of your network (and the various offload settings) doesn't really come into play here.

One other thing to check: has your server throttled down the CPUs? We've had a number of Dell machines decide they were idle and start running the CPUs significantly slower than they should have been. The power-saving features do not always recognize server workloads well.
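
A quick way to check is to compare the current clocks against the governor settings, e.g.:

    # current frequency of each core vs. what the governor is doing
    grep MHz /proc/cpuinfo
    cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

    # if it's stuck in a power-saving mode, force performance (cpupower is in
    # the cpupowerutils package on EL6)
    cpupower frequency-set -g performance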

You definitely want virtio here; don't waste your time testing any of the emulated options.

You didn't mention it, but if your servers have i350-based NICs, you can look into SR-IOV (assuming you only need <= 7 VMs per machine). This gives the VM direct access to the physical NIC (at the cost of some functionality, such as no nwfilter support) and will be more efficient. You don't need this to get full gigabit speeds, though.
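
On EL6 with the igb driver that's typically done with the driver's max_vfs option, roughly like this (the VF count is an example -- check the driver docs, and note that reloading igb will drop the host's own links on those ports):

    # ask the igb driver for 7 virtual functions per port
    echo "options igb max_vfs=7" > /etc/modprobe.d/igb-sriov.conf

    # reload the driver (or just reboot) and confirm the VFs appear
    modprobe -r igb && modprobe igb
    lspci | grep -i "Virtual Function"

    # then hand a VF to the guest via libvirt (an <interface type='hostdev'>
    # or <hostdev> entry in the domain XML, or virsh attach-device)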