Why is TCP accept() performance so bad under Xen?
Solution 1:
Right now: Small packet performance sucks under Xen
(moved from the question itself to a separate answer instead)
According to a user on HN (a KVM developer?), this is due to small-packet performance in Xen and also in KVM. It's a known problem with virtualization and, according to him, VMware's ESX handles this much better. He also noted that KVM is bringing some new features designed to alleviate this (original post).
This info is a bit discouraging if it's correct. Either way, I'll try the steps below until some Xen guru comes along with a definitive answer :)
Iain Kay from the xen-users mailing list compiled this graph. Notice the TCP_CRR bars and compare "2.6.18-239.9.1.el5" vs "2.6.39 (with Xen 4.1.0)".
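For reference, netperf's TCP_CRR test measures the connect/request/response rate, i.e. it exercises the same accept() path discussed here. A minimal invocation, assuming netperf is installed and a netserver instance is running on the guest (the 10.0.0.2 address is just a placeholder), would be something like:
# TCP_CRR opens a new TCP connection for every request/response pair,
# so the reported transactions/sec is roughly an accept()-rate benchmark
netperf -H 10.0.0.2 -t TCP_CRR -l 30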
Current action plan based on responses/answers here and from HN:
1. Submit this issue to a Xen-specific mailing list and xensource's bugzilla, as suggested by syneticon-dj. A message was posted to the xen-users list; awaiting a reply.
2. Create a simple, pathological, application-level test case and publish it. A test server with instructions has been created and published to GitHub. With this you should be able to see a more real-world use case compared to netperf.
3. Try a 32-bit PV Xen guest instance, as 64-bit might be causing more overhead in Xen. Someone mentioned this on HN. Did not make a difference.
4. Try enabling net.ipv4.tcp_syncookies in sysctl.conf, as suggested by abofh on HN. This apparently might improve performance since the handshake would occur in the kernel (see the sysctl sketch after this list). I had no luck with this.
5. Increase the backlog from 1024 to something much higher, also suggested by abofh on HN. This could also help since the guest could potentially accept() more connections during its execution slice given by dom0 (the host). Also covered in the sysctl sketch below.
6. Double-check that conntrack is disabled on all machines, as it can halve the accept rate (suggested by deubeulyou). Yes, it was disabled in all tests.
7. Check for "listen queue overflow and syncache buckets overflow" in netstat -s (suggested by mike_esspe on HN). See the diagnostics sketch after this list.
8. Split the interrupt handling among multiple cores (the RPS/RFS I tried enabling earlier is supposed to do this, but could be worth trying again). Suggested by adamt on HN. Also covered in the diagnostics sketch below.
9. Turn off TCP segmentation offload and scatter/gather acceleration, as suggested by Matt Bailey. (Not possible on EC2 or similar VPS hosts.)
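A minimal sketch of steps 4 and 5 above, assuming a recent Linux guest; the value 4096 is illustrative, not what was used in the original tests:
# Enable SYN cookies so the three-way handshake completes in the kernel
sysctl -w net.ipv4.tcp_syncookies=1
# Raise the kernel-side limits on half-open and accepted-but-unhandled connections
sysctl -w net.ipv4.tcp_max_syn_backlog=4096
sysctl -w net.core.somaxconn=4096
# Note: the application must also pass a matching backlog to listen(),
# e.g. listen(fd, 4096), since the smaller of the two values wins.
# Add the same keys to /etc/sysctl.conf to make them persistent.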
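And a sketch of the checks and tweaks from steps 7 and 8, assuming the guest's NIC is eth0 with a single receive queue (rx-0); adjust the names for your system:
# Look for drops caused by a full accept/SYN queue while the benchmark runs
netstat -s | grep -i -E 'listen|overflow|drop'
# Spread receive processing across CPUs 0-3 (bitmask f) with RPS
echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus
# Enable RFS so flows are steered to the CPU running the consuming application
echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
echo 32768 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt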
Solution 2:
Anecdotally, I found that turning off NIC hardware acceleration vastly improves network performance on the Xen controller (also true for LXC):
Scatter-gather acceleration:
/usr/sbin/ethtool -K br0 sg off
TCP Segmentation offload:
/usr/sbin/ethtool -K br0 tso off
Where br0 is your bridge or network device on the hypervisor host. You'll have to arrange for this to be turned off again at every boot (one way to do that is sketched below). YMMV.
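One way to make this persistent, assuming a Debian/Ubuntu-style host where br0 is configured via ifupdown; other distributions will need their own mechanism (an rc.local entry, a udev rule, etc.), and the iface stanza here is only an example, keep whatever bridge configuration you already have:
# /etc/network/interfaces
auto br0
iface br0 inet dhcp
    bridge_ports eth0
    post-up /usr/sbin/ethtool -K br0 sg off
    post-up /usr/sbin/ethtool -K br0 tso off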
Solution 3:
Maybe you could clarify a little bit: did you run the tests under Xen on your own server, or only on an EC2 instance?
accept() is just another syscall, and new connections are only different in that the first few packets will have some specific flags; a hypervisor such as Xen should definitely not see any difference. Other parts of your setup might: in EC2, for instance, I would not be surprised if Security Groups had something to do with it; conntrack is also reported to halve the new-connection accept rate (PDF).
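If you want to verify whether conntrack is actually in play on a given box, a quick check (assuming a reasonably recent kernel; older ones expose the counters under net.ipv4.netfilter.ip_conntrack_* instead) is:
# Is the connection-tracking module loaded at all?
lsmod | grep -E 'nf_conntrack|ip_conntrack'
# If it is, watch the table grow while the benchmark runs;
# hitting nf_conntrack_max will drop new connections
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max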
Lastly, there seem to be CPU/Kernel combinations that cause weird CPU usage / hangups on EC2 (and probably Xen in general), as blogged about by Librato recently.