C++ Socket Server - Unable to saturate CPU

I've developed a mini HTTP server in C++ using boost::asio. I'm now load testing it with multiple clients, but I've been unable to get close to saturating the CPU. I'm testing on an Amazon EC2 instance and seeing about 50% usage on one CPU, 20% on another, and the remaining two idle (according to htop).

Details:

  • The server fires up one thread per core
  • Requests are received, parsed, processed, and responses are written out
  • The requests are for data, which is read out of memory (read-only for this test)
  • I'm 'loading' the server using two machines, each running a Java application with 25 threads sending requests
  • I'm seeing about 230 requests/sec throughput (these are application requests, each composed of many HTTP requests)

So, what should I look at to improve this result? Given the CPU is mostly idle, I'd like to leverage that additional capacity to get a higher throughput, say 800 requests/sec or whatever.

Ideas I've had:

  • The requests are very small and often fulfilled in a few ms. I could modify the client to compose and send bigger requests (perhaps using batching)
  • I could modify the HTTP server to use the select design pattern. Is this appropriate here?
  • I could do some profiling to try to understand what the bottlenecks are

Solution 1:

boost::asio is not as thread-friendly as you would hope: there is a big lock around the epoll code in boost/asio/detail/epoll_reactor.hpp, which means that only one thread can call into the kernel's epoll syscall at a time. For very small requests this makes all the difference (you will only see roughly single-threaded performance).

Note that this is a limitation of how boost::asio uses the Linux kernel facilities, not necessarily the Linux kernel itself. The epoll syscall does support multiple threads when using edge-triggered events, but getting it right (without excessive locking) can be quite tricky.
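To illustrate what several threads sharing one edge-triggered epoll instance can look like, here is a minimal Linux-only sketch. It is not boost::asio's implementation and not the nginetd code; the function names (`drain`, `epoll_demo`) are hypothetical, a pipe stands in for a client socket, and EPOLLONESHOT is my own choice for keeping two threads off the same descriptor without a user-space lock.

```cpp
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <fcntl.h>
#include <unistd.h>

#include <atomic>
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Read a non-blocking fd until it reports EAGAIN. With edge-triggered
// events the fd must be drained, since leftover data raises no new event.
static std::size_t drain(int fd) {
    char buf[4096];
    std::size_t total = 0;
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) { total += static_cast<std::size_t>(n); continue; }
        if (n < 0 && errno == EINTR) continue;
        return total;  // EAGAIN (drained), EOF, or error
    }
}

// Hypothetical demo: n_threads workers wait on ONE epoll fd and consume
// `messages` one-byte writes from a pipe. Returns the bytes consumed.
std::size_t epoll_demo(unsigned n_threads, std::size_t messages) {
    int pipefd[2];
    int rc = pipe(pipefd);
    assert(rc == 0); (void)rc;
    fcntl(pipefd[0], F_SETFL, O_NONBLOCK);  // read end must not block
    int stop_fd = eventfd(0, EFD_NONBLOCK);
    int epfd = epoll_create1(0);
    assert(stop_fd >= 0 && epfd >= 0);

    const std::uint32_t data_flags = EPOLLIN | EPOLLET | EPOLLONESHOT;
    epoll_event data_ev{};
    data_ev.events = data_flags;
    data_ev.data.fd = pipefd[0];
    epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &data_ev);

    epoll_event stop_ev{};
    stop_ev.events = EPOLLIN;  // level-triggered and never read: wakes every worker
    stop_ev.data.fd = stop_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, stop_fd, &stop_ev);

    std::atomic<std::size_t> consumed{0};
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n_threads; ++i) {
        workers.emplace_back([&] {
            for (;;) {
                epoll_event ev{};
                int n = epoll_wait(epfd, &ev, 1, -1);  // all threads wait concurrently
                if (n < 0 && errno == EINTR) continue;
                if (n <= 0 || ev.data.fd == stop_fd) return;
                consumed += drain(ev.data.fd);
                // Re-arm the one-shot registration. The kernel re-checks
                // readiness here, so data that arrived meanwhile is reported.
                ev.events = data_flags;
                epoll_ctl(epfd, EPOLL_CTL_MOD, ev.data.fd, &ev);
            }
        });
    }

    for (std::size_t i = 0; i < messages; ++i) {
        char byte = 'x';
        ssize_t w = write(pipefd[1], &byte, 1);
        assert(w == 1); (void)w;
    }
    while (consumed.load() < messages) std::this_thread::yield();

    std::uint64_t one = 1;
    ssize_t s = write(stop_fd, &one, sizeof one);  // signal shutdown
    (void)s;
    for (auto& t : workers) t.join();
    close(pipefd[0]); close(pipefd[1]); close(stop_fd); close(epfd);
    return consumed.load();
}
```

Calling `epoll_demo(4, 1000)` runs four workers against the same epoll fd. The tricky part mentioned above shows up in the drain-then-re-arm step: skip the drain and an edge-triggered fd can go silent with data still pending.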

BTW, I have been doing some work in this area (combining a fully-multithreaded edge-triggered epoll event loop with user-scheduled threads/fibers) and made some code available under the nginetd project.

Solution 2:

As you are using EC2, all bets are off.

Try it using real hardware, and then you might be able to see what's happening. Trying to do performance testing in VMs is basically impossible.

I have not yet worked out what EC2 is useful for; if someone finds out, please let me know.

Solution 3:

From your comments on network utilization, you do not seem to have much network movement.

3 + 2.5 MiB/sec is in the 50 Mbps ballpark (compared to your 1 Gbps port).
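For anyone who wants to check that conversion, here is a small sketch of the arithmetic. The helper names are hypothetical, and the 3 and 2.5 MiB/sec figures are taken as given from the comments; 1 MiB is 1,048,576 bytes and 1 Mbps is 10^6 bits per second.

```cpp
#include <cassert>

// Convert a data rate from MiB/s to megabits per second.
double mib_per_sec_to_mbps(double mib_per_sec) {
    assert(mib_per_sec >= 0.0);
    return mib_per_sec * 1024.0 * 1024.0 * 8.0 / 1e6;
}

// Fraction of a link in use, with traffic and capacity both in Mbps.
double link_utilization(double used_mbps, double link_mbps) {
    assert(link_mbps > 0.0);
    return used_mbps / link_mbps;
}
```

`mib_per_sec_to_mbps(3.0 + 2.5)` comes to roughly 46 Mbps, and `link_utilization` of that against a 1000 Mbps port is a little under 5%, so the network link itself is nowhere near full.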

I'd say you are having one of the following two problems:

  1. Insufficient workload (low request rate from your clients)
  2. Blocking in the server (interfered response generation)
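The first possibility can be sanity-checked with Little's law (throughput = concurrency / latency). This is only a sketch: it assumes each of the 2 × 25 client threads sends one application request at a time and waits for the reply, which the question does not state, and the function names are hypothetical.

```cpp
#include <cassert>

// Average time per request (ms) implied by a measured throughput when
// `client_threads` each keep exactly one request outstanding.
double implied_latency_ms(unsigned client_threads, double requests_per_sec) {
    assert(requests_per_sec > 0.0);
    return 1000.0 * client_threads / requests_per_sec;
}

// Highest request rate the same closed-loop clients can generate if
// every request completes in `latency_ms`.
double max_throughput(unsigned client_threads, double latency_ms) {
    assert(latency_ms > 0.0);
    return 1000.0 * client_threads / latency_ms;
}
```

Under that assumption, `implied_latency_ms(50, 230.0)` is about 217 ms per application request, whereas `max_throughput(50, 5.0)` would be 10000 requests/sec if each one really finished in 5 ms. The gap suggests the time is going into round trips (each application request is many HTTP requests), into client-side overhead, or into blocking in the server, rather than into CPU work.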

Looking at cmeerw's notes and your CPU utilization figures (50% + 20% + 0% + 0%), a limitation in your server implementation seems most likely. I second cmeerw's answer (+1).

Solution 4:

230 requests/sec seems very low for such simple async requests. As such, using multiple threads is probably premature optimisation: get it working properly and tuned in a single thread, and see if you still need them. Just getting rid of unneeded locking may get things up to speed.

This article has some detail and discussion on I/O strategies for web server-style performance circa 2003. Anyone got anything more recent?