Does this prove a network bandwidth bottleneck?
I had incorrectly assumed that my internal AB testing meant my server could handle 1k concurrency at 3k hits per second.
My theory at the moment is that the network is the bottleneck: the server can't send enough data fast enough.
External testing from blitz.io at 1k concurrency shows my hits/s capping off at 180, with pages taking longer and longer to respond as the server is only able to return 180 per second.
I've served a blank file from nginx and benched it: it scales 1:1 with concurrency.
Now to rule out IO / memcached bottlenecks (nginx normally pulls from memcached), I serve up a static version of the cached page from the filesystem.
The results are very similar to my original test; I'm capped at around 180 RPS.
Splitting the HTML page in half gives me double the RPS, so it's definitely limited by the size of the page.
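If that's right, the arithmetic is simple: requests per second times bytes per response gives the outbound throughput the link has to sustain. A rough sketch of that math (the ~35 KB page size is a guess for illustration, not a measured value):

```python
# Back-of-the-envelope math: if the outbound link is the ceiling, then
# RPS * bytes-per-response = link throughput needed. The ~35 KB page size
# is an illustrative guess, not a measured value.
PAGE_BYTES = 35_000          # assumed full-page response size
RPS = 180                    # observed external request rate

implied_mbit = RPS * PAGE_BYTES * 8 / 1_000_000
print(f"{implied_mbit:.0f} Mbit/s implied outbound")   # ~50 Mbit/s
print(f"{RPS * 2} req/s possible with half the page")  # same bandwidth, half the bytes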
If I run ApacheBench locally on the server itself, I get consistent results of around 4k RPS on both the Full Page and the Half Page, at a high transfer rate (AB reports "Transfer rate: 62586.14 [Kbytes/sec] received").
If I run AB from an external server, I get around 180 RPS, the same as the blitz.io results.
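Converting AB's reported transfer rate into the same units as the network graph makes the local vs. remote gap obvious (this assumes AB's Kbytes are 1024-byte units):

```python
# Convert ApacheBench's "Transfer rate" line into Mbit/s for comparison with
# the network graphs. Assumes AB's Kbytes are 1024-byte units.
kbytes_per_sec = 62586.14
mbit_per_sec = kbytes_per_sec * 1024 * 8 / 1_000_000
print(f"{mbit_per_sec:.0f} Mbit/s")   # ~513 Mbit/s over loopback
```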
How do I know it's not intentional throttling?
If I benchmark from multiple external servers, all results are poor, which leads me to believe the problem is with MY server's outbound traffic, not a download-speed issue on my benchmarking servers / blitz.io.
So I'm back to my conclusion that my server can't send data fast enough.
Am I right? Are there other ways to interpret this data? Is the solution/optimization to set up multiple servers behind a load balancer, each serving 180 hits per second?
I'm quite new to server optimization, so I'd appreciate any confirmation interpreting this data.
Outbound traffic
Here's more information about the outbound bandwidth: the network graph shows a maximum output of 16 Mb/s, i.e. 16 megabits per second. That doesn't sound like much at all.
Due to a suggestion about throttling, I looked into this and found that Linode has a 50 Mbit/s cap (which I'm not even close to hitting, apparently). I had it raised to 100 Mbit/s.
Since Linode caps my traffic, and I'm not even hitting that cap, does this mean my server should indeed be capable of outputting up to 100 Mbit/s but is limited by some other internal bottleneck? I just don't understand how networks at this scale work: can they literally send data as fast as they can read it from the HDD? Is the network pipe that big?
In conclusion
1: Based on the above, I'm thinking I can definitely raise my 180 RPS by adding an nginx load balancer in front of multiple nginx servers, each serving exactly 180 RPS behind the LB.
2: If Linode has a 50/100 Mbit cap that I'm not even close to hitting, there must be something I can do to reach that limit with my single-server setup. If I can read and transmit data fast enough locally, and Linode even bothers to impose a 50/100 Mbit cap, there must be some internal bottleneck keeping me from reaching it, and I'm not sure how to detect it (see the sketch below). Correct?
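One rough way I can think of to check for such a bottleneck is to push raw bytes over a plain TCP socket to another machine, bypassing nginx and the disk entirely; if this also tops out far below the cap, the limit is in the network path rather than in the application stack. A sketch (the host, port, and listener are placeholders, not part of my setup):

```python
# Rough outbound-throughput probe: push raw bytes to another machine over a
# plain TCP socket, bypassing nginx and the disk. HOST/PORT are placeholders;
# the other end needs a listener that just reads and discards (e.g. netcat).
import socket
import time

HOST, PORT = "203.0.113.10", 5001   # hypothetical receiving server
CHUNK = b"x" * 65536                # 64 KiB per send
DURATION = 10                       # seconds to keep sending

sock = socket.create_connection((HOST, PORT))
sent = 0
start = time.time()
while time.time() - start < DURATION:
    sock.sendall(CHUNK)
    sent += len(CHUNK)
sock.close()

elapsed = time.time() - start
print(f"{sent * 8 / elapsed / 1_000_000:.1f} Mbit/s outbound")
```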
I realize the question is huge and vague now, but I'm not sure how to condense it. Any input is appreciated on any conclusion I've made.
Solution 1:
The issue was that I assumed the linode.com graph peaks were true peaks. It turns out the graph uses 5-minute average data points, so my peak appeared to be 24 Mbit/s when in fact I was hitting the 50 Mbit cap.
Now that they have raised it to 100 Mbit/s, my benchmarks immediately went up to the new outbound traffic limit.
If only I had noticed that earlier! A lot of my reasoning hinged on the idea that I was not hitting an outbound traffic limit due to that graph.
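To illustrate the trap with made-up numbers (only the averaging behaviour matters, not the exact traffic pattern):

```python
# Why a 5-minute-average graph can hide a bandwidth cap: one minute pegged at
# 50 Mbit/s followed by four quiet minutes averages out to a harmless-looking
# data point. The traffic pattern is invented purely to show the effect.
per_second = [50] * 60 + [5] * 240               # Mbit/s, one sample per second
graph_point = sum(per_second) / len(per_second)
print("true peak:", max(per_second), "Mbit/s")   # 50 -> the cap was hit
print("graphed value:", graph_point, "Mbit/s")   # 14.0 -> looks nowhere near it
```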
Now I peak at 370 requests per second, which is right under 100 Mbit/s; at that point I start getting a "backlog" of requests and response times start going up.
I can now increase maximum concurrency by shrinking the page; with gzip enabled I get 600 RPS.
I do still run into problems when I suddenly peak and the backlog of pending requests (limited by bandwidth) starts to accumulate, but that sounds like a different question.
It's been a great lesson in optimization / reading this data / narrowing down the possible problems. Thank you so much for your input!
Solution 2:
A bit late now that you figured it out... but maybe you should consider reading the ServerFault blog from time to time.
I'm thinking in particular of this post, where they discuss why a one-second polling interval sometimes doesn't cut it, in relation to a problem very similar to the one you had.
We discovered that we were discarding packets pretty frequently on 1 Gbit/s interfaces at rates of only 10-30 MBit/s which hurts our performance. This is because that 10-30 MBit/s rate is really the number of bits transfered per 5 minutes converted to a one second rate. When we dug in closer with Wireshark and used one millisecond IO graphing, we saw we would frequently burst the 1 Mbit per millisecond rate of the so called 1 Gbit/s interfaces.
Sure made me think. And I just know I'm busting that one out on the other SAs at my shop the first chance I get, and I'll look wickedly brilliant and perceptive when we hit this problem.
Who knows, I may even let some of them in on the secret. :)
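For anyone reading along, the same trap scales down: a counter sampled once per second can make a 1 Gbit/s interface look nearly idle even while it saturates for a few milliseconds at a time. A toy illustration, with numbers invented to mirror the quoted scenario:

```python
# A 1 s counter can hide millisecond-scale bursts: 20 ms at line rate on a
# 1 Gbit/s interface, then 980 ms idle, reports as a sleepy 20 Mbit/s average.
# The numbers are invented to mirror the scenario in the quoted post.
per_ms = [1.0] * 20 + [0.0] * 980                   # Mbit sent per millisecond
print("burst rate:", max(per_ms) * 1000, "Mbit/s")  # 1000.0 -> interface saturated
print("1 s counter:", sum(per_ms), "Mbit/s")        # 20.0 -> looks nearly idle
```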