Performance Tuning a High-Load Apache Server
Solution 1:
I'll start by admitting that I don't know much about running things in the cloud - but based on my experience elsewhere, I'd say this webserver config reflects a fairly low volume of traffic. That the runqueue is so large suggests there just isn't enough CPU available to deal with it. What else is in the runqueue?
We may be allowing far too many KeepAlive requests
No - KeepAlive still improves performance; modern browsers are very smart about knowing when to pipeline and when to run requests in parallel. However, a timeout of 5 seconds is still rather high, and you've got a LOT of servers sitting waiting - unless you've got HUGE latency problems, I'd recommend cranking this down to 2-3 seconds. That should shorten the runqueue a little.
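As a rough sketch, the relevant httpd.conf directives might look like this - the exact values are assumptions to tune against your own traffic and latency figures, not a prescription:

```apache
# Keep persistent connections, but free workers sooner
KeepAlive On
KeepAliveTimeout 2        # down from the default/current 5 seconds
MaxKeepAliveRequests 100  # cap requests served per persistent connection
```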
If you've not already got mod_deflate installed on the webserver, I'd recommend you do so - and add ob_gzhandler() to your PHP scripts. You can do this as an auto-prepend:
if (!ob_start("ob_gzhandler")) ob_start(); // gzip the output buffer if the client accepts it, else fall back to a plain buffer
(Yes, compression uses more CPU - but you should save CPU overall by getting servers out of the runqueue faster and handling fewer TCP packets - and as a bonus, your site gets faster too.)
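Assuming you're running mod_php, a minimal sketch of wiring both pieces up in the vhost config - the module path and prepend filename here are placeholders, not your actual paths:

```apache
# Enable mod_deflate for common text types
LoadModule deflate_module modules/mod_deflate.so
AddOutputFilterByType DEFLATE text/html text/css application/javascript

# Auto-prepend the ob_gzhandler snippet to every PHP script
php_value auto_prepend_file /var/www/prepend.php
```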
I'd recommend setting an upper limit on MaxRequestsPerChild - say something like 500. This just allows some turnover of processes in case you've got a memory leak somewhere. Your httpd processes look HUGE - make sure you've removed any Apache modules you don't need, and make sure you're serving static content with good caching headers.
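Something along these lines, as a sketch - the expiry times are illustrative and mod_expires is assumed to be available:

```apache
# Recycle children periodically to contain any slow leak
MaxRequestsPerChild 500

# Send caching headers on static content so clients don't re-fetch it
<IfModule mod_expires.c>
    ExpiresActive On
    ExpiresByType image/png "access plus 7 days"
    ExpiresByType text/css  "access plus 7 days"
</IfModule>
```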
If you're still seeing problems after that, then the problem is probably within the PHP code (if you switch to FastCGI, this should become evident without any major performance penalty).
update
If the static content doesn't vary much across pages, it might also be worth experimenting with closing the connection on responses that carry cookies (i.e. dynamic, per-user pages), so keepalive slots are reserved for static content:
if (count($_COOKIE)) {
    header('Connection: close');
}
on the PHP scripts too.
Solution 2:
You should consider installing an asynchronous reverse proxy, because the number of processes in the W (sending reply) state is quite high too. Your Apache processes seem to spend a lot of time blocked, sending content to slow clients over the network. Nginx or lighttpd as a frontend to your Apache server can reduce the number of processes in the W state dramatically. And yes, you should limit the number of keepalive requests; it's probably worth trying to turn keepalive off entirely.
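A minimal sketch of the nginx frontend, assuming Apache is moved to port 8080 on the same box (ports and headers are illustrative):

```nginx
server {
    listen 80;
    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        # nginx buffers the full response and drip-feeds slow clients itself,
        # so the Apache worker is freed almost immediately
        proxy_buffering on;
    }
}
```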
BTW, 107 Apache processes is far too many for 22 req/s; I've been able to serve 100-120 req/s using only 5 Apache processes. Probably the next step is to profile your application.
Solution 3:
You have two rows in your vmstat output where CPU wait time is fairly high, and around those you do a fair number of writes (the io/bo column) and context switches. I would look at what's writing those blocks and how to eliminate the wait; I think the most improvement is to be found in your disk I/O. Check syslog - set it to write asynchronously. Make sure your controller's write cache is working (check it - you might have a dead battery).
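With classic syslogd (and rsyslog in compatibility mode), prefixing the log file path with "-" skips the fsync after every message, which makes log writes asynchronous. A sketch - the facilities and paths are examples, match them to your own syslog.conf:

```
# /etc/syslog.conf fragment: "-" before the path disables sync-per-message
mail.*    -/var/log/mail.log
*.info    -/var/log/messages
```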
Keepalive isn't causing your performance problem; it saves you time on connection setup if you're not running a cache in front. You might bump MaxSpareServers up a bit so that in a crunch you're not waiting on all the forks.
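For the prefork MPM, that could look something like this - the numbers are a guess to adjust against your actual burst sizes:

```apache
# Keep enough idle children around that a traffic spike doesn't
# stall waiting for fork()
<IfModule mpm_prefork_module>
    StartServers     10
    MinSpareServers  10
    MaxSpareServers  30
</IfModule>
```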