Apache2 PHP Site - Hitting MaxClients limit - diagnosing?

We have a moderate-traffic site (roughly 20,000 hits a day) running a PHP/MySQL app on Apache 2.2 and Ubuntu 9.10 Server, on an Amazon EC2 c1.small instance (1.7 GB of RAM).

We had issues with the website repeatedly becoming unresponsive, so as a dirty hack I raised MaxClients/ServerLimit to 450:

<IfModule mpm_prefork_module>
KeepAlive           On
KeepAliveTimeout     7
StartServers          5
MinSpareServers       5
MaxSpareServers      10
MaxClients          450
ServerLimit         450
MaxRequestsPerChild   0
</IfModule>

The site seems to stay up longer than before, but still dies eventually. Checking the list of processes, I have the following (third column is resident memory in KB, fourth is virtual size in KB):

xxxxxxxxx@domU-XXXXXXXXX:/etc/apache2$ ps -eo pid,user,rss,vsz,args | grep apache
 2333 root     11092  39084 /usr/sbin/apache2 -k start
 3704 www-data 11060  41292 /usr/sbin/apache2 -k start
 3826 www-data 10016  39844 /usr/sbin/apache2 -k start
 3954 www-data 11976  41612 /usr/sbin/apache2 -k start
 4061 www-data 11844  41668 /usr/sbin/apache2 -k start
 4064 www-data 10988  40676 /usr/sbin/apache2 -k start
 4084 www-data 11804  41428 /usr/sbin/apache2 -k start
 4086 www-data 10192  39828 /usr/sbin/apache2 -k start
 4099 www-data 11876  41748 /usr/sbin/apache2 -k start
 4100 www-data 10980  40668 /usr/sbin/apache2 -k start
 4102 www-data  8952  39724 /usr/sbin/apache2 -k start
 4107 www-data 11856  41860 /usr/sbin/apache2 -k start
 4108 www-data  9952  39604 /usr/sbin/apache2 -k start
 4109 www-data     0      0 [apache2] <defunct>
 4114 www-data  7172  39724 /usr/sbin/apache2 -k start
 4115 www-data 10968  40668 /usr/sbin/apache2 -k start
 4122 www-data 11888  41844 /usr/sbin/apache2 -k start
 4123 www-data 11584  41444 /usr/sbin/apache2 -k start
 4124 www-data  7036  39596 /usr/sbin/apache2 -k start
 4125 www-data  6744  39084 /usr/sbin/apache2 -k start
 4126 www-data  9532  39552 /usr/sbin/apache2 -k start
 4127 www-data 10112  39812 /usr/sbin/apache2 -k start
 4128 www-data  6600  39084 /usr/sbin/apache2 -k start
 4129 www-data  6736  39084 /usr/sbin/apache2 -k start
 4130 www-data  7004  39596 /usr/sbin/apache2 -k start
 4131 www-data  6740  39084 /usr/sbin/apache2 -k start
 4132 www-data 11616  41596 /usr/sbin/apache2 -k start
 4134 www-data  7024  39588 /usr/sbin/apache2 -k start
 4135 www-data 11808  41516 /usr/sbin/apache2 -k start
 4136 www-data  7008  39460 /usr/sbin/apache2 -k start
 4137 www-data  6988  39460 /usr/sbin/apache2 -k start
 4139 1003       796   3040 grep --color=auto apache
xxxxxxxxx@domU-XXXXXXXXX:/etc/apache2$

Is there an easy way to find out what exactly is going on? My understanding of Apache's innards isn't that good, but I would have thought we wouldn't need this many concurrent processes to serve a page like this, with this sort of traffic. We inherited the app, so we don't know much about its insides, but it's a fairly basic CMS-type site showing a few search results; I didn't think it would need this sort of grunt.
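Doing the arithmetic, that traffic averages out to well under one request per second:

awk 'BEGIN { printf "%.2f requests/second average\n", 20000 / 86400 }'

So unless individual requests are taking a very long time, 450 workers seems wildly out of proportion.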

I did run ab against the site and was getting a fairly lousy request rate (well under 50 a second), but that may have been my poor choice of settings - a lot of the requests seemed to fail.
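(For reference, a more deliberate run would look something like this - the hostname is a placeholder, and -c should stay well below MaxClients:)

# ~1,000 requests at 20 concurrent, with HTTP keepalive enabled
ab -n 1000 -c 20 -k http://www.example.com/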

Where should I be looking for information on what's happening, or any troubleshooting tips I could try?

Cheers, Victor


Solution 1:

450 children with an RSS of around 10 MB each is over 4 GB of potential memory usage - more than enough to push your c1.small instance into swap. Swapping is almost always a downward spiral for Apache servers.
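A quick way to put numbers on that from your own box (a sketch - it assumes the workers show up as apache2, as in your ps output, and lumps the root parent in with the children):

ps -o rss= -C apache2 | awk '
    { n++; sum += $1 }
    END { printf "now: %d processes, %.0f MB total RSS (%.1f MB avg)\n", n, sum / 1024, sum / n / 1024
          printf "worst case at MaxClients=450: ~%.1f GB\n", 450 * (sum / n) / 1024 / 1024 }'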

The next few things I'd check are:
- does the Apache error log mention hitting MaxClients?
- does dmesg or /var/log/messages mention the OOM killer at all?
- is the server swapping?
- is the memory usage growth slow and steady, or spiky and rapid-onset?

The first two just mean reading text files. The third you can do from the command line, though graphs will help; for the fourth you need graphs. Set up Apache's mod_status (it's probably already installed - just uncomment it) and point munin/collectd/cacti at it. A sketch of these checks follows below.
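Concretely, something like this covers all four checks (a sketch for Ubuntu - log locations and init commands vary by distro):

# Checks 1 and 2: just reading text files.
grep -i maxclients /var/log/apache2/error.log | tail
dmesg | grep -iE 'oom|out of memory'
grep -iE 'oom|out of memory' /var/log/syslog   # Ubuntu's equivalent of /var/log/messages

# Check 3: is the box swapping right now?
free -m       # non-trivial swap usage is a bad sign
vmstat 5 5    # sustained non-zero si/so columns mean active swapping

# Check 4: turn on mod_status so munin/collectd/cacti have something to graph.
sudo a2enmod status
sudo /etc/init.d/apache2 reload
curl -s 'http://localhost/server-status?auto'   # machine-readable scoreboard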

If you confirm the cause is memory exhaustion and swapping, there's plenty you can do from there. First off, lower MaxClients to around 150. That'll leave some room for other stuff and the filesystem cache (is MySQL on this box? If so, leave more). RSS is a rough metric to extrapolate from like this, but it's all we've got; see the sizing sketch below. Once you've tuned that, watch the graphs over time and see if you have room to go up or down. From there you can focus on 1.) skinnier Apache children (fewer modules, a tighter PHP config), 2.) having Apache do less (some mix of a CDN, alternative HTTP servers, and HTTP proxy options), and 3.) upgrading the hardware ($$$).
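As a rough sizing helper (a sketch - the 200 MB reservation for MySQL, the OS, and the page cache is an illustrative guess; raise it if MySQL lives on this box):

# total RAM and reservation are in MB - adjust both to your situation:
ps -o rss= -C apache2 | awk -v total=1700 -v reserved=200 '
    { n++; sum += $1 }
    END { avg = sum / n / 1024
          printf "avg child RSS: %.1f MB -> MaxClients around %d\n", avg, (total - reserved) / avg }'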

Solution 2:

You may have persistent connections open to your server with a long timeout. As additional clients keep connecting, they tie up more and more of the Apache processes. With persistent connections on, each client can hold one (or more) connections to your server.

Check this out for more info: http://httpd.apache.org/docs/1.3/misc/perf-tuning.html
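A quick way to see whether idle keepalive connections are tying up children (assumes netstat from net-tools, which ships with Ubuntu):

# Count TCP connections on port 80 by state; lots of long-lived ESTABLISHED
# connections from distinct clients points at keepalive holding processes open:
netstat -tan | awk '$4 ~ /:80$/ { print $6 }' | sort | uniq -c | sort -rn

mod_status's scoreboard will also show these as 'K' (keepalive) workers.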

Solution 3:

I consolidated a bunch of performance tuning tips into http://www.anchor.com.au/hosting/dedicated/improving-server-capacity for work recently; it's worked pretty well on the machines I've applied it to. Beyond that, if it's a broader machine-performance problem that might not be specific to Apache, I've written a much more in-depth article at http://www.anchor.com.au/hosting/development/HuntingThePerformanceWumpus which covers identifying which component of the system is causing problems.