How to debug Apache timeouts?

Solution 1:

The first thing I notice, looking at your first graph, is what seems to be an hourly slowdown (occurring around 40 minutes past the hour), which may be contributing to the problem. You should have a look at the task schedulers on the OS and the database (cron jobs, for example).
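
One way to check whether the slowdown really is tied to the clock is to count slow requests by minute of the hour. The following is only a sketch: it assumes you have already extracted (timestamp, duration in ms) pairs from your access log, and the function name and the 5000 ms threshold are made up for illustration.

```python
from collections import Counter
from datetime import datetime


def slow_requests_by_minute(requests, threshold_ms=5000):
    """Count slow requests per minute-of-hour (0-59).

    requests: iterable of (datetime, duration_ms) pairs (assumed input shape).
    threshold_ms: hypothetical cut-off for what counts as "slow".
    """
    counts = Counter()
    for timestamp, duration_ms in requests:
        if duration_ms >= threshold_ms:
            counts[timestamp.minute] += 1
    return counts


if __name__ == "__main__":
    # Invented sample data, not taken from the question.
    sample = [
        (datetime(2024, 1, 1, 10, 40, 12), 31000),
        (datetime(2024, 1, 1, 11, 41, 3), 29500),
        (datetime(2024, 1, 1, 12, 40, 55), 30200),
        (datetime(2024, 1, 1, 12, 5, 0), 180),
    ]
    for minute, count in slow_requests_by_minute(sample).most_common():
        print(f"minute {minute:02d}: {count} slow request(s)")
```

If the counts pile up around one or two minutes of the hour, a scheduled job is a likely suspect; if they are spread evenly, the hourly pattern in the graph may be a coincidence.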

Based on the data you've supplied, my next step would be to plot the frequency of response times (number of responses on the Y axis vs. duration on the X axis), including only URLs which exhibit the timeout (or preferably one URL at a time). On a typical system this should follow a normal or Poisson distribution - the requests which are timing out may simply be part of the tail, in which case you need to focus your efforts on general tuning. OTOH, if the distribution is bimodal, then you need to look for contention somewhere in your code.
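
A rough sketch of that frequency plot in Python, assuming you already have the response durations for a single URL in milliseconds (Apache can log the time taken per request with %D in the LogFormat, which is in microseconds, so you would divide by 1000). The function names, the 100 ms bucket width, and the peak-counting heuristic are my own choices for illustration, not a rigorous bimodality test.

```python
from collections import Counter


def duration_histogram(durations_ms, bucket_ms=100):
    """Return {bucket_start_ms: number_of_responses}, sorted by bucket."""
    counts = Counter((int(d) // bucket_ms) * bucket_ms for d in durations_ms)
    return dict(sorted(counts.items()))


def count_modes(histogram, bucket_ms=100, min_count=2):
    """Crude peak count: buckets taller than their left neighbour and at
    least as tall as their right neighbour, ignoring very small buckets."""
    modes = 0
    for start, count in histogram.items():
        left = histogram.get(start - bucket_ms, 0)
        right = histogram.get(start + bucket_ms, 0)
        if count >= min_count and count > left and count >= right:
            modes += 1
    return modes


if __name__ == "__main__":
    # Invented durations for one URL: a fast cluster plus a cluster near 30 s.
    durations = [120, 130, 150, 180, 210, 220, 260, 310, 30050, 30080, 30010]
    hist = duration_histogram(durations)
    for start, count in hist.items():
        print(f"{start:>6} ms  {'#' * count}")
    print("peaks found:", count_modes(hist))
```

A single peak with a long tail points towards general tuning; a second, separate peak near the timeout value is the bimodal case and suggests something is blocking those requests.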

Solution 2:

I have another thought on this, based on the fact that you get a large number of requests per day, and seem to have timeouts only during peak hours (from the pictures you posted).

There's a post on the Server Fault blog, "Per Second Measurements Don't Cut It". Is it possible that some of these requests are running into the same problem the Server Fault team ran into?

We discovered that we were discarding packets pretty frequently on 1 Gbit/s interfaces at rates of only 10-30 Mbit/s, which hurts our performance. This is because that 10-30 Mbit/s rate is really the number of bits transferred per 5 minutes converted to a one second rate. When we dug in closer with Wireshark and used one millisecond IO graphing, we saw we would frequently burst the 1 Mbit per millisecond rate of the so-called 1 Gbit/s interfaces.
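
To make the arithmetic concrete: 1 Gbit/s is 1,000,000 bits per millisecond, so an interface can look nearly idle as an average and still be offered more than it can carry within individual milliseconds. The sketch below assumes you have exported packet timestamps (in microseconds) and sizes (in bytes) from a capture; the function names and the sample traffic are invented for illustration.

```python
from collections import defaultdict

LINK_BITS_PER_MS = 1_000_000  # 1 Gbit/s expressed per millisecond


def bits_per_millisecond(packets):
    """packets: iterable of (timestamp_us, size_bytes) pairs (assumed shape).
    Returns {millisecond_index: bits offered in that millisecond}."""
    buckets = defaultdict(int)
    for timestamp_us, size_bytes in packets:
        buckets[timestamp_us // 1000] += size_bytes * 8
    return dict(buckets)


def average_mbit_per_s(packets, window_s):
    """Average rate over the whole window, as a coarse graph would show it."""
    total_bits = sum(size_bytes * 8 for _, size_bytes in packets)
    return total_bits / window_s / 1e6


def overloaded_milliseconds(packets):
    """Milliseconds in which the offered load exceeds the line rate."""
    buckets = bits_per_millisecond(packets)
    return sorted(ms for ms, bits in buckets.items() if bits > LINK_BITS_PER_MS)


if __name__ == "__main__":
    # Invented traffic: every 100 ms, 125 packets of 1500 bytes arrive
    # within a single millisecond (1.5 Mbit per burst), for one second.
    packets = [
        (burst * 100_000 + i * 8, 1500)
        for burst in range(10)
        for i in range(125)
    ]
    print("average Mbit/s:", average_mbit_per_s(packets, window_s=1.0))
    print("overloaded ms :", len(overloaded_milliseconds(packets)))
```

In this made-up example the one-second average is only 15 Mbit/s, yet ten separate milliseconds are offered 1.5 Mbit each, which is more than the link can send in that time, so the excess has to be queued or dropped.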