Finding out what causes CLOSE_WAIT connections with Apache, PHP & MySQL
First, here is a little bit of context.
We have a custom-built PHP application that runs under Apache and powers our website.
We are currently experiencing high traffic on our website. Here is our current setup:
- 10 Linux web servers behind a load balancer (each with 8 CPUs and 30 GB of RAM)
- 1 Linux MySQL database server (30 CPUs, 120 GB of RAM)
The traffic holds fine most of the time, but sometimes, for reasons we cannot explain, we see a spike in the number of active MySQL connections. The count keeps climbing until it reaches the maximum, which ultimately makes the application unusable for our web users.
When this happens, all the servers look fine in terms of load average, memory, CPU usage and disk swap; they have plenty of resources available.
We did notice that there are many Apache processes with a connection in the CLOSE_WAIT state; we saw approximately 600 such processes on one of our web servers.
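For anyone who wants to check the same thing, counting the sockets in that state with netstat or ss is enough to see it, for example:

    # count sockets stuck in CLOSE_WAIT (netstat prints one line per socket)
    netstat -tan | grep -c CLOSE_WAIT

    # the same with ss, which also shows which local/remote pairs are affected
    ss -tan state close-wait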
This seems to be a symptom of the problem we are encountering. However, we are having difficulty digging deeper. Here are my questions:
- Why is Apache hanging on to those processes?
- Are there any tools or debugging techniques we could use to find out what is causing this?
- What metrics should we look at to understand what is going on?
Thanks in advance for your help.
I think you have a query that is locking a table (or some rows) that other MySQL connections are trying to update, and holding that lock for longer than it should. When that happens, all the incoming requests stack up behind it until you hit the connection limit.
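A quick way to confirm this while it is happening (assuming InnoDB on MySQL 5.5 or later, where the information_schema lock tables exist; on 8.0 the equivalents moved to performance_schema/sys) is to look at what every connection is doing and who is blocking whom:

    -- what every connection is doing right now
    SHOW FULL PROCESSLIST;

    -- which transactions are blocked, and which transaction is blocking them
    SELECT r.trx_id AS waiting_trx, r.trx_query AS waiting_query,
           b.trx_id AS blocking_trx, b.trx_query AS blocking_query
    FROM information_schema.innodb_lock_waits w
    JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id
    JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id;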
The same thing is happening on the Apache side: requests come in but never receive a response because their queries are blocked on the database. PHP has an open connection to the database; it has made a query and has yet to receive a response. Apache 'hanging' at that point is exactly what you would expect it to do, since it is waiting for an answer.
Apache appears to be hung from the outside (your browser/mobile app/etc.) because all of the children available across all your servers are stuck waiting on a database reply. There are literally no more connections available. (This could also be a connection limit set on your load balancer.) If you aren't already, start logging state changes on your load balancer. You will likely see each of your web servers going up and down repeatedly while the 'thundering herd' problem (explained later) is occurring.
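To confirm that the Apache children really are all busy waiting on the database rather than sitting idle, look at Apache's scoreboard and at MySQL's thread counters. This assumes mod_status is enabled and reachable from localhost:

    # BusyWorkers and the scoreboard ('W' = sending reply, i.e. the script is still running)
    curl -s 'http://localhost/server-status?auto'

    # on the database server: connected vs. actually running threads
    mysqladmin extended-status | grep -E 'Threads_(connected|running)'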
I believe your connections in CLOSE_WAIT are a symptom, not the problem. I wouldn't spend any time troubleshooting that angle until I had taken care of the more obvious candidate (the database). Odds are that once you fix that, your huge number of CLOSE_WAITs will go away.
To start troubleshooting on the database side, enable the slow query log if you haven't already done so. Have it log queries that take longer than 1 second or so and see what shows up when the problem occurs.
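For example, something like this in my.cnf (variable names are for MySQL 5.1+; the log file path and the 1-second threshold are just placeholders):

    # /etc/my.cnf
    [mysqld]
    slow_query_log      = 1
    slow_query_log_file = /var/log/mysql/mysql-slow.log
    long_query_time     = 1

Or, without a restart, from a mysql prompt:

    SET GLOBAL slow_query_log = 'ON';
    SET GLOBAL long_query_time = 1;

Keep in mind that a GLOBAL change to long_query_time only takes effect for connections opened after the change.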
Note: the slow query log does not record a query until that query has completed. Do not assume that the first query showing up when the problem starts is the culprit. It may or may not be.
Now, you might expect the website to return to normal once the problematic query that was blocking the others has finished...
Not so. Say you normally have 500 requests/sec coming in, can handle 1,000 requests/sec in total, and the query locks up the database for 10 seconds. There are now 5,000 requests WAITING to be handled, in addition to the 500/sec that are still coming in. Even once the lock is released, you only have 500 requests/sec of spare capacity, so it takes roughly another 10 seconds just to drain that backlog. This is known as the thundering herd problem.
Your problem could be something else entirely, but these are exactly the symptoms of a problem I have dealt with many times, and in most of those cases the cause was a database query blocking other queries. The only other time I've run into this that was not due to the database was on CentOS 6 (RHEL 6 has the problem too). Unfortunately, Red Hat's knowledge base article discussing it is available to subscribers only, but there are other references around if you search for them. If you think this might be the case, it's mind-numbingly easy to test: you only need to add a single line to your resolv.conf.
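The knowledge base article isn't named here, so this is an assumption on my part: the RHEL/CentOS 6 issue that usually fits this description is glibc stalling when it sends the A and AAAA DNS lookups in parallel over one socket. If that is the case, the commonly used one-line workarounds are the resolver options below (see resolv.conf(5)):

    # /etc/resolv.conf -- pick one
    options single-request
    # or, alternatively
    options single-request-reopen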
If the problem tends to show up at (or close to) the same time of day, check your cron jobs (or anything else that runs on a set schedule) to see whether the problem query is coming from one of them.
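A quick way to inventory everything that runs on a schedule on each box (the paths are the usual Red Hat/Debian locations; adjust for your distro):

    # system-wide entries
    cat /etc/crontab
    ls /etc/cron.d/ /etc/cron.hourly/ /etc/cron.daily/

    # every user's crontab (run as root)
    for u in $(cut -d: -f1 /etc/passwd); do
        crontab -l -u "$u" 2>/dev/null | sed "s/^/$u: /"
    done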
Finally, if you do determine that you are being bitten by the thundering herd problem, I'd suggest setting limits on your load balancer. Benchmark one server to determine roughly the maximum number of requests it can handle simultaneously, and configure the load balancer so it never sends more than that number of concurrent connections to each back-end web server.
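As a rough sketch, assuming ApacheBench for the benchmark and HAProxy as the load balancer (both are assumptions; the hostnames, URL and numbers are placeholders):

    # hit one back end directly to find the concurrency it handles comfortably
    ab -n 20000 -c 200 http://web01.internal/typical-page.php

    # haproxy.cfg: cap concurrent connections per back end; excess requests
    # queue at the load balancer instead of piling up as Apache children
    backend web_servers
        server web01 10.0.0.11:80 check maxconn 150
        server web02 10.0.0.12:80 check maxconn 150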
Good luck.