Postgres suddenly slower and higher CPU usage
We're running a Postgres DB that's about 38GB, hosted on a 68GB EC2 instance. It's been humming along, with load at ~0.7 (on an 8 core machine) and little CPU usage until about 16 hours ago, when CPU usage ramped up over the course of a couple of hours and is now much higher than before (around 20% of total), and load average has increased accordingly (now between 5 and 8).
The machine sees between 100 and 300 simultaneous connections, according to pg_stat_activity. Our largest table is about 9.0GB including indexes.
Here's what we've ruled out:
- pg_top shows that there are always a couple of threads running with high CPU percentage, while the rest of the threads below them look fine. There's nothing extraordinary about the SELECTs and INSERTs there.
- IOWait hasn't increased at all, and it doesn't seem like we're hitting disk more often
- As far as we can tell, usage patterns haven't changed; in fact, Monday is slower than weekends
- We increased the size of our memcached instance, to no avail
- We turned off synchronous_commit, but that also did not help much
- We've tried both increasing and decreasing shared buffers (currently set to 2GB); neither 4GB nor 1GB improved things
- Restarting postgres and all our apache servers hasn't helped
- Restarting pgbouncer, which our Apache/Django boxes use to talk to Postgres, also proved unfruitful
- All of our major queries have the correct indexes, and we've verified that postgres is using them. We ran a VACUUM ANALYZE just to be safe, and the query planner looks like it's doing the right thing
- Queries will be fine for a few seconds, then seemingly lock up; an index scan that usually takes 200 ms or less will then take several whole seconds.
Any ideas on how to proceed, or track down this problem?
Solution 1:
Turn on full logging for a short period (~10 minutes) if you can. Save the log, and have pgfouine (http://pgfouine.projects.postgresql.org/) analyse it. It's possible you're doing more queries than before, or you're doing a query which isn't very efficient.
Next it's possible you have some slow queries impacting things badly. Take a look at which queries are running right now:
SELECT pg_stat_activity.procpid AS pid, pg_stat_activity.usename AS username, pg_stat_activity.waiting, now() - pg_stat_activity.query_start AS age, pg_stat_activity.current_query AS query
FROM pg_stat_activity
WHERE pg_stat_activity.current_query <> '<IDLE>'::text
ORDER BY now() - pg_stat_activity.query_start DESC;
Look for the queries at the top of the list. Has one been running for a very long time? If you have a transaction which has been open for 16+ hours, it will definitely slow things down a lot.