Postgres suddenly slower and higher CPU usage

We're running a Postgres DB that's about 38GB, hosted on a 68GB EC2 instance. It's been humming along, with load at ~0.7 (on an 8 core machine) and little CPU usage until about 16 hours ago, when CPU usage ramped up over the course of a couple of hours and is now much higher than before (around 20% of total), and load average has increased accordingly (now between 5 and 8).

The machine sees between 100 and 300 simultaneous connections, according to pg_stat_activity. Our largest table is about 9.0GB including indexes.

Here's what we've ruled out:

  • pg_top shows that there are always a couple of threads running with high CPU percentage, while the rest of the threads below them look fine. There's nothing extraordinary about the SELECTs and INSERTs there.
  • IOWait hasn't increased at all, and it doesn't seem like we're hitting disk more often
  • As far as we can tell, usage patterns haven't changed; in fact, Monday is slower than weekends
  • We increased the size of our memcached instance, to no avail
  • We turned off synchronous_commit, but that also did not help much
  • We've tried both increasing and decreasing shared buffers (currently set to 2GB); neither 4GB nor 1GB improved things
  • Restarting postgres and all our apache servers hasn't helped
  • Restarting pgbouncer, which our Apache/Django boxes use to talk to Postgres, also proved unfruitful
  • All of our major queries have the correct indexes, and we've verified that postgres is using them. We ran a VACUUM ANALYZE just to be safe, and the query planner looks like it's doing the right thing
  • Queries will be fine for a few seconds, then seemingly lock up; an index scan that usually takes 200 ms or less will then take several whole seconds.

Any ideas on how to proceed, or track down this problem?


Solution 1:

Turn on full logging for a short period (~10 minutes) if you can. Save the log, and have pgfouine (http://pgfouine.projects.postgresql.org/) analyse it. It's possible you're doing more queries than before, or you're doing a query which isn't very efficient.

Next it's possible you have some slow queries impacting things badly. Take a look at which queries are running right now:

SELECT pg_stat_activity.procpid AS pid, pg_stat_activity.usename AS username, pg_stat_activity.waiting, now() - pg_stat_activity.query_start AS age, pg_stat_activity.current_query AS query FROM pg_stat_activity WHERE pg_stat_activity.current_query <> '<IDLE>'::text ORDER BY now() - pg_stat_activity.query_start DESC;

Look for the queries at the top of the list. Has one been running for a very long time? If you have a transaction which has been open for 16+ hours, it will definitely slow things down a lot.