We have been hitting the upper limit of our provisioned IOPS for the last several days. We bumped the provisioned IOPS a few times: 1250 -> 2500 -> 4500. At each step we quickly saw all of the available IOPS being consumed, maxed out between reads and writes.

Finally, this morning another large customer onboarded and our CPU, memory, and IOPS all maxed out. I decided to provision a larger database instance type to keep the system running: we went from db.m5.xlarge to db.m5.2xlarge, and we raised provisioned IOPS to 8000 with the upgrade.

Now our CPU is getting hammered by comparison, while our IOPS usage is practically zero. The app seems responsive, but our task queues are quite slow.

No code has changed, just the database. What am I missing? Is RDS not reporting correctly?

[Screenshot: database metrics, last 3 hours]

[Screenshot: database metrics, last 3 days]

You'll notice the jumps in read IOPS over the last 3 days; those correlate with the higher IOPS provisioning.


Solution 1:

You probably have a query that fell off a cliff. (Thanks for the good description. Here's the likely explanation.)

An example: let's say a query is scanning a 7 GB table with 70M rows and no useful index.

Case 1: RAM = 8 GB; innodb_buffer_pool_size = 6 GB. The query bumps everything else out of the cache to finish the scan, which takes a lot of I/O. All other queries are impacted as well -- their data was bumped out of RAM, too.

Case 2: RAM = 16 GB; innodb_buffer_pool_size = 12 GB. The first scan loads the entire table into the cache; subsequent runs of that query don't need to do any I/O. Even the other queries run faster because there is less contention for the cache.

In both cases, a fair amount of CPU is probably chewed up looking at all 70M rows.
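
A quick way to tell which case you're in is to compare your biggest tables against the buffer pool. A rough sketch (the schema name below is a placeholder; adjust for your database):

    -- Largest tables by data + index size, in GB.
    SELECT table_name,
           ROUND((data_length + index_length) / 1024 / 1024 / 1024, 2) AS total_gb
    FROM   information_schema.tables
    WHERE  table_schema = 'your_schema'          -- placeholder schema name
    ORDER  BY (data_length + index_length) DESC
    LIMIT  10;

    -- Current buffer pool size, in GB.
    SELECT ROUND(@@innodb_buffer_pool_size / 1024 / 1024 / 1024, 1) AS buffer_pool_gb;

If a frequently scanned table is larger than (or close to) the buffer pool, you are effectively in Case 1.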

Plan A: Keep spending money on more RAM, more CPU, and IOPS you don't really need.

Plan B: Find that very naughty query and let us help optimize it. Then you can go back to a cheaper instance.
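
If you're not sure which query that is, performance_schema (when it is enabled) can rank statements by how many rows they examine, which is a good proxy for "scans a huge table":

    -- Top statements by rows examined since the last reset/restart.
    -- Requires performance_schema = ON (check with SHOW VARIABLES LIKE 'performance_schema').
    SELECT digest_text,
           count_star,
           sum_rows_examined,
           ROUND(sum_timer_wait / 1e12, 1) AS total_latency_seconds
    FROM   performance_schema.events_statements_summary_by_digest
    ORDER  BY sum_rows_examined DESC
    LIMIT  10;

The slow query log (with long_query_time lowered) works too if performance_schema is turned off.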

Please provide the query and SHOW CREATE TABLE for the table(s) involved. The solution may be as simple as adding a composite index or reformulating the query.
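
To illustrate the kind of fix this usually turns into, here's a made-up example (the table and columns are hypothetical, not taken from your schema):

    -- Hypothetical slow query: an equality filter plus a range filter,
    -- but no index covers both columns, so InnoDB examines far too many rows.
    EXPLAIN
    SELECT *
    FROM   tasks
    WHERE  customer_id = 123
      AND  created_at >= '2023-01-01';

    -- A composite index with the equality column first and the range column second
    -- typically lets InnoDB touch only the matching rows:
    ALTER TABLE tasks
      ADD INDEX idx_customer_created (customer_id, created_at);

With the right index in place, the scan (and the buffer-pool churn that comes with it) goes away, and the smaller instance is often enough again.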