AWS RDS db.t2 instance performance thresholds & monitoring
We have been rolling out a standard web server configuration for mainstream CMS software like Drupal & WordPress, with the server & storage on EC2 / EBS and the database for those software packages in RDS / MySQL.
Usually we go into production with a t2.micro EC2 instance and a db.t2.micro DB instance, which makes clients happy with us & AWS since they can often stay on the free tier for the first year. The default monitoring tools on EC2 show clearly when we might be exceeding the dearest resource for the web host, which is CPU Utilization: if that metric nears or passes 10%, we know the time has come to migrate to the t2.small instance type.
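For what it's worth, that decision rule can be sketched in a few lines of Python. The 10% figure is our own rule of thumb, not an AWS limit, and the datapoints would normally come from CloudWatch; here they are hard-coded sample values for illustration:

```python
# Our in-house rule of thumb for a t2.micro web host (assumption, not an AWS limit)
CPU_UPGRADE_THRESHOLD = 10.0  # percent

def needs_upgrade(cpu_datapoints, threshold=CPU_UPGRADE_THRESHOLD):
    """Return True if any average CPUUtilization datapoint nears or passes
    the threshold -- our signal to move up an instance size."""
    return any(avg >= threshold for avg in cpu_datapoints)

# Hypothetical 5-minute CPUUtilization averages (percent)
sample = [3.2, 4.1, 11.7, 5.0]
print(needs_upgrade(sample))  # True: one datapoint passed 10%
```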
We are far less certain how to determine when we might need to upgrade from db.t2.micro to db.t2.small & perhaps beyond. These requirements wouldn't involve multi-AZ or read replicas, just conditions when CMS software might lean heavily on the database during peak periods that we will need to spot via a graph or alarm.
The docs for EC2 instances clearly indicate what their own limits are, and I was wondering whether any such limits for RDS instances are recommended for our simple case. The general guidance in Best Practices for Amazon RDS is helpful, though I haven't followed all the links, since I am simply trying to set thresholds that clearly mandate a DB instance upgrade in a way my non-technical clients can understand & observe.
I confess I am not a DBA; by the nature of my work I have left the database architecture to the designers of the CMS software. I am certainly willing to learn the basics of performance assessment if someone will tell me where to start as it relates to this configuration on the AWS platform. Maybe I just haven't found the right official docs or tutorials yet.
Alternatively: we just need to know how to measure quantitatively whether any delay in accessing our RDS instance is the result of the instance size being too small (or perhaps the MySQL resource parameters being set too low), based on what we see in CloudWatch.
Trivially, I can tell that if the CloudWatch metric FreeableMemory gets close to zero, we need an instance upgrade. And as with our EC2 instance there must be a maximum CPU Utilization, which I would guess is far below 100%, though again I haven't seen this documented as I have for EC2. I imagine there is also a practical maximum for DatabaseConnections. Finally, I hope someone will tell me how to interpret WriteIOPS & ReadIOPS, and whether these impose performance limitations on small configurations like ours or are simply used to compute cost.
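To make concrete the kind of threshold checks I'm hoping exist, here is a Python sketch of an evaluator over those metrics. Every numeric limit in it is my own placeholder assumption for a db.t2.micro (1 GiB of RAM), not a documented AWS figure:

```python
# Placeholder thresholds for a db.t2.micro -- assumptions, NOT documented AWS limits
MIN_FREEABLE_BYTES = 100 * 1024 * 1024   # warn under 100 MiB freeable memory
MAX_CPU_PERCENT = 80                     # warn if sustained CPU runs this hot
MAX_CONNECTIONS = 60                     # guess at a comfortable micro-instance ceiling

def rds_warnings(metrics):
    """Given a dict of CloudWatch metric averages for one period,
    return a list of human-readable upgrade warnings."""
    warnings = []
    if metrics.get("FreeableMemory", float("inf")) < MIN_FREEABLE_BYTES:
        warnings.append("FreeableMemory under 100 MiB: consider a larger instance")
    if metrics.get("CPUUtilization", 0) > MAX_CPU_PERCENT:
        warnings.append("CPUUtilization sustained high: burst credits may be draining")
    if metrics.get("DatabaseConnections", 0) > MAX_CONNECTIONS:
        warnings.append("DatabaseConnections high for a micro instance")
    return warnings

print(rds_warnings({"FreeableMemory": 50 * 1024 * 1024,
                    "CPUUtilization": 12,
                    "DatabaseConnections": 5}))
```

In practice the input dict would be filled from CloudWatch statistics rather than typed by hand.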
p.s., I tried to post this on AWS Forums: Amazon Relational Database Service but the Post New Thread link currently yields a "Redirect loop." (Sorry I can't include more URLs in here, but I'm not allowed.)
[edit, response to comment] thanks @Ross, I didn't know CPUCreditBalance was also available on RDS (I'd seen it on EC2); didn't see there was a second screen with 7 more metrics with all 17 selectable from a list. I'm still wondering what limitations might be imposed on monitorable resources other than CPU, especially I/O activity, according to RDS instance type.
p.p.s., I have refined the question a bit more & posted on AWS forums (How to determine if RDS T2 instances are right sized with CloudWatch stats?)
I have gained some perspective on this over the last few months & I believe these items to watch will address all the concerns above:
1) The comment from @Ross on the original posting is the key. T2 instances, no matter what size and no matter whether they are EC2 or RDS, will stop performing when their CPU credits run out while peak CPU demand continues.
2) The failure mode of a CMS web server we have seen most often shows exactly this condition: the CloudWatch graph dives toward zero when the CPU percentage needed by the httpd processes exceeds the baseline CPU percentage assigned to that instance type (see the doc link below).
3) The quick solution for a T2 instance that has exhausted its CPU credits is to stop the instance (not terminate it), upgrade the instance type, and start it again, which takes about 3-4 minutes. The most vital description of the capacities of different instance types is here: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/t2-instances.html
4) Any production web server on AWS must have an Elastic IP address assigned in advance, for this reason: without one, resizing the instance changes its public IP address, leaving the web server inaccessible far beyond what would otherwise be only 3-4 minutes of downtime.
5) Credits accrue continuously at the instance's baseline rate, but under sustained peak demand that accrual cannot keep up, so in practice the only way to get more CPU headroom is to upgrade the machine type. The number of credits each T2 instance size can bank is described in the doc link above: it always equals 24 hours' worth of credit accrual for that instance type.
6) The machine can be returned to its original scale during a bit of scheduled downtime (again, 3-4 minutes) after peak performance demands die down.
7) I/O activity hasn't caused any performance degradation for our web server in any peak period so far. For gp2 EBS volumes, baseline IOPS is determined by volume size (3 IOPS per GiB, with a minimum of 100). Both the exact meaning of IOPS and that relationship are described here: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-io-characteristics.html
8) Neither of the CloudWatch metrics FreeableMemory nor DatabaseConnections was of any use in anticipating or correcting performance problems in our web-server-intensive environment.