EC2: Regular performance issues without obvious resource contention

We are running LAMP+memcached on an Ubuntu 9.10 x64 xlarge Amazon EC2 instance. This server handles a few hundred requests per second, of which about 60% are static and the remainder all interact with MySQL and/or memcached in some way. The server has been suffering from two performance issues that are possibly related and have proven difficult to diagnose. All statistics below were gathered with CloudWatch, Munin or vmstat/iostat/top unless otherwise specified.

  1. The first problem is regular spikes of high iowait every few minutes, during which most Apache processes simultaneously sit in iowait for about 10-30 seconds before all un-hanging at once. There is no increased disk or network load during this time, the disk queue stays low, no swapping is going on, etc.

  2. More seriously, during peak times the server sometimes suddenly experiences a drastic loss of performance, with serviced requests dropping to ~1/3 of what they were a moment before. Once started, this performance dip can last anywhere from 2 to 8 hours before suddenly springing back up to full performance again. When this happens, it's as if the system just stops doing anything. CPU utilization, disk load and network load (as reported by CloudWatch) all simultaneously drop proportionally, yet there is no disk contention. Disk queue and throughput both drop and are well below maximum at all times, especially during these dips. EDIT: This problem has been resolved. Apache was running out of worker processes, and for some reason that tanked performance across the board, even for the processes that were still working fine (see the configuration sketch after the setup details below).

The exception is network reads, which remain as high as before, indicating that the server is still being accessed at as high a volume as before. If we attempt to contact the server ourselves when this happens, it is extremely slow and often simply drops the connection before the request can be serviced. It should be noted that neither memory usage nor CPU utilization is especially high at any time, whether or not performance is currently tanking: CPU% rarely goes above 10%, and the disk is not full or congested. We have not been able to gather data on swap performance during these dips yet, but are attempting to do so.
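
One way to capture that swap data when a dip hits is to leave a couple of samplers running in the background. A minimal sketch, assuming the sysstat package is installed (log paths, intervals and sample counts are illustrative):

    # vmstat: the si/so columns show swap-in/swap-out per second
    vmstat 5 720 >> /var/log/vmstat-dip.log &
    # sar from sysstat: -W reports pswpin/s and pswpout/s, -B reports paging rates
    sar -W 5 720 >> /var/log/sar-swap-dip.log &
    sar -B 5 720 >> /var/log/sar-paging-dip.log &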

As it is, we are running low on ideas for what could be causing these mysterious issues and are increasingly worried that this may be a problem (or misfeature) of EC2 itself. The fact that the massive dips always seem to occur when our traffic is peaking (though, again, this does not mean the server is close to maxing out its available resources) cannot simply be coincidence.

All MySQL databases and logs are hosted on one EBS volume and all static content is hosted on a separate EBS volume. Apache serves 160-240 requests per second and MySQL handles 180-200 queries per second, with ~0% slow queries and a ~90% hit rate from memcached. Load average tends to hover around 3. Apache access logging is disabled to minimize disk access.
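
Regarding the edit above: the worker exhaustion in question is typically governed by the prefork MPM limits. A hedged sketch of the relevant directives, assuming the stock Ubuntu prefork MPM (the values are illustrative, not a recommendation, and need to be sized against available RAM, roughly free memory divided by per-process footprint):

    # /etc/apache2/apache2.conf -- prefork MPM section (values illustrative)
    <IfModule mpm_prefork_module>
        StartServers          10
        MinSpareServers       10
        MaxSpareServers       25
        MaxClients           150
        MaxRequestsPerChild 2000
    </IfModule>
    # If MaxClients is raised above 256, ServerLimit must be raised to match.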


Solution 1:

Because EC2 is a shared hosting environment (your instance shares the same physical hardware with other instances), you can see substantial variability in I/O operations. EBS volumes are essentially NAS and share the same NIC with ordinary network traffic, and each physical host has only a 1Gb connection to the backbone. So not only do you have contention with other customers' network operations, you also have network contention between their disks and yours. In practice, the network contention is not ordinarily a problem unless you are sharing the box with many other high-traffic customers. You can get around some of that by using larger instances (larger instances take up a larger percentage of the box and thus have fewer shared resources).

What kind of IOPS are you experiencing at peak and during these problem periods? (the tps column of sar -d)

What is your steal time during these periods? (iostat -x 1 or sar -u).
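
To collect both of those, a minimal sketch assuming the sysstat package is installed (intervals are illustrative):

    # IOPS per device: watch the tps column
    sar -d 5
    # Steal time: the %steal column shows cycles taken by other guests on the same host
    iostat -x 5
    sar -u 5
    # Steal also shows up as the "st" column at the far right of vmstat
    vmstat 5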

You can increase your IOPS capacity, which should help your iowait time, by software RAIDing multiple EBS volumes together. It sounds kludgy, but it actually works. This will not solve network contention problems, but with your traffic I highly doubt you are saturating the link. It is possible that another customer is, however, and causing you some pain.
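
A hedged sketch of what that looks like, assuming four freshly attached EBS volumes on /dev/sdh through /dev/sdk (device names, sizes and mount point are illustrative):

    # Stripe four EBS volumes into a RAID 0 array for higher aggregate IOPS
    sudo apt-get install mdadm xfsprogs
    sudo mdadm --create /dev/md0 --level=0 --raid-devices=4 \
        /dev/sdh /dev/sdi /dev/sdj /dev/sdk
    sudo mkfs.xfs /dev/md0
    sudo mkdir -p /vol/db
    sudo mount /dev/md0 /vol/db

RAID 0 gives no redundancy, so EBS snapshots (or RAID 10 instead) need to cover that.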

Sometimes, unfortunately, the simplest solution to this type of problem is to respin the instance. It will likely come up on a different host with different shared customers. It is fairly common practice for EC2 customers to spin up instances, run some benchmarks, and respin if they are unhappy with the results.
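
A quick, hedged way to compare hosts after a respin (sequential throughput only, not a proper benchmark; paths and sizes are illustrative):

    # Sequential write to the EBS volume, bypassing the page cache
    dd if=/dev/zero of=/vol/testfile bs=1M count=1024 oflag=direct
    # Drop caches, then read it back
    sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
    dd if=/vol/testfile of=/dev/null bs=1M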

Another recommendation is to split your web and database tiers onto different servers. A single server running both web and database is usually a bad idea for a number of reasons, and in this case it is probably making the bottleneck even harder to diagnose.
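
The change itself is small. A hedged sketch, assuming the database moves to a second instance with a private address of 10.0.0.20 (the address is illustrative; the web tier's application config would then point at this host instead of localhost):

    # /etc/mysql/my.cnf fragment on the dedicated database instance
    [mysqld]
    bind-address = 10.0.0.20

Lock port 3306 down so only the web tier can reach it.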

Solution 2:

Most likely (and as noted, you already found a resolution for the second issue) these problems are configuration-based or similar. EC2/EBS/cloud technology is not at the root of this; contrary to the other answer so far, these are issues you would run into in any environment.

Also, Amazon does provide an SLA. There are situations, albeit extremely rare, in which some resources might come into contention; however, that is unlikely considering your current usage. I would continue to do diagnostic research on the various points of contention and also speak with the technical teams at Amazon Web Services. Also check out their forums, as there are usually very knowledgeable people there. You may already be aware of them, but just in case, they are here: https://forums.aws.amazon.com/index.jspa

Also, just from an architectural perspective, have you thought about distributing this load across multiple EC2 instances behind a load balancer? That should resolve some of these issues, and from the architecture you are describing, it may work out better overall to split the work across several slightly less powerful instances. The other advantage is that if your site/services continue to grow, you are in a good position to scale horizontally instead of vertically, the latter of course being limited.
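
For reference, with Amazon's Elastic Load Balancing the basic setup is roughly as follows. A minimal sketch using the modern AWS CLI, which postdates this question (the load balancer name, zones and instance IDs are illustrative):

    # Create a classic load balancer listening on port 80
    aws elb create-load-balancer \
        --load-balancer-name www-lb \
        --listeners "Protocol=HTTP,LoadBalancerPort=80,InstanceProtocol=HTTP,InstancePort=80" \
        --availability-zones us-east-1a us-east-1b

    # Register the web instances behind it
    aws elb register-instances-with-load-balancer \
        --load-balancer-name www-lb \
        --instances i-1234567a i-89abcdef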