Get an AWS EC2 EBS volume to perform over 20,000 IOPS

The company I work for is currently going through an AWS migration, and for 99% of our services, AWS's commodity hardware does the job just fine.

The exception is the production database: we currently sit at 60,000 IOPS just to keep up with requests, and it's due to see much more action this year.

We've looked at using enterprise SSDs on EC2, but the IOPS hard limit is 20,000, which is pretty terrible considering I can get a 240GB SSD that performs at 80,000 IOPS for about €200: http://www.techradar.com/reviews/pc-mac/pc-components/storage/disk-drives-hdd-ssd/intel-ssd-520-series-240gb-1060850/review

Any ideas on how I can get past this limit? Is a cluster/RAID of EBS volumes possible?

Thanks, Ben


Solution 1:

RAID of EBS volumes is certainly possible. Amazon even has documentation on it: http://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/raid-config.html

They're presented to the OS as simple devices, so you can use the OS's software-RAID on them. I've done this with Linux 'mdadm' software-RAID without difficulty.

Make sure the instance-type you choose can handle the high I/O and networking, and of course consider the failure modes.
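
As a rough illustration of the mdadm approach, the sketch below only builds (it does not run) the command line for striping several attached EBS volumes into one RAID 0 array. The device names `/dev/xvdf` to `/dev/xvdh` and the array name `/dev/md0` are assumptions for the example; check what your instance actually exposes. RAID 0 is picked here purely for throughput and has no redundancy, which is exactly the kind of failure mode to think about.

```python
import shlex


def build_mdadm_raid0_command(devices, array="/dev/md0"):
    """Return the mdadm argv that would stripe `devices` into one RAID 0 array."""
    if len(devices) < 2:
        raise ValueError("RAID 0 needs at least two devices")
    return [
        "mdadm", "--create", array,
        "--level=0",                       # striping, no redundancy
        f"--raid-devices={len(devices)}",  # must match the number of devices listed
        *devices,
    ]


# Hypothetical device names; the real ones depend on how the volumes were attached.
ebs_devices = ["/dev/xvdf", "/dev/xvdg", "/dev/xvdh"]
command = build_mdadm_raid0_command(ebs_devices)

# Print the command for review instead of executing it.
print(shlex.join(command))
```

The printed command would need to be run as root on the instance, after which the array still has to be formatted and mounted like any other block device.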

Solution 2:

Such I/O rates are simply not possible on a single EBS volume, at least not right now. As mentioned, a RAID of EBS volumes (either GP2 or PIOPS) should fit your needs, but an upper limit of 65K IOPS per instance will still apply.

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html
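
To make the arithmetic concrete, here is a small sketch using the figures from this thread: 60,000 IOPS needed, 20,000 IOPS per volume, and 65,000 IOPS per instance. It assumes IOPS add up linearly across a striped array, which is an idealization, and the limits are the ones quoted here rather than whatever AWS currently documents. The function name is made up for the example.

```python
def plan_striped_volumes(target_iops, per_volume_iops, per_instance_iops):
    """Estimate how many volumes to stripe and whether the instance cap allows the target.

    Returns (volume_count, achievable_iops, target_met).
    """
    # Ceiling division on integers: smallest volume count whose combined IOPS reach the target.
    volume_count = -(-target_iops // per_volume_iops)
    # Idealized: striped IOPS add up linearly, then the per-instance cap applies.
    achievable_iops = min(volume_count * per_volume_iops, per_instance_iops)
    return volume_count, achievable_iops, achievable_iops >= target_iops


# Figures quoted in this thread (assumed, not re-checked against current AWS limits).
volumes, achievable, ok = plan_striped_volumes(60_000, 20_000, 65_000)
print(f"{volumes} volumes, up to {achievable} IOPS, target met: {ok}")
```

With these numbers, three volumes just reach the current 60,000 IOPS, leaving only about 5,000 IOPS of headroom under the per-instance cap before growth makes EBS insufficient.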

Should you require even higher I/O rates, only instances with SSD instance store will grant you such power. As for the upper limit for this use case, Amazon recently released the new I3 family, with instances providing up to eight 1.9 TB high-performance NVMe SSD drives, granting you an astonishing 3.3 million IOPS and 16 GB/s of disk bandwidth.

Unfortunately for your use case, instance storage may be too risky for a transactional SQL workload, since its data does not survive an instance stop or termination. But if I/O is a critical issue for your business, you could consider accepting that risk and mitigating it with extensive backup and disaster recovery policies, at least until you can afford an architecture that scales better.