Amazon EC2 Data Persistence

According to the Amazon EC2 FAQ, when an instance is terminated the data is gone. What steps can I take to preserve data in the event my instance is rebooted? I've been looking into EBS and S3. Would either of these be useful for storing an active database? How often are instances rebooted anyway?


Solution 1:

Like others have said, use EBS (Elastic Block Store). I have been using it myself since it was released to the general public. It beats S3 on several points:

  • EBS volumes are fast. According to Amazon, they are faster than even the local instance storage.
  • EBS volumes attach as proper block devices. S3, by contrast, requires either custom S3 object access logic in your code or middleware (JungleDisk, ElasticDisk, et al.), which brings its own problems and costs.
  • EBS volumes are easy to back up. Amazon lets you take snapshots, which are stored in S3.
  • EBS volumes are portable between instances: a volume can be detached from one instance and attached to another.
  • EBS volumes can even be combined in a RAID array for improved reliability.
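
To illustrate the portability point, here is a minimal sketch of moving a volume between instances. It assumes `ec2` is a boto3 EC2 client (for example `boto3.client("ec2")`, which needs boto3 and AWS credentials); the function name `move_volume`, the default device name and the IDs in the usage note are made up for illustration. Unmount the filesystem inside the old instance first.

```python
def move_volume(ec2, volume_id, target_instance_id, device="/dev/sdf"):
    """Detach an EBS volume and attach it to another instance.

    `ec2` is assumed to behave like a boto3 EC2 client. The device name
    is only a placeholder; pick one that is free on the target instance.
    """
    # Ask EC2 to detach the volume from whatever instance holds it now.
    ec2.detach_volume(VolumeId=volume_id)

    # Block until the volume reports the "available" state.
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])

    # Attach it to the new instance; the caller still has to mount it there.
    return ec2.attach_volume(
        VolumeId=volume_id,
        InstanceId=target_instance_id,
        Device=device,
    )
```

A call would look something like `move_volume(ec2, "vol-0123", "i-0456")`. The volume and both instances have to be in the same availability zone for the attach to succeed.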

My experience with EBS so far has been the most positive thing about AWS I've dealt with to date.


Update: While my experience with EBS has been positive, others have had issues. Specifically, EBS reportedly does not implement fsync() correctly. Ted Dziuba has some interesting words about this in his blog post Amazon — The Purpose of Pain, under Myth 2: Architecture Will Save You from Cloud Failures:

This gets even more entertaining with Amazon Elastic Block Store, which, as the Reddit administrators have found, will happily accept calls to fsync(), and lie to your face, saying that the data has been written to disk, when it may not have been.
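
For context, this is roughly the write pattern a database relies on when it commits. It is only a sketch: the helper name `durable_append` and the file name are made up. The point is that `os.fsync()` is the application's sole way to ask for data to reach the device. If the storage layer acknowledges the call before the data is really persistent, nothing in this code can detect it.

```python
import os
import tempfile


def durable_append(path, record):
    """Append bytes to a file and ask the OS to push them to the device."""
    with open(path, "ab") as f:
        f.write(record)
        f.flush()             # move data from Python's buffer to the OS
        os.fsync(f.fileno())  # ask the OS to flush its cache to the device


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        log = os.path.join(d, "wal.log")
        durable_append(log, b"commit 1\n")
        durable_append(log, b"commit 2\n")
        with open(log, "rb") as f:
            print(f.read())
```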

Solution 2:

EBS would certainly work for a database, and is one of the examples in Amazon's EBS Description. "Amazon EBS is particularly suited for applications that require a database..."

EBS works just like a block device (think hard disk), so you can use it with the same freedom and familiarity as any local disk. S3 is conceptually more like a really fast FTP server with a special API. You could conceivably use it as part of a database, but standard databases don't run on it (yet?).
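
The difference shows up directly in code. The sketch below assumes `s3` is a boto3 S3 client (for example `boto3.client("s3")`); the function names, mount point, bucket and key are hypothetical. With EBS you use ordinary file I/O on the mounted volume, while every S3 access is an API request for a whole object.

```python
import os


def save_to_volume(mount_point, name, data):
    """EBS: once the volume is mounted, this is ordinary file I/O."""
    path = os.path.join(mount_point, name)
    with open(path, "wb") as f:
        f.write(data)
    return path


def save_to_s3(s3, bucket, key, data):
    """S3: each write uploads a whole object through the API.

    `s3` is assumed to behave like a boto3 S3 client.
    """
    s3.put_object(Bucket=bucket, Key=key, Body=data)


def load_from_s3(s3, bucket, key):
    """S3: each read downloads the object body through the API."""
    return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
```

A database engine expects the first style (seek, partial writes, fsync), which is why it runs on EBS without modification but not on S3.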

You will want to review Amazon's descriptions of performance (EBS > S3), durability (S3 > EBS) and price (depends).

Solution 3:

As mentioned in other answers, EBS is the standard solution for persistent and convenient disk storage. It should be your default option. Indeed, the newer EC2 instance types use it by default, over the original, non-persisted instance storage.

However, when considering persistence, you'll also want to carefully consider availability (avoiding times when the data is not available) and durability (avoiding loss of data) for your data.

EBS covers the basic use cases, but keep in mind:

  • S3 is designed for higher durability than EBS. Simply put, Amazon keeps more copies of your data and claims an extremely high 99.999999999% durability for S3 (see the S3 FAQ). The actual figure (so high that it makes a Martian invasion look more probable) matters less than the fact that AWS has staked its reputation on S3 durability and has a very good track record. The same is not true for EBS.
  • While Amazon will not give statistics on this directly, many people believe instance storage has historically offered higher availability than EBS.
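
To put the S3 durability figure in perspective, here is a back-of-the-envelope calculation. It assumes the percentage is an independent, per-object, per-year survival probability, which is my reading of the claim and a simplification. The 10,000,000 object count is just an example.

```python
from fractions import Fraction


def expected_annual_losses(durability_percent, object_count):
    """Expected number of objects lost per year.

    Assumes (as a simplification) that the durability figure is an
    independent, per-object, per-year survival probability.
    """
    # Fraction parses the decimal string exactly, avoiding float rounding.
    loss_probability = 1 - Fraction(durability_percent) / 100
    return object_count * loss_probability


losses = expected_annual_losses("99.999999999", 10_000_000)
print(losses)      # 1/10000 objects per year
print(1 / losses)  # 10000 years per lost object, on average
```

Under that assumption, storing ten million objects means losing one about every ten thousand years on average.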

Recent AWS outages, such as the severe multi-day failure in 2011 and another in 2012, illustrate the complexities of EBS and the small but non-negligible risk of outages and data loss.

Bottom line: To be sure you won't lose your data, keep data backups in S3. EBS snapshots are an easy way to do this for EBS. If high availability is critical, consider also using instance storage in multiple availability zones (in addition to, or instead of, EBS).
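
A snapshot-based backup can be scripted along these lines. This is a sketch, assuming `ec2` is a boto3 EC2 client (for example `boto3.client("ec2")`). The function name `snapshot_volume`, the description format and the volume ID in the usage note are my own choices for illustration.

```python
import datetime


def snapshot_volume(ec2, volume_id, wait=False, now=None):
    """Start an EBS snapshot of `volume_id` and return the snapshot ID.

    `ec2` is assumed to behave like a boto3 EC2 client. The snapshot is
    stored by AWS in S3; it does not appear in your own buckets.
    """
    now = now or datetime.datetime.now(datetime.timezone.utc)
    description = f"backup of {volume_id} at {now:%Y-%m-%dT%H:%M:%SZ}"

    # create_snapshot returns right away, with the snapshot still pending.
    response = ec2.create_snapshot(VolumeId=volume_id, Description=description)
    snapshot_id = response["SnapshotId"]

    if wait:
        # Optionally block until AWS reports the snapshot as completed.
        ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot_id])
    return snapshot_id
```

Running something like `snapshot_volume(ec2, "vol-0123")` from cron gives you periodic backups. For a database volume, flush or briefly pause writes first so the snapshot captures a consistent state.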