Amazon EC2 Backup Strategy With Restrictions (little to no snapshots can be taken?)

Solution 1:

There is something interesting about this question - specifically with regard to the idea of downtime. Part of the idea being that if an application is sensitive to downtime, then recovery time must also be factored in. (As an extreme argument, taking no backups requires no downtime - unless you happen to need those backups, in which case the downtime may approach infinity.)

A bit about EBS

EBS volumes and snapshots operate at the block level, which allows snapshots to be taken while an instance is running, even if the EBS volume is in use. However, only data that is actually on the disk (i.e. not in a file cache) will be included in the snapshot. It is this caveat that gives rise to the idea of consistent snapshots.

  • The recommended way is to detach the volume, snapshot it, and reattach it - usually not practical.
  • The next best option involves flushing the write caches to disk, freezing the file system, and taking your snapshot (see the sketch below).
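
As a minimal sketch of that second option - assuming a data filesystem mounted at /data on a hypothetical volume vol-1234abcd, and using boto3 together with the util-linux fsfreeze command (run as root) - the sequence might look like this:

```python
# Sketch: flush caches, freeze the filesystem, initiate the snapshot, unfreeze.
# The mount point, volume ID, and region are assumptions for illustration.
import subprocess
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
MOUNT_POINT = "/data"          # hypothetical data mount (not the root volume!)
VOLUME_ID = "vol-1234abcd"     # hypothetical EBS volume ID

subprocess.check_call(["sync"])                         # flush write caches
subprocess.check_call(["fsfreeze", "-f", MOUNT_POINT])  # freeze the filesystem
try:
    # Once create_snapshot returns, the snapshot is consistent to this point
    # in time - there is no need to wait for it to complete before unfreezing.
    snap = ec2.create_snapshot(VolumeId=VOLUME_ID,
                               Description="consistent backup of /data")
    print("started snapshot:", snap["SnapshotId"])
finally:
    subprocess.check_call(["fsfreeze", "-u", MOUNT_POINT])  # unfreeze promptly
```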

An interesting point here is that in both cases above, you do not need to wait for the snapshot to finish before reattaching/unfreezing and resuming writes to the disk - once the snapshot has been initiated, your data will be consistent to that point in time. Typically this only requires a few seconds, during which your disk is write-locked. Also, since most databases structure their files on disk in a reasonable manner, there is a good chance that inserts have a minimal effect on the existing blocks, which minimizes the data added to the snapshot.

Consider the point of the backup

EBS volumes are already replicated within an availability zone - so there is a degree of redundancy built in. If your instance terminates, you can simply attach the EBS volume to a new instance and (after you get past the lack of consistency) resume where you left off. In many regards this makes the EBS volume much like an inconsistent snapshot, provided that you can access it. That said, most EC2 users probably recall the cascading failures of EBS volumes from early 2011 - volumes were inaccessible for multiple days, and some users lost data as well.

RAID1

If you are trying to safeguard against the failure of an EBS disk (it does happen), you may consider a RAID1 setup. Since EBS volumes are block devices, you can easily use mdadm to set them up in your desired configuration. If one of your EBS volumes isn't performing to spec, it is easy enough to manually fail it (and later replace it with another EBS volume). Of course, this has downsides - increased time for every write, greater susceptibility to variable performance, double the I/O cost (monetarily, not performance-wise), no real protection against a more widespread AWS failure (a common problem last year was the inability to detach EBS volumes that were in a locked state), and of course, the inconsistent state of the disk on failure.
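
For illustration, a sketch of that mdadm workflow - the device names (/dev/xvdf, /dev/xvdg, /dev/xvdh) are hypothetical and will vary by instance:

```python
# Sketch: mirror two attached EBS volumes with mdadm, then manually fail and
# replace one of them. Run as root; device names are examples only.
import subprocess

def run(*cmd):
    subprocess.check_call(list(cmd))

# Create the RAID1 array from the two EBS block devices
run("mdadm", "--create", "/dev/md0", "--level=1", "--raid-devices=2",
    "/dev/xvdf", "/dev/xvdg")

# Later, if /dev/xvdf underperforms: fail it, remove it, add a replacement
run("mdadm", "/dev/md0", "--fail", "/dev/xvdf")
run("mdadm", "/dev/md0", "--remove", "/dev/xvdf")
run("mdadm", "/dev/md0", "--add", "/dev/xvdh")  # newly attached EBS volume
```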

S3FS

An option for certain applications (definitely NOT for databases) is to mount S3 as a local file system (e.g. via s3fs). This is slow, lacks some of the features one would expect from a file system, and may not behave as expected (eventual consistency). That said, for a simple purpose like making uploaded files available across instances, it may have merit. Obviously it isn't for anything that requires good read/write performance.
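
A rough sketch of such a mount, assuming the s3fs tool is installed, a hypothetical bucket name, and credentials in ~/.passwd-s3fs:

```python
# Sketch: mount an S3 bucket with s3fs for shared, low-performance file access.
# Bucket name and mount point are assumptions for illustration.
import subprocess

subprocess.check_call(["mkdir", "-p", "/mnt/uploads"])
# s3fs reads credentials from ~/.passwd-s3fs (format: ACCESS_KEY:SECRET_KEY);
# allow_other lets non-root users access the mount.
subprocess.check_call(["s3fs", "my-upload-bucket", "/mnt/uploads",
                       "-o", "allow_other"])
```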

MySQL bin-log

One more option specific to MySQL may be the use of the bin-log. You can setup a second EBS volume that will store your bin-log (to minimize the effect of the added writes on your database), and use that in conjunction with whatever database dumps you take. You might even be able to do this with s3fs, which may actually have merit if the performance is tolerable (an rsync would probably better though than trying to use s3fs directly, and you will definitely want to compress what you can).
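
As a sketch of that approach - rotating the bin-log, compressing the closed log file, and pushing it to S3 with boto3; the paths, bucket name, and log file name are hypothetical, and mysql is assumed to pick up credentials from ~/.my.cnf:

```python
# Sketch: close out the current MySQL binary log, compress it, upload to S3.
import gzip
import shutil
import subprocess
import boto3

BINLOG = "/binlog/mysql-bin.000042"   # hypothetical just-closed log file
BUCKET = "my-db-binlogs"              # hypothetical S3 bucket

# FLUSH BINARY LOGS closes the current bin-log and starts a new one
subprocess.check_call(["mysql", "-e", "FLUSH BINARY LOGS;"])

# Compress before upload - bin-logs compress well
with open(BINLOG, "rb") as src, gzip.open(BINLOG + ".gz", "wb") as dst:
    shutil.copyfileobj(src, dst)

boto3.client("s3").upload_file(BINLOG + ".gz", BUCKET, "mysql-bin.000042.gz")
```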

Once again, though, we come back to the idea of purpose. Consider what would happen with the above suggestions:

  • EBS volumes inaccessible:
    • RAID1 - useless, since you can't get to the data
    • bin-log - useless, unless you exported it to S3 (though likely with some delay if you did)
  • Instance terminates unexpectedly:
    • RAID1 - your disks are available, but not consistent; your database may recover from the inconsistency on its own
    • bin-log - your data should be accessible, although you may need to review the last few events
  • Someone runs DROP DATABASE as root:
    • RAID1 - you have two perfect copies of a non-existent database
    • bin-log - you should be able to replay the events up to just before the command, so you should be ok

So really, RAID1 is mostly useless, and bin-log recovery takes too long - both may have merit under certain circumstances, but are far from the ideal backup.

Snapshots

It is important to note that snapshots are differential, only store the blocks that actually contain data, and are compressed. Unlike an EBS volume - where a 20GB volume with only 1GB in use is still charged for the full 'provisioned' 20GB - a snapshot is only charged for what you use. If no data changes between snapshots, there is (theoretically) no charge. (Snapshots are charged for PUTs/GETs and used storage.)

As an aside, I would highly recommend that your application data (including databases) not be stored on your root volume (which you may already have set up). One of the advantages is that, hopefully, your root volume sees a minimum of change - meaning that its snapshots can be less frequent (or will contain minimal changes), reducing cost and simplifying management.

It is also relevant to mention that you can delete old snapshots at any time - even though snapshots are differential, deleting one will not affect the others. That said, each block allocated to a snapshot will not be relinquished until no remaining snapshot references that block.
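
A minimal pruning sketch along those lines, using boto3 and a hypothetical volume ID, keeping only the 7 most recent snapshots:

```python
# Sketch: prune snapshots of one volume, keeping the 7 newest. Deleting an
# older snapshot never breaks a newer one - shared blocks are retained until
# no snapshot references them.
import boto3

ec2 = boto3.client("ec2")
snaps = ec2.describe_snapshots(
    Filters=[{"Name": "volume-id", "Values": ["vol-1234abcd"]}],
    OwnerIds=["self"],
)["Snapshots"]

snaps.sort(key=lambda s: s["StartTime"], reverse=True)  # newest first
for old in snaps[7:]:
    ec2.delete_snapshot(SnapshotId=old["SnapshotId"])
```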

The problem with periodic dumps is, firstly, the window of data loss between dumps (possibly addressed by using MySQL's bin-log) and, secondly, the difficulty of recovery. It takes time to import a large dump and replay all the events from a bin-log, and creating a dump is not without its performance implications. Arguably, such dumps also require far more storage than a snapshot. Setting up an EBS volume solely for the databases and snapshotting that would be preferable in most regards (that said, taking a snapshot has a bit of a performance implication as well).

The beauty of snapshots and EBS volumes is that they can be used on other instances. If your instance fails to boot, you can attach the root volume to another instance to diagnose and fix the problem - or just to copy your data off it - and can switch root volumes with only a couple minutes of downtime (stop the instance, detach the root volume, attach a new root volume, start the instance). This same idea applies to having your data on a second EBS volume. Essentially, you just spin up a new instance from your custom AMI, and attach your current EBS volume to it - it helps minimize downtime.
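
A sketch of that root-volume swap using boto3 - the instance ID, volume IDs, and root device name (/dev/xvda here) are hypothetical and vary by AMI:

```python
# Sketch: stop the instance, swap in a replacement root volume, start it again.
import boto3

ec2 = boto3.client("ec2")
INSTANCE = "i-0123456789abcdef0"  # hypothetical instance ID
OLD_ROOT = "vol-old123"           # current (bad) root volume
NEW_ROOT = "vol-new456"           # e.g. created from a known-good snapshot

ec2.stop_instances(InstanceIds=[INSTANCE])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[INSTANCE])

ec2.detach_volume(VolumeId=OLD_ROOT, InstanceId=INSTANCE)
ec2.get_waiter("volume_available").wait(VolumeIds=[OLD_ROOT])

ec2.attach_volume(VolumeId=NEW_ROOT, InstanceId=INSTANCE, Device="/dev/xvda")
ec2.start_instances(InstanceIds=[INSTANCE])
```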

(One could make the argument (and I probably wouldn't recommend it) that you could setup two copies of MySQL on the same server (Master-slave), using two EBS volumes, and then shutdown your slave to take a snapshot of its EBS volume - it will be consistent, with no downtime - but the performance costs are likely not worth it).

AWS does have auto scaling - which can maintain a constant number of instances (even if that number is 1) - but you would be deploying from a snapshot, so if your snapshot is not usable, the premise isn't of much use.

Another couple of points - you can deploy as many instances as you want from a single snapshot (unlike an EBS volume, which can only be attached to a single instance at any given time). Also, EBS volumes are restricted to use within an availability zone, while snapshots can be used within a region.
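
To illustrate the regional scope of snapshots, a one-call sketch (hypothetical IDs) that seeds a new volume in a different availability zone of the same region:

```python
# Sketch: a snapshot can back a new volume in any AZ of its region, unlike the
# source volume, which is pinned to the AZ it was created in.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
vol = ec2.create_volume(SnapshotId="snap-1234abcd",   # hypothetical snapshot
                        AvailabilityZone="us-east-1b")
print("new volume:", vol["VolumeId"])
```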

Ideally, with a snapshot, if your server goes down you can just launch a new one using the last snapshot. Especially if you separate your root volume from your data, a bad update should result in a minimum of downtime, since you would just transfer the EBS volume containing your data across (and take a snapshot of it first to preserve anything that might be corrupted due to inconsistency).

As a side note, Amazon states the failure rate of EBS volumes increases with the amount of data changed on them since the last snapshot.

Final recommendations

  • Use snapshots - they are great - they reduce downtime much more than they cause it
  • Separate data and the root volume, perhaps even putting the databases on their own volume, and snapshot before updates if necessary
  • Use the bin-log to stay as 'hot' as possible - upload this (compressed) to S3
  • Ensure you actually get the data off the instance (even if the data is intact on an EBS volume, the volume itself might be temporarily inaccessible).

Recommended Reading:

  • Amazon's page on EBS
  • EBS FAQs

(I do believe I have written too much, but not said enough - but hopefully you find something worth the read).

Solution 2:

It is possible to snapshot a live EBS volume; however, you must take care to ensure that the filesystem is in a consistent state and then frozen while the snapshot is initiated. Not all filesystems allow for this, though it is definitely possible and is the basis of our own backup solution.

EBS snapshots are also pretty cheap, as you are only charged for changed data, and the data charges are very reasonable in and of themselves. Keep in mind, though, that charges are based on block-level changes, so they can add up rather quickly. The same applies between snapshots: only data changed since the previous snapshot is charged. To give you an idea of costs, we pay <$10 per month for snapshot storage, taking daily snapshots and keeping the last 7 dailies plus the last month's worth of weekly snapshots, with 2 servers following this scheme (~20 snapshots, 20GB hard drives).

Solution 3:

How about an inexpensive backup solution like Zmanda Cloud Backup? We use it to back up about 6 servers and 1 SQL Server, and it's only about $10 per month. You can encrypt your data with a self-signed cert, and they use S3 to store the data (so there are no data transfer fees if you are backing up from EC2).