PostgreSQL EC2 Backup Strategy
I'm trying to determine the best approach to backing up a PostgreSQL database on Amazon EC2. I've read about a couple of options.
1) Take a daily snapshot of the EBS volume your database is using.
The problem I see with this approach is what happens when the snapshot is taken during a write? Won't my data then be corrupt?
2) Use pg_dump, compress the files and store in S3.
pg_dump wont create corrupt data, however restoring the database from pg_dump can be time consuming
What strategy should be used? Option 1 seems tempting because its easier to do and restore but am I taking a risk by using that approach?
Solution 1:
I can't speak for PostgreSQL as I don't use it, but I use a variation on your option 1 for backing up MySQL databases on EC2, and have successfully restored them, without issue.
The first requirement, of course, is that your databases are stored on an EBS volume so that they can be snapshotted. I favour using XFS as the filesystem since the entire filesystem can easily be frozen.
To begin the snapshotting process, you want to freeze your databases, and flush your tables. There is a great script that will do this, as well as freezing your filesystem (if xfs) called ec2-consistent-snapshot (the site does have some comments on PostgreSQL that may point in your an acceptable direction - it is designed for Ubuntu, but, works on other distributions (e.g. Amazon's Linux/CentOS) without much issue). My understanding is that with PostgreSQL, people often simply take the snapshot (after freezing the file system) and rely on PostgreSQL's built-in recovery abilities to restore everything to a functioning state. The xfs_freeze, is still important to get a consistent snapshot though.
Once your file system is frozen (and your database flushed and locked if possible), take your snapshot (ideally, use the API directly, as opposed to the (very) slow Java based commands). The snapshot command only takes a (few) second(s) to return, after which you can unfreeze the file system - the snapshot created will be consistent, despite additional reads.
Given that snapshots are differential (in a way) and compressed, this approach is much more economical than using S3, as well as offering more options, it also allows you to restore data far faster, and should lock up your databases for a shorter period than generating a dump. It is also possible to rotate your snapshots to keep their numbers under control - I wrote a perl script to do that for me.
If in doubt, try the first option, and then create an EBS volume and test the database to see that all works - don't just trust that the backup is good.