How can one efficiently use S3 to back up files incrementally?
I understand how rsync works at a high level, but it needs rsync-aware software on both ends. With S3 there is no daemon to speak of; well, there is a service, but it basically just speaks HTTP.
There seem to be a few approaches:
s3rsync (but this just bolts rsync onto S3). Straightforward, but I'm not sure I want to depend on a third-party service. I wish S3 just supported rsync.
There are also some rsync 'clones' like duplicity that claim to support S3 without that bolt-on (example invocation below). But how can they do this? Are they keeping an index file locally? I'm not sure how that can be as efficient.
I obviously want to use S3 because it's cheap and reliable, but there are jobs that rsync is the right tool for, like backing up a giant directory of images.
What are the options here? What do I lose by using duplicity + s3 instead of rsync + s3rsync + s3?
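For context, the kind of duplicity run I have in mind looks roughly like this; the bucket name and paths are placeholders, and the exact S3 URL scheme depends on the duplicity version (older releases use s3+http://, newer ones accept boto3+s3://):

# Credentials are read from the environment (placeholder values).
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
# Incremental backup of a local image directory to an S3 bucket.
duplicity /home/me/images s3+http://my-backup-bucket/images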
Solution 1:
Since this question was last answered, there is a new AWS command line tool, aws.
It can sync, rsync-like, between local storage and s3. Example usage:
aws s3 sync s3://mybucket /some/local/dir/
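It also works in the other direction, and a dry run lets you preview what would be transferred first; mybucket is a placeholder here, and --delete (which removes remote files that no longer exist locally) is optional:

# Preview the changes without transferring anything.
aws s3 sync /some/local/dir/ s3://mybucket --dryrun
# Upload new/changed files; add --delete to also remove files gone locally.
aws s3 sync /some/local/dir/ s3://mybucket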
If your system's Python environment is set up properly, you can install the AWS client using pip:
pip install awscli
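After installing, the client needs credentials; the interactive aws configure command prompts for your access key, secret key, default region, and output format, and stores them for later runs:

aws configure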
Solution 2:
The s3cmd tool has a great sync option. I use it to sync local backups, using something like:
s3cmd sync --skip-existing $BACKUPDIR/weekly/ s3://MYBACKUP/backup/mysql/
The --skip-existing flag means it doesn't try to compare checksums of files that are already there: if a file with that name exists in the bucket, it just quickly skips it and moves on. There is also a --delete-removed option which removes files that no longer exist locally, but I want to keep files on S3 even after I've cleaned them up locally, so I don't use it.
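For anyone who does want the bucket to mirror the local directory exactly, the same command with --delete-removed would look roughly like this (same placeholder bucket and paths as above); if your s3cmd version supports --dry-run, it's worth running that first to see what would be deleted:

# Preview what would be uploaded and removed without touching the bucket.
s3cmd sync --dry-run --delete-removed $BACKUPDIR/weekly/ s3://MYBACKUP/backup/mysql/
# Mirror the directory, removing remote files that no longer exist locally.
s3cmd sync --delete-removed $BACKUPDIR/weekly/ s3://MYBACKUP/backup/mysql/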