So what happens if I upload a file/archive and the file later changes locally? The next time I do a backup, how does Glacier deal with this, since it can't overwrite the file with a new version?

Per the Glacier FAQ:

You store data in Amazon Glacier as an archive. Each archive is assigned a unique archive ID that can later be used to retrieve the data. An archive can represent a single file or you may choose to combine several files to be uploaded as a single archive. You upload archives into vaults. Vaults are collections of archives that you use to organize your data.

So what this means is that each file you upload is assigned a unique ID. Upload the same file twice and each copy gets its own ID. This gives you the ability to restore previous versions of the file if desired.
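As a minimal sketch of what that bookkeeping looks like with the boto3 Python SDK (the vault name, file path, and local inventory file below are made-up placeholders, not anything Glacier provides): every call to upload_archive returns a fresh archive ID, and it's entirely up to you to record which ID corresponds to which file and version.

```python
import json
from datetime import datetime, timezone

import boto3

glacier = boto3.client("glacier")

VAULT = "my-backup-vault"             # placeholder vault name
INVENTORY = "glacier-inventory.json"  # hypothetical local record of uploads


def upload_file(path):
    """Upload one file as a Glacier archive and record its archive ID locally."""
    with open(path, "rb") as f:
        response = glacier.upload_archive(
            vaultName=VAULT,
            accountId="-",            # "-" means the account owning the credentials
            archiveDescription=path,
            body=f,
        )

    # Every upload -- even of an identical file -- returns a brand-new
    # archive ID, so the local inventory accumulates one entry per version.
    try:
        with open(INVENTORY) as f:
            inventory = json.load(f)
    except FileNotFoundError:
        inventory = []

    inventory.append({
        "path": path,
        "archiveId": response["archiveId"],
        "uploaded": datetime.now(timezone.utc).isoformat(),
    })
    with open(INVENTORY, "w") as f:
        json.dump(inventory, f, indent=2)

    return response["archiveId"]
```

Restoring a particular version is then a matter of picking the right archive ID out of that local inventory and initiating a retrieval job for it.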

Use the locally stored archive inventory to determine what data no longer exists locally and, if it's more than 3 months old, delete it from Glacier? That seems straightforward, but is there a better approach?

To avoid the surcharge for deleting data less than 3 months old, this is likely the best approach. But it isn't just the data that no longer exists locally that you need to track and delete. As mentioned above, any time a file changes and you re-upload it to Glacier, you get a new ID for that file. You'll eventually want to delete the older versions of the file as well, assuming you don't need the ability to restore to those older versions.
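Here's a rough sketch of that pruning pass, continuing the hypothetical local inventory from the example above (the 90-day cutoff stands in for Glacier's 3-month minimum): keep the newest archive for each path, and delete superseded versions, plus anything that no longer exists locally, only once they're past the cutoff.

```python
import json
import os
from datetime import datetime, timedelta, timezone

import boto3

glacier = boto3.client("glacier")

VAULT = "my-backup-vault"             # placeholder vault name
INVENTORY = "glacier-inventory.json"  # same hypothetical local inventory as above
CUTOFF = datetime.now(timezone.utc) - timedelta(days=90)


def prune_old_archives():
    """Delete superseded versions and locally deleted files from Glacier,
    but only once they're old enough to avoid the early-deletion surcharge."""
    with open(INVENTORY) as f:
        inventory = json.load(f)

    # The most recent upload for each path is the "current" version.
    # (ISO-8601 UTC timestamps sort chronologically as strings.)
    newest = {}
    for rec in inventory:
        if rec["path"] not in newest or rec["uploaded"] > newest[rec["path"]]["uploaded"]:
            newest[rec["path"]] = rec

    keep = []
    for rec in inventory:
        uploaded = datetime.fromisoformat(rec["uploaded"])
        superseded = rec is not newest[rec["path"]]
        gone_locally = not os.path.exists(rec["path"])
        if (superseded or gone_locally) and uploaded < CUTOFF:
            glacier.delete_archive(
                vaultName=VAULT,
                accountId="-",
                archiveId=rec["archiveId"],
            )
        else:
            keep.append(rec)

    with open(INVENTORY, "w") as f:
        json.dump(keep, f, indent=2)
```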

If a 20 MB zip file is uploaded that contains 10,000 files, and one of those files is changed locally, do I need to upload another 20 MB zip file? Now I'm required to eat the cost of storing two copies of almost everything in those zip files... Also, how would I deal with deleting things in a zip file that don't exist locally anymore? Since I don't want to delete the whole zip file, I'm now incurring fees to store files that don't exist anymore.

That's the tradeoff you really have to decide for yourself: do you tar/zip everything and then have to track those archives and everything in them, or is it worth it to you to upload files individually so you can purge them individually as they're no longer needed?

A couple other approaches you might consider:

  • Have two or more tar/zip archives: one containing files that are highly unlikely to change (like system files), and the other(s) containing configuration files and other things that are more likely to change over time.
  • Don't bother tracking individual files: back everything up in a single tar/zip archive that gets uploaded to Glacier. As each archive reaches the 3-month point (or possibly even later), just delete it. That gives you a very easy way to track and restore from a given point in time (a rough sketch of this rotation follows below).
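If that second approach appeals, here is a rough sketch of how such a rotation might look with boto3 (the vault name, backup directory, and retention window are placeholder assumptions; a real script would persist the backup log to disk and use multipart uploads for large tarballs):

```python
import tarfile
import tempfile
from datetime import datetime, timedelta, timezone

import boto3

glacier = boto3.client("glacier")

VAULT = "my-backup-vault"          # placeholder vault name
BACKUP_ROOT = "/srv/data"          # placeholder: the directory tree to back up
RETENTION = timedelta(days=90)     # roughly Glacier's 3-month minimum

# Hypothetical in-memory log of full backups: (archiveId, upload time) pairs.
# A real script would persist this, e.g. alongside the inventory file above.
backups = []


def run_full_backup():
    """Tar up everything, upload it as one archive, then expire old backups."""
    now = datetime.now(timezone.utc)

    with tempfile.NamedTemporaryFile(suffix=".tar.gz") as tmp:
        with tarfile.open(tmp.name, "w:gz") as tar:
            tar.add(BACKUP_ROOT)
        tmp.seek(0)
        response = glacier.upload_archive(
            vaultName=VAULT,
            accountId="-",
            archiveDescription=f"full-backup-{now:%Y-%m-%d}",
            body=tmp,
        )
    backups.append((response["archiveId"], now))

    # Delete whole backup archives once they've aged past the retention window.
    for archive_id, uploaded in list(backups):
        if now - uploaded > RETENTION:
            glacier.delete_archive(vaultName=VAULT, accountId="-", archiveId=archive_id)
            backups.remove((archive_id, uploaded))
```

Restoring a given point in time then just means retrieving the single archive uploaded on that date.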

Having said all that, however, Glacier may just not be the best approach for your needs. Glacier is really meant for data archiving, which is different from just backing up servers. If you just want to do incremental backups of a server, then using S3 instead of Glacier might be a better approach. A tool like Duplicity or rdiff-backup (in conjunction with something like s3fs) would give you the ability to take incremental backups to an S3 bucket and manage them very easily. I've used rdiff-backup on a few Linux systems over the years and found it worked quite nicely.


Here is a command-line tool for *nix that supports uploading only modified files, replacing locally modified files, and deleting locally removed files from Glacier: https://github.com/vsespb/mt-aws-glacier