AWS EC2 snapshots - how long should they be retained? [closed]
How long should daily EBS-backed EC2 snapshots be retained? We're using ec2-automate-backup to back up (daily) two EBS volumes - OS and data - pertaining to a web application. If I've understood correctly, in the event of failure we could create new instances from the most recent snapshots.
However, I believe that these snapshots are incremental: even though each is listed (in the AWS console) as being the same size as the EBS volume it was created from, I think they're just recording changes. Is that right?
This is definitely where my understanding of snapshots falls down, though: I don't understand how, if we delete older snapshots, we can be sure of retaining all the data we need, and therefore I don't know how long we should be hanging on to them.
UPDATE Moments later I found this, which seems to suggest I can literally delete all but the most recent with impunity. If that is the case and it's felt that this could be useful to others I can answer this myself, or if it's just too obvious feel free to close this.
I can literally delete all but the most recent with impunity
Assuming you don't need any data that was already deleted or overwritten on the volume when you took the most recent snapshot, that's true.
EBS snapshots are logically incremental -- not physically incremental. Here's the cleverness that explains the difference:
Snapshots of EBS volumes don't technically contain data... they contain lists of pointers to backed-up data blocks, which EBS stores in S3 on your behalf (and bills you for the storage of). With each new snapshot, if blocks are encountered on the volume that are unchanged from the prior snapshot, and thus already stored in S3 with the same content, they're not stored again -- the new snapshot just references the blocks already stored by another snapshot job... which is why you probably don't have an outlandish storage bill.
This is what I mean by "logically" incremental. The newly (since the last snapshot) changed blocks are preserved in S3 but they aren't really "in" the latest snapshot -- they're referenced by it, and by any future snapshot that's made, until they change.
EBS snapshots are completely filesystem agnostic. They have no idea about how the blocks are used, only that they changed between snapshots. Snapshots are a block-level (not file-level) operation, so, within whatever the granularity of blocks is,¹ if only part of a large file were changed, in place (without moving the file on the disk) then only the changed portion of the file would be newly backed-up. (A simple example would be a continually-growing log file).
When you delete snapshots, the blocks referenced by those snapshots are purged from S3 storage (stopping the billing for storage of those blocks) if and only if no other snapshots reference them. Otherwise, of course, they're preserved, because they are still needed.
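The pointer-and-purge behavior described above can be modeled in a few lines. This is only a toy sketch of the idea, not how EBS is actually implemented: the `SnapshotStore` class, its method names, and the one-character "blocks" are all hypothetical, and a plain dict stands in for the S3-backed block storage.

```python
class SnapshotStore:
    """Toy model: a snapshot is a set of pointers into a shared block store."""

    def __init__(self):
        self.blocks = {}      # block_id -> content (stands in for S3 storage)
        self.snapshots = {}   # snapshot name -> {volume offset: block_id}
        self._next_id = 0

    def take_snapshot(self, name, volume, previous=None):
        """Snapshot `volume` (a list of block contents), reusing unchanged blocks."""
        prev = self.snapshots.get(previous, {})
        pointers = {}
        for offset, content in enumerate(volume):
            old_id = prev.get(offset)
            if old_id is not None and self.blocks[old_id] == content:
                pointers[offset] = old_id              # unchanged: reference only
            else:
                self.blocks[self._next_id] = content   # changed: store a new block
                pointers[offset] = self._next_id
                self._next_id += 1
        self.snapshots[name] = pointers

    def delete_snapshot(self, name):
        """Drop a snapshot; purge blocks no remaining snapshot references."""
        del self.snapshots[name]
        referenced = {b for ptrs in self.snapshots.values() for b in ptrs.values()}
        for block_id in list(self.blocks):
            if block_id not in referenced:
                del self.blocks[block_id]

    def restore(self, name):
        """Rebuild the full volume contents from a snapshot's pointers."""
        ptrs = self.snapshots[name]
        return [self.blocks[ptrs[offset]] for offset in sorted(ptrs)]


store = SnapshotStore()
store.take_snapshot("mon", ["a", "b", "c", "d"])
store.take_snapshot("tue", ["a", "B", "c", "d"], previous="mon")
store.take_snapshot("wed", ["a", "B", "C", "d"], previous="tue")
blocks_before = len(store.blocks)   # four original blocks plus two changed ones

store.delete_snapshot("mon")
store.delete_snapshot("tue")
blocks_after = len(store.blocks)    # only what "wed" still points at
restored = store.restore("wed")
```

In this model, deleting the two older snapshots frees only the two superseded blocks, and the newest snapshot still restores the whole volume, because it holds a pointer for every offset rather than a diff against its predecessor.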
If you delete all but the most recent snapshot, every block stored in S3 that is not needed to restore that one snapshot is purged, so your billable snapshot storage would be equal to the size of the volume, because only those blocks would remain in S3 storage. (Technically, it should be smaller, since EBS apparently applies a reversible compression algorithm to snapshots, though the details are not public. In principle, though, an 8 GB volume with exactly one snapshot references exactly 8 GB of snapshot blocks.)
This is why snapshot sizes always show the volume size in the console and API, instead of some kind of "incremental" size -- a snapshot doesn't "contain" any data, but it contains pointers to exactly enough backup data blocks to fill the volume with content identical to what existed on the volume when the snapshot job started. And this is where your "impunity" comes in.
Purging all those old snapshots will purge some of the backup blocks and save you some money; how much depends on how much the volume changes between snapshots. If it changes very little, purging them frees very little backup block storage, and they aren't costing you much in the first place.
Because files might be deleted or overwritten some days before anyone notices the problem, it seems wise to keep more than just one day of snapshots, but that reasoning is unrelated to the way EBS snapshots work.
My policy, implemented through in-house automation, is to keep a daily snapshot each day for several days, prune those down to one snapshot per week for several weeks, and finally retain a monthly snapshot for each volume indefinitely, or for a shorter period, depending on retention policies. (My automation uses "magic" tags on the volumes to customize the retention and timing at a per-volume level, but that default policy is used on most volumes.)
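A daily/weekly/monthly pruning rule of this kind might look roughly like the sketch below. It is not the automation described above: the function name, the `daily_days` and `weekly_weeks` defaults, and the choice to keep the oldest snapshot in each week or month are all assumptions made for illustration, and it only decides which dates to keep without calling any AWS API.

```python
from datetime import date, timedelta


def snapshots_to_keep(snapshot_dates, today, daily_days=7, weekly_weeks=4):
    """Return the subset of snapshot dates a daily/weekly/monthly policy keeps.

    daily_days and weekly_weeks are hypothetical defaults, not recommendations.
    """
    keep = set()
    weeks_seen, months_seen = set(), set()
    daily_cutoff = today - timedelta(days=daily_days)
    weekly_cutoff = daily_cutoff - timedelta(weeks=weekly_weeks)

    for d in sorted(snapshot_dates):            # oldest first
        if d > daily_cutoff:
            keep.add(d)                         # recent: keep every daily snapshot
        elif d > weekly_cutoff:
            week = tuple(d.isocalendar())[:2]   # (ISO year, ISO week number)
            if week not in weeks_seen:          # keep the first snapshot per week
                weeks_seen.add(week)
                keep.add(d)
        else:
            month = (d.year, d.month)
            if month not in months_seen:        # keep the first snapshot per month
                months_seen.add(month)
                keep.add(d)
    return keep


today = date(2024, 3, 31)
all_dates = {today - timedelta(days=i) for i in range(90)}   # 90 daily snapshots
keep = snapshots_to_keep(all_dates, today)
to_delete = all_dates - keep
```

Because each surviving snapshot is complete on its own, the dates in `to_delete` can be removed in any order without affecting the ones that are kept.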
Incidentally, with the talk of S3, it probably bears clarifying that EBS is S3's "customer" in this setup, not you -- that's why you can't see this backup data in S3.
¹ "whatever the granularity of blocks is" – By this, I mean the size of a "backup block" from EBS's perspective. This size is, as far as I know, undocumented, but my assumption is that a "block" in this context is almost certainly larger than the "block size" of the device as presented to the operating system, since a backup block size of single-digit KiB would result in an awkwardly large number of blocks to juggle, track, store, and reload.
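To put rough numbers on that intuition, the arithmetic below counts how many blocks would need tracking for a single volume at two block sizes. The 1 TiB volume, the 4 KiB figure (a common filesystem block size), and the 512 KiB figure are all illustrative assumptions, not statements about what EBS uses internally.

```python
KIB = 1024
VOLUME_BYTES = 1024 ** 4   # a hypothetical 1 TiB volume


def block_count(volume_bytes, block_kib):
    """Number of backup blocks needed to cover the volume at a given block size."""
    block_bytes = block_kib * KIB
    return -(-volume_bytes // block_bytes)   # ceiling division


small_blocks = block_count(VOLUME_BYTES, 4)     # 4 KiB blocks (assumed)
large_blocks = block_count(VOLUME_BYTES, 512)   # 512 KiB blocks (assumed)
ratio = small_blocks // large_blocks
```

Under these assumptions the smaller block size means over a hundred times as many pointers per snapshot, which is the "awkwardly large number of blocks" concern in concrete terms.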