Do snapshots + RAID count as a good on-site backup solution?

The two main reasons I can think of for taking backups seem to be taken care of when I use both snapshots and RAID together with btrfs. (By RAID here, I mean RAID1 or RAID10.)

  • Accidental deletion of data: Snapshots cover this case
  • Failure of a drive and bit rot
    • Complete failure: RAID covers this case
    • Drive returning bad data: RAID + btrfs' checksum-based error correction (a bad block is detected and repaired from the good mirror) covers this case

So as an on-site backup solution, this seems to work fine, and it doesn't even need a separate data storage device for it!

However, I have heard that both RAID and snapshots aren't considered proper backups, so I'm wondering if I have missed anything.

Aside from btrfs not being a mature technology yet, can you think of anything I've missed? Or is my thinking correct and this is a valid on-site backup solution?


Solution 1:

No, it's not.

What happens when your filesystem or RAID volume gets corrupted? Or your server gets set on fire? Or someone accidentally formats the wrong array?

You lose all your data and the not-real-backups you thought you had. That's why real backups are on a completely different system than the data you're backing up - because backups protect against something happening to the system in question that would cause data loss. Keep your backups on the same system as you're backing up, and data loss on that system can impact your "backups" as well.

Solution 2:

For on-site backup, snapshots might be good enough, provided that you regularly 'export' each snapshot somewhere else, where it exists as passive data.

Also, regularly test whether your 'shipped' snapshot can actually be restored.

This is how I implemented a quick backup of some of my servers: store the data on ZFS, take a ZFS snapshot, send the delta to another server, where the whole filesystem is re-created (minus the actual service running).
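A rough sketch of that snapshot-and-ship step, assuming the standard `zfs snapshot`, `zfs send -i` and `zfs receive` commands and SSH access to the receiving server. The dataset, snapshot and host names are made up for illustration, and the functions only build the command strings so they can be reviewed before anything is run.

```python
import shlex


def snapshot_cmd(dataset, snap):
    """Build the command that creates the snapshot dataset@snap."""
    return ["zfs", "snapshot", f"{dataset}@{snap}"]


def incremental_send_pipeline(dataset, prev_snap, new_snap, remote_host, remote_dataset):
    """Build a shell pipeline that ships only the delta between two snapshots.

    Assumption: prev_snap already exists on both sides from an earlier run.
    Note: `zfs receive -F` rolls the receiving dataset back to its latest
    snapshot first, discarding any changes made on the receiving side.
    """
    send = ["zfs", "send", "-i", f"{dataset}@{prev_snap}", f"{dataset}@{new_snap}"]
    recv = ["ssh", remote_host, "zfs", "receive", "-F", remote_dataset]
    return " | ".join(shlex.join(part) for part in (send, recv))


if __name__ == "__main__":
    # Hypothetical names: tank/data, backuphost, backup/data
    print(shlex.join(snapshot_cmd("tank/data", "2024-01-02")))
    print(incremental_send_pipeline(
        "tank/data", "2024-01-01", "2024-01-02", "backuphost", "backup/data"))
```

The very first transfer has no previous snapshot to diff against, so it would need a full `zfs send` without `-i`.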

Of course, the best backup is always off-site. Thus, after 'shipping' the snapshot(s) to a separate system, do a 'tape-out' of the snapshots regularly.

So, in my system, the server that receives the snapshot deltas regularly dumps all its ZFS pools (including earlier snapshots) to tape.

And of course, test your tape-outs to ensure they can be restored.
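One way to make "test the restore" concrete is to restore into a scratch directory and compare file checksums against the source. This is only a sketch: the function names are mine, it reads each file fully into memory (fine for a spot check, not for huge files), and it assumes the source has not changed since the backup was taken.

```python
import hashlib
from pathlib import Path


def tree_digests(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def restore_mismatches(source_root, restored_root):
    """Return sorted relative paths that are missing, extra, or different."""
    src = tree_digests(source_root)
    dst = tree_digests(restored_root)
    # A path present on only one side yields None from .get(), so it mismatches.
    return sorted(p for p in src.keys() | dst.keys() if src.get(p) != dst.get(p))
```

An empty list from `restore_mismatches` means every file came back byte-for-byte identical; anything else is a restore you could not have relied on.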

Note: You will want the snapshot to take place while disk activity is quiesced, and preferably in coordination with the database (if any) to ensure consistency; otherwise, the cure might be worse than the illness. That's why the 'live snapshot' feature from NetApp and EMC is very useful: it postpones a LUN's snapshot until the database using the LUN indicates that it's safe to carry out the snapshot.

Solution 3:

What HopelessN00b said. No.

Proper backups are on a separate device from the device being backed up. What happens when you lose two or more drives? What happens when your server room burns down? What happens when someone accidentally destroys your array?

(Anecdote alert: I once heard of someone who had PXE set to auto-install the latest Fedora. His UPS failed. After a power outage, his server rebooted and was set to PXE boot and... installed Fedora over his data. My point? Freakish things happen. Fortunately, he had proper backups.)

Preferably, you have at least three copies of your data, one stored completely offsite in case the data center burns down.
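To make that rule of thumb checkable, here is a small sketch. The `Copy` type and `meets_copy_rule` function are hypothetical, and I'm assuming the live data counts as one of the three copies and that copies on the same system (for example RAID mirrors or snapshots) count only once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    system: str    # hypothetical label for the machine or medium holding the copy
    offsite: bool  # True if the copy lives outside the primary site


def meets_copy_rule(copies, min_copies=3):
    """True if there are at least min_copies on distinct systems, one offsite."""
    distinct_systems = {c.system for c in copies}
    return len(distinct_systems) >= min_copies and any(c.offsite for c in copies)


# Snapshots + RAID on a single server are still one system, so this fails.
print(meets_copy_rule([Copy("server", False), Copy("server", False)]))
# Live data, an on-site backup server, and an offsite tape vault pass.
print(meets_copy_rule([
    Copy("server", False), Copy("backup-box", False), Copy("tape-vault", True)]))
```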

Solution 4:

Your storage MUST support properly implemented snapshots, because decent backup software uses them as the very first stage of a backup job. However, it's a bad idea to use snapshots as your primary backup. Reasons:

1) Snapshots and backend storage CAN fail. So real backups must use a separate set of spindles, or there's a good chance of losing both the primary working set and the backup data at the same time.

2) Snapshots "chew away" usable space. It makes sense to use expensive, fast storage for current hot data and to off-load snapshots and backups, which are ice-cold data, to cheaper and slower storage. This fits very well with 1), BTW.

3) Snapshots usually slow down the whole process. Most systems use Copy-on-Write, and this approach creates fragmentation. Redirect-on-Write implementations are faster but eat A LOT of space. Very few vendors have properly implemented snapshots: NetApp with WAFL and Nimble Storage with CASL (I'm not affiliated with either of them). Pretty much everybody else has issues. For example, Dell EqualLogic triggers a 15 MB page update (and waste) for every single byte changed. That's EXPENSIVE.
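To put a number on that last point, here is a back-of-the-envelope sketch. It takes the 15 MB page size from the claim above at face value and assumes a simple model where the snapshot has to preserve one whole page for every distinct page touched by a write; the function name and the sample offsets are made up.

```python
MB = 1024 ** 2


def snapshot_bytes_preserved(write_offsets, page_size):
    """Bytes of snapshot space consumed when whole pages are preserved.

    Each write is a byte offset into the volume; any page containing at
    least one write is counted once, however small the write was.
    """
    pages_touched = {offset // page_size for offset in write_offsets}
    return len(pages_touched) * page_size


# Four tiny writes that land in two different 15 MB pages.
offsets = [0, 1, 15 * MB, 15 * MB + 5]
print(snapshot_bytes_preserved(offsets, 15 * MB) // MB, "MB")
```

Under this model, a thousand one-byte changes scattered across a thousand different pages would tie up roughly 15,000 MB of snapshot space.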