Can NetApp Snapshots be used as Backups?
Backups serve two functions.
- First and foremost, they're there to allow you to recover your data if it becomes unavailable. In this sense, snapshots are not backups. If you lose data on the filer (volume deletion, storage corruption, firmware error, etc.), all snapshots for that data are gone as well.
- Secondly, and far more commonly, backups are used to correct for routine things like accidental deletions. In this use case, snapshots are backups. They're arguably one of the best ways to provide this kind of recovery, because they expose earlier versions of the data directly to users or their OS through a hidden .snapshot directory, from which they can read their files back (see the sketch after this list).
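For those routine restores, the .snapshot directory behaves like any other read-only path. As a minimal sketch (the mount point, snapshot name, and file path below are entirely hypothetical), getting a deleted file back is just a copy:

```python
#!/usr/bin/env python3
"""Minimal sketch: restore a file from a NetApp .snapshot directory.

Assumes the volume is NFS-mounted and exposes the hidden .snapshot
directory; the mount point, snapshot name, and file path are
hypothetical placeholders.
"""
import shutil
from pathlib import Path

MOUNT = Path("/mnt/projects")            # hypothetical NFS mount of the volume
SNAPSHOT = "nightly.2024-01-15_0010"     # hypothetical snapshot name
LOST_FILE = Path("teamA/report.xlsx")    # path of the accidentally deleted file

src = MOUNT / ".snapshot" / SNAPSHOT / LOST_FILE
dst = MOUNT / LOST_FILE

dst.parent.mkdir(parents=True, exist_ok=True)
shutil.copy2(src, dst)                   # copy the older version back into place
print(f"Restored {dst} from snapshot {SNAPSHOT}")
```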
No retention policy
That said, while we have snapshots and use them extensively, we still do nightly incrementals with NetBackup to tape or Data Domain. The reason is that snapshots cannot reliably uphold a retention policy. If you tell users they will be able to restore at a daily granularity for a week and then at a weekly granularity for a month, you can't keep that promise with snapshots.
On a NetApp volume with snapshots, deleted data contained in a snapshot occupies "snap reserve" space. If the volume isn't full and you've configured it this way, snapshots can also push past that reserve and occupy some of the unused data space. If the volume fills up, though, all snapshots except those supported entirely by data in the reserved space will be deleted. ONTAP decides which snapshots to delete based only on available snapshot space; if that means deleting snapshots your retention policy requires, it will.
Consider this situation:
- A full volume with regular snapshots and a 2 week retention requirement.
- Assume half of the snapshot reserve is in use, based on the normal rate of change.
- Someone deletes a large amount of data (more than the snapshot reserve), temporarily but drastically increasing the rate of change.
At this point, your snapshot reserve is completely used, along with as much of the free data space as you've allowed ONTAP to use for snapshots, but you haven't lost any snapshots yet. As soon as someone fills the volume back up with data, though, you'll lose all the snapshots held in the data section, which pushes your recovery point back to just after the large deletion.
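A rough back-of-the-envelope model of that scenario, with entirely made-up numbers, just to show where the at-risk snapshot blocks end up:

```python
# Rough model of the scenario above. All numbers are hypothetical;
# the point is how the recovery point moves when the volume refills.

volume_size  = 1000   # GB of data space
snap_reserve = 200    # GB reserved for snapshots (separate from data space)

snaps_in_reserve = 100   # normal rate of change: reserve is half full
big_delete       = 300   # GB deleted by a user (more than the reserve)

# The deleted blocks are still held by snapshots: the reserve fills first,
# then the overflow lands in unused data space (if the volume allows it).
overflow_into_data = max(0, snaps_in_reserve + big_delete - snap_reserve)  # 200 GB

# Nothing is lost yet. But once users write enough new data to fill the
# volume, ONTAP reclaims the overflow by deleting the snapshots that hold
# it, regardless of their age or your retention policy.
print(f"Snapshot blocks sitting in data space, at risk: {overflow_into_data} GB")
```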
Summary
NetApp snapshots don't cover you against real data loss. An errantly deleted volume or data loss on the filer will require you to rebuild the data.
They are a very simple and elegant way to handle routine restores, but they aren't reliable enough to replace a real backup solution. Most of the time, they'll make routine restores simple and painless, but when they're not available, you are exposed.
They are a backup, yes. I've personally used them in place of daily incrementals before, but we still did weekly fulls to tape.
They protect quite well against user or admin errors and problems that originate outside the NetApp itself (i.e., on the systems accessing the volumes).
They do not protect from catastrophic hardware failures of the NetApp itself. My understanding is that SnapMirror does copy all of the data (in the snapshot) to the other filer[1], so SnapMirroring to another filer should protect that dataset from catastrophic failure of a single filer.
The one major problem, of course, is that if somebody managing the NetApp deletes the volume, then all the snapshots go with it. SnapMirror to another filer should adequately protect against that.
If all your NetApp filers are in the same data center, then you don't have anything covering a major disaster, the way that tape backups shipped offsite would give you.
You'll get better backups of your VMs and any databases (or database-like things) if you use the appropriate SnapManager agent, which will coordinate quiescing the data briefly as the snapshot is taken. If a given VM and its data are contained entirely within a single NetApp volume, then the snapshot of that VM should be crash-consistent. That is, it should be just as good as if you pulled the plug on a server and imaged the drive, which typically means filesystem checks and the database equivalents. If a database's data is split across LUNs, it seems like there's a significant risk of data corruption.
If it were me, I'd set up all databases to do regular backups to local disk, and set those jobs to keep a copy or two. That gives you a much better guarantee of recoverability.
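A minimal sketch of what I mean (the database name, paths, and use of pg_dump here are purely illustrative; any native dump tool fits the same pattern):

```python
#!/usr/bin/env python3
"""Sketch of a nightly dump-to-local-disk job with simple rotation.

The dump command and paths are hypothetical placeholders; substitute
whatever your database's native backup tool is (mysqldump, pg_dump,
BACKUP DATABASE, etc.).
"""
import subprocess
from datetime import datetime
from pathlib import Path

BACKUP_DIR = Path("/var/backups/appdb")   # local disk, also captured by snapshots
KEEP = 2                                  # "a copy or two"

BACKUP_DIR.mkdir(parents=True, exist_ok=True)
dump_file = BACKUP_DIR / f"appdb-{datetime.now():%Y%m%d-%H%M}.sql"

# Hypothetical dump command; the key point is that the backup tool writes a
# transaction-consistent copy, which a storage snapshot then picks up safely.
subprocess.run(["pg_dump", "--file", str(dump_file), "appdb"], check=True)

# Keep only the newest KEEP dumps.
dumps = sorted(BACKUP_DIR.glob("appdb-*.sql"), reverse=True)
for old in dumps[KEEP:]:
    old.unlink()
```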
[1] http://www.netapp.com/us/system/pdf-reader.aspx?m=snapmirror.pdf&cc=us
You should go read @Basil's excellent answer right now, but here are my two cents:
Snapshots are not application aware
Just because you take a snapshot of the underlying storage volume does not mean the data on that volume is recoverable. MS SQL is a great example of this: you need to make sure your database is transaction-consistent before you snapshot the storage it is using, otherwise, as @freiheit mentioned, you are no better off than recovering from a hard-down failure. DBAs love to use different LUNs for different parts of SQL to better utilize the storage system: temp databases on fast storage, system databases on slower storage, read-only or archived data on bulk storage, and working data somewhere in between. If you are just snapshotting those volumes independently, it is highly unlikely you will be able to recover your database.
NetApp supplies a number of Snap tools to make snapshots application aware. SnapManager for SQL provides that awareness. In the Microsoft ecosystem I believe there are also SnapManager tools for Exchange and SharePoint. SnapDrive does not have this application awareness. It just provides a convenient method to manage storage within the guest.
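For the sake of illustration, here is a rough sketch of the quiesce/snapshot/resume pattern those tools automate. The freeze statement, the connection object, and trigger_storage_snapshot() are hypothetical stand-ins, not NetApp or SQL Server APIs:

```python
"""Sketch of the quiesce -> snapshot -> resume pattern that application-aware
tools like SnapManager automate. Everything here is a placeholder: the real
mechanism (VSS, SnapManager, an ONTAP management call) depends on your stack.
"""
from contextlib import contextmanager

def trigger_storage_snapshot(volume: str) -> None:
    # Placeholder: in practice this would be a call to the storage system
    # or to a SnapManager-style agent.
    print(f"snapshot requested for {volume}")

@contextmanager
def quiesced(db_conn):
    """Hold the database in a consistent, write-suspended state."""
    db_conn.execute("FLUSH TABLES WITH READ LOCK")   # hypothetical freeze step
    try:
        yield
    finally:
        db_conn.execute("UNLOCK TABLES")             # resume writes

def application_consistent_snapshot(db_conn, volumes):
    with quiesced(db_conn):
        # Every volume the database touches must be captured while frozen;
        # snapshotting only some of them is what breaks recoverability.
        for vol in volumes:
            trigger_storage_snapshot(vol)
```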
If you are storing all your IIS data and configuration on LUNs and snapshotting those LUNs directly, you cannot guarantee that data is recoverable. Ask me how I know...
Multiple storage types can have different snapshots schedules
If you are presenting storage to your servers in different ways, this can complicate your snapshot and recovery picture. NetApp's ONTAP is a multi-protocol offering, and it is very possible you are using more than one method or storage type for a particular server. In our shop, some of our servers get their C:\ drive over an NFS-based datastore and their "storage" drives over Raw Device Mapped LUNs. We were taking snapshots of the RDM LUNs but not the NFS-based datastores. This made recovering the server difficult.
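One way to catch that kind of gap is a simple cross-check of your inventory; the mappings below are hypothetical data you would pull from vCenter and the filer:

```python
"""Sketch of a sanity check: does every datastore/LUN backing a server
actually have a snapshot schedule? Both mappings are hypothetical."""

# Which storage each server actually touches (hypothetical inventory).
server_storage = {
    "web01": ["nfs_datastore_os", "rdm_lun_data01"],
    "db01":  ["nfs_datastore_os", "rdm_lun_sql_data", "rdm_lun_sql_logs"],
}

# Which volumes/LUNs have a snapshot schedule configured (hypothetical).
snapshotted = {"rdm_lun_data01", "rdm_lun_sql_data", "rdm_lun_sql_logs"}

for server, volumes in server_storage.items():
    missing = [v for v in volumes if v not in snapshotted]
    if missing:
        # In our case the NFS datastores holding the C:\ drives were the gap.
        print(f"{server}: no snapshots for {', '.join(missing)}")
```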
Snapshots do not have a guaranteed retention policy
Again, @Basil really covers this well, but it's worth reiterating. It is possible to fill up your Snap Reserve in such a way that Snapshot Autodelete removes snapshots that have not naturally aged to deletion. This can be really bad if you or your customers are expecting three weeks of snapshots to be available.
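A hedged sketch of how you might notice this from an NFS client before a customer does (the mount point and the three-week promise are hypothetical, and directory mtimes only approximate snapshot creation times):

```python
#!/usr/bin/env python3
"""Sketch of a retention check run from an NFS client: warn if autodelete
has eaten snapshots you promised to keep."""
from datetime import datetime, timedelta
from pathlib import Path

SNAPDIR = Path("/mnt/projects/.snapshot")   # hypothetical NFS mount
PROMISED = timedelta(weeks=3)               # what the customers were told

snapshots = sorted(SNAPDIR.iterdir(), key=lambda p: p.stat().st_mtime)
if not snapshots:
    print("No snapshots at all!")
else:
    oldest = datetime.fromtimestamp(snapshots[0].stat().st_mtime)
    if datetime.now() - oldest < PROMISED:
        print(f"Oldest snapshot is only from {oldest:%Y-%m-%d}; "
              f"the {PROMISED.days}-day retention promise is not being met.")
```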
Snapshots are inline
This is the drawback of integrated storage... it's, well... integrated. Your snapshots reside on the same platform you are backing up. If the volume or the filer it is on disappears, so does your backup. You can mitigate this somewhat by replicating the snapshots to another filer with SnapMirror; I erroneously stated in my question that the SnapMirror copy is not a full copy.
Snapshots enable bad operational practices to continue
One thing that I have noticed is that snapshots enable managers and customers to continue terrible operational behavior. In our environment we have very poor documentation and configuration management practices. This means that most servers start with the same base (a template or an image) but are then configured manually by different groups of people. As they continue their life, the servers diverge further and further from the template in ways that are generally not documented or captured in configuration management.
And then come snapshots! We don't need to step back and address some of our fundamental operational practices because we can just snapshot all our servers! And we can use SnapMirror to move those snapshots off-site so we can use them as backups!
I think this is the wrong lesson to learn here. A better lesson is that the configuration management framework, even if it is as simple as a changelog, should be backed up for the purposes of bare-metal restore. Snapshots are a fantastic tool, but I can see there is a temptation to be overly reliant on them to the detriment of important fundamentals.