Is there a good way to back up a petabyte of data and store it?

I'm starting to see clients with hundreds of terabytes of data (in SQL Server installations). As the total volume of data in some enterprises approaches meaningful fractions of a petabyte, I'd like to canvass the collective knowledge base out there to see what people dealing with that magnitude of data are doing to safeguard it.

The obvious issue is that storing multiple backups of that much data is prohibitively expensive using enterprise-class storage - heck, even just RAID-5.

Options I see are as follows:

  1. Create a mirror copy of the data in another data-center, and continually ship differences to it (using whatever mechanism is available for your data source - e.g. log-shipping or database mirroring with SQL Server)
  2. Take regular backups using a hefty compression algorithm (probably only suitable if the data lends itself well to being heavily compressed)
  3. Take piecemeal backups of the critical/changing parts of the data (see the sketch of #2 and #3 just after this list).
  4. Don't back up the data and trust the corruption-gods.
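
For concreteness, here is a minimal T-SQL sketch of what #2 and #3 can look like in SQL Server (native backup compression needs SQL Server 2008 or later); the database name and paths are hypothetical:

```sql
-- Hypothetical database and path names throughout.

-- Option #2: full backup using SQL Server's native backup compression.
BACKUP DATABASE BigDB
TO DISK = 'X:\Backups\BigDB_Full.bak'
WITH COMPRESSION, CHECKSUM, STATS = 5;

-- Option #3: piecemeal/partial backup - back up only the read/write
-- filegroups (the changing data), assuming the historical filegroups
-- have been marked read-only and backed up once already.
BACKUP DATABASE BigDB READ_WRITE_FILEGROUPS
TO DISK = 'X:\Backups\BigDB_ReadWriteFGs.bak'
WITH COMPRESSION, CHECKSUM, STATS = 5;
```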

I'm seeing option #4 being adopted as the default, and as an HA/DR expert I find that really scary, but what do I advise as an alternative? I think #1 is the best approach, but "I don't think so" is the usual answer when anything other than #4 (and possibly #3) is suggested.

Now, of course it depends on the change rate and criticality of the data. No need to answer with that, as I used to be responsible for all of SQL Server's HA features while I worked at Microsoft, so I'm well-versed in the 'it depends' arguments - that's my catch-phrase :-)

I'd be very interested to hear of any alternatives I've missed, or to hear that everyone else is in the same boat and there's no realistic alternative to spending lots of money on more storage.

Thanks in advance - due credit will be given to all well-thought-out and well-expressed answers.


Solution 1:

Off-the-wall idea - is all of the stored information needed or even useful?

How much is the information actually worth? It seems obviously ridiculous to spend more in upkeep and management than the data is worth.

Is the data in the database appropriate for storage in a database? For example, does keeping compressed multi-gigabyte core files in the support organization's database really provide any actual benefit?

Is there a lot of duplicated data in the database? For example, are a thousand people keeping ten copies each of a weekly 10MB newsletter?
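
If you want to quantify that kind of duplication, a rough query along these lines can show how much space repeated content is costing. The schema is hypothetical, and it assumes SQL Server 2016+ so HASHBYTES accepts inputs larger than 8000 bytes:

```sql
-- Hypothetical table/column names; groups stored content by hash to find
-- items that exist more than once and how many bytes they consume.
SELECT HASHBYTES('SHA2_256', Content) AS ContentHash,
       COUNT(*)                       AS Copies,
       SUM(DATALENGTH(Content))       AS TotalBytes
FROM dbo.Attachments
GROUP BY HASHBYTES('SHA2_256', Content)
HAVING COUNT(*) > 1
ORDER BY TotalBytes DESC;
```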

Does some of the data have an "expiration date" after which it does not provide any value? Returning to the support organization example, for various reasons there is virtually no benefit in keeping around customer core files more than a few months after a fix has been delivered.
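
If some of the data does expire, a simple batched purge (again, hypothetical table and column names) keeps it from accumulating without hammering the transaction log in one giant delete:

```sql
-- Hypothetical names; deletes expired rows in small batches so the purge
-- doesn't hold long locks or generate one enormous log transaction.
DECLARE @rows int = 1;
WHILE @rows > 0
BEGIN
    DELETE TOP (5000)
    FROM dbo.CustomerCoreFiles
    WHERE FixDeliveredDate < DATEADD(MONTH, -6, GETDATE());

    SET @rows = @@ROWCOUNT;
END;
```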

Another thought - is keeping that much data opening the company up to liability? Some data one must, by law, keep. Some data, however, should be "shredded" because of the risks posed if it is accidentally, or maliciously, released to inappropriate parties.

Solution 2:

Yeah, another option is storage virtualization: a device that sits between your servers and the SAN, like IBM's SVC. The SVC manages SAN-to-SAN copies and can do remote replication (although that's obviously pretty painful at the petabyte level unless you have really low data change rates and really high bandwidth).

The slick part is that the whole process is invisible to the servers involved. If you're using SQL Server, you design your filegroups to keep things with a low change rate together (like sales archives from >3 years ago) and things with a high change rate (like current sales) on a separate filegroup. They don't even have to be completely read-only - you just want to design it so that you can use different replication methods for each filegroup. The SAN gear can sync LUNs via the network, via tape, or via the SANs themselves - meaning you can ship parts of the SAN back and forth. This is more effective with gear like LeftHand's, where the SAN is made up of a pool of participating units.
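
As a rough illustration of that filegroup split (all names hypothetical, and assuming the archive table already has a clustered index to rebuild), something like this puts the cold data and the hot data on storage you can replicate differently:

```sql
-- Hypothetical names; separate low- and high-change-rate data onto their
-- own filegroups so each can sit on different LUNs / replication schemes.
ALTER DATABASE Sales ADD FILEGROUP ArchiveFG;   -- low change rate (old sales)
ALTER DATABASE Sales ADD FILEGROUP CurrentFG;   -- high change rate (current sales)

ALTER DATABASE Sales ADD FILE
    (NAME = SalesArchiveData, FILENAME = 'E:\SQLData\Sales_Archive.ndf', SIZE = 100GB)
TO FILEGROUP ArchiveFG;

ALTER DATABASE Sales ADD FILE
    (NAME = SalesCurrentData, FILENAME = 'F:\SQLData\Sales_Current.ndf', SIZE = 100GB)
TO FILEGROUP CurrentFG;

-- Move the archive table onto the archive filegroup by rebuilding its
-- clustered index there (assumes an existing clustered index of this name).
CREATE CLUSTERED INDEX CIX_SalesArchive
ON dbo.SalesArchive (SaleDate)
WITH (DROP_EXISTING = ON)
ON ArchiveFG;
```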

Then you can sync the low change rate stuff over the wire automatically, and sync the high change rate stuff with sneakernet. (Sounds like I've got that backwards, but it's true - you can't sync the high change rate stuff over the wire due to volume.)

Even some of the low-end gear accommodates this now: LeftHand lets you replicate to other LeftHand units in your datacenter, and then ship them to your offsite datacenter. Plug 'em in, join them to the remote side by changing IPs and groups, and now they're part of your remote backup SAN. The LeftHand sales pitch on this is just brilliant: set up your two SANs side-by-side in your primary datacenter, get them in sync, then ship parts of them over to the remote datacenter while some of them stay in your current datacenter to keep in sync. Gradually move 'em over without getting way out of sync.

I haven't done this at the petabyte level, though. You know what they say - in theory, in theory and in practice are the same. In practice...

Solution 3:

Option 1 is mirroring, which is almost as bad as #4: any bug that corrupts data and isn't discovered immediately will corrupt both copies.
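
One hedge against that, whichever option you pick, is regular integrity checks and checksum-verified backups, so corruption gets noticed before it has been faithfully replicated everywhere. A minimal sketch with hypothetical names:

```sql
-- Hypothetical names; catch corruption early rather than after it has
-- already been mirrored to the second copy.
DBCC CHECKDB (BigDB) WITH NO_INFOMSGS;   -- logical/physical integrity check

BACKUP DATABASE BigDB
TO DISK = 'X:\Backups\BigDB_Full.bak'
WITH CHECKSUM, COMPRESSION;              -- verify page checksums while backing up

RESTORE VERIFYONLY
FROM DISK = 'X:\Backups\BigDB_Full.bak'
WITH CHECKSUM;                           -- confirm the backup itself is readable
```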

If the data is critical, consider dedicated solutions; read about IBM's Shark products, for example, or competing products from EMC, etc. They have features like FlashCopy, which lets you instantly create a logical copy of the data without doubling the disk requirements; you can then back that copy up to (e.g.) tape. Look into robotic tape backup as well.
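
For what it's worth, SQL Server has a loosely analogous copy-on-write mechanism at the database layer - database snapshots - which give you an instant, readable, point-in-time copy without duplicating all the data (not a substitute for SAN-level FlashCopy or for real backups, but handy for checks and reporting). A sketch, assuming a source database with a single data file and the hypothetical logical name shown:

```sql
-- Hypothetical names; creates a copy-on-write snapshot of BigDB. Only pages
-- changed after this point consume extra space in the sparse .ss file.
CREATE DATABASE BigDB_Snapshot
ON (NAME = BigDB_Data, FILENAME = 'X:\Snapshots\BigDB_Snapshot.ss')
AS SNAPSHOT OF BigDB;
```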