How do large companies back up their data?

It depends on what your purpose is.

If you're looking for backups for disaster recovery (server exploded, datacentre burnt down, etc.) then the short answer is that they may not do backups at all. We have a client who deals in sensitive government data, and part of their mandate is that we are not permitted to do offline backups or backups onto removable media. We are permitted live replication to a DR site, and that's it. Both sites are covered by the same level of physical and logical security. The catch here is that if I screw something up on Site A, it's replicated to Site B almost instantly.
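
To make that concrete (a toy sketch; the site names and data are made up), the problem is that replication faithfully mirrors destructive operations too:

```python
# Why live replication is not a backup: destructive operations
# replicate just as faithfully as good ones. Names are illustrative.
site_a = {"customers": ["alice", "bob"]}
site_b = dict(site_a)  # the DR site, kept in sync by replication

def replicate(op):
    """Apply an operation to Site A, then mirror it to Site B."""
    op(site_a)
    op(site_b)  # near-instant replication to the DR site

# An accidental drop on Site A is immediately mirrored to Site B:
replicate(lambda db: db.pop("customers"))
print(site_b)  # {} -- the DR copy is gone too
```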

If you're talking about backups from a data integrity point of view (e.g. you accidentally dropped the Customers table and it's already replicated to the DR site), then LTO-5 tapes in a big tape library are often the go. With up to 3 TB per tape (1.5 TB native; 3 TB assumes 2:1 compression), and multiple tapes in a tape library, you can quickly back up vast amounts of data (quick here refers to throughput; it may still take many, many hours to back up 25 TB of data).
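
As a rough sanity check on those hours (a back-of-envelope sketch; the 140 MB/s figure is LTO-5's quoted native drive speed, and real-world runs will differ):

```python
# Back-of-envelope: how long does backing up 25 TB take on LTO-5?
# Assumes ~140 MB/s native drive throughput and ignores tape changes,
# verify passes, and source-side bottlenecks.

DATA_TB = 25
DRIVE_MB_PER_S = 140  # LTO-5 native transfer rate

total_mb = DATA_TB * 1024 * 1024
hours = total_mb / DRIVE_MB_PER_S / 3600
print(f"one drive: about {hours:.0f} hours")                    # ~52 hours
print(f"four drives in parallel: about {hours / 4:.0f} hours")  # ~13 hours
```

This is also why big libraries have multiple drives: the library streams different tape sets in parallel to bring the wall-clock time down.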

Any decent backup suite will do heavy compression and de-duplication, which vastly reduces the amount of storage space required. I once saw an estimate for a compressed and de-duped Exchange backup tool that claimed a 15:1 ratio (15 GB of data stored in 1 GB of backups).
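
To illustrate how de-duplication earns those ratios (a minimal sketch of fixed-block de-duplication, not any particular vendor's implementation):

```python
import hashlib

def dedupe(data: bytes, block_size: int = 4096):
    """Toy block-level de-duplication: store each unique block only once."""
    store = {}    # block digest -> block contents
    recipe = []   # ordered digests needed to reconstruct the original data
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)
        recipe.append(digest)
    return store, recipe

# A contrived, highly repetitive input (think a mailbox full of the
# same newsletter) dedupes down to a handful of unique blocks:
data = b"the same old email body, again! " * 100_000
store, recipe = dedupe(data)
ratio = len(data) / sum(len(b) for b in store.values())
print(f"unique blocks: {len(store)}, dedupe ratio roughly {ratio:.0f}:1")
```

Real products use variable-size chunking and smarter indexing, but the principle is the same: repeated blocks are stored once and referenced many times.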

I very much doubt Google bother with backups for a lot of their search engine data, because most of it is replaceable, and it's distributed so far and wide that if they lose a significant portion of a datacentre, or perhaps even an entire one, the system stays online thanks to failover BGP routes.


Actually, it looks like Google do back up a metric crap-ton of data onto tape, which isn't quite what I was expecting:

[Image: part of the Google tape library]


Most of their data is stored on their own Google File System (GFS), which requires at least three replicas of every 64 MB chunk that makes up a file (GFS stores files as 64 MB chunks). That said, I don't think they bother with backups of that data, as they have at least three copies of every file, and chunks on a failing node can be quickly replaced by simply replicating the data from either of the two remaining good copies to a new node.
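
As a toy sketch of that re-replication logic (my own illustration, not Google's actual code; the chunk IDs and node names are made up):

```python
import random

CHUNK_MB = 64  # GFS stores files as fixed-size 64 MB chunks
REPLICAS = 3   # GFS keeps at least three replicas of each chunk

# chunk id -> set of nodes currently holding a replica
placement = {
    "chunk-001": {"node-a", "node-b", "node-c"},
    "chunk-002": {"node-a", "node-c", "node-d"},
}
all_nodes = {"node-a", "node-b", "node-c", "node-d", "node-e"}

def handle_node_failure(failed: str) -> None:
    """On node loss, re-replicate under-replicated chunks from survivors."""
    for chunk, nodes in placement.items():
        nodes.discard(failed)
        while len(nodes) < REPLICAS:
            # Copy from any surviving replica to a fresh node.
            target = random.choice(sorted(all_nodes - nodes - {failed}))
            nodes.add(target)
            print(f"{chunk}: re-replicated to {target}")

handle_node_failure("node-a")
```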

For more information, take a look at http://labs.google.com/papers/gfs.html