What's needed for a complete backup system?

We build backup systems for one purpose: to enable restores. Nobody cares about backups; they care about restores.

There are three reasons one might need to restore files: accidental deletion, hardware failure, or archival/legal requirements. A "complete" backup system enables you to restore files in all three of these scenarios.

For accidental file deletion, things like Dropbox and RAID fail because they simply mirror every change made to the filesystem, so a deleted file is just as gone on the mirror. Your backup system should be able to restore a file to a recent point in time fairly quickly; preferably the restore completes within seconds to minutes.

For hardware failure, you should use solutions such as RAID and other high-availability approaches when possible to ensure that your service remains up and running, as a full restore of a system can take hours or possibly days due to the necessity of reading and writing to (relatively) slow media.

Finally, archives (full backups, or equivalent, of your systems at a specific point in time) can serve restores in both legal and disaster recovery scenarios. These would typically be stored off-site, in case a stray meteor turns your data center into a smoking crater...

Your complete backup system should be able to support restores in all three of these scenarios, with varying levels of service (SLAs). For instance, you may decide that a deleted file can be restored with one-business-day granularity for the last six months and one-month granularity for the last three years, and that a failed disk must be restorable within four hours with no more than two business days of data loss. The backup system must be able to implement the SLA as a backup schedule.
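
To make this concrete, here's a minimal sketch in Python of how an SLA like the example above could translate into a retention rule. The window lengths and function names are illustrative assumptions, not from any particular tool:

```python
# Minimal sketch: turning the example SLA above into a retention rule.
from datetime import datetime, timedelta

def snapshots_to_keep(snapshots, now=None):
    """Keep one snapshot per day for ~6 months, then one per month
    for ~3 years; everything else is eligible for expiry."""
    now = now or datetime.now()
    keep, seen_days, seen_months = set(), set(), set()
    for snap in sorted(snapshots, reverse=True):   # newest first
        age = now - snap
        if age <= timedelta(days=182):             # daily granularity
            if snap.date() not in seen_days:
                seen_days.add(snap.date())
                keep.add(snap)
        elif age <= timedelta(days=3 * 365):       # monthly granularity
            month = (snap.year, snap.month)
            if month not in seen_months:
                seen_months.add(month)
                keep.add(snap)
    return keep
```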

Your backup system must be fully automated. This cannot be stressed enough. If the backups aren't fully automated, they simply won't happen. Your backup system must be capable of fully automated backups, out of the box, with little or no special configuration or scripting required.
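
As a rough illustration of what "fully automated" means in practice, here's a sketch of a cron-driven backup wrapper that alerts on failure. The rsync command, alert address, and local MTA are all assumptions; substitute whatever your environment actually uses:

```python
# Sketch of an unattended backup wrapper meant to run from cron.
import subprocess
import smtplib
import sys
from email.message import EmailMessage

BACKUP_CMD = ["rsync", "-a", "--delete", "/data/", "backuphost:/backups/data/"]
ALERT_TO = "ops@example.com"   # hypothetical address

def alert(subject, body):
    msg = EmailMessage()
    msg["Subject"], msg["From"], msg["To"] = subject, ALERT_TO, ALERT_TO
    msg.set_content(body)
    with smtplib.SMTP("localhost") as smtp:   # assumes a local MTA
        smtp.send_message(msg)

result = subprocess.run(BACKUP_CMD, capture_output=True, text=True)
if result.returncode != 0:
    # A backup that fails silently is a backup that stops happening.
    alert("Backup FAILED", result.stderr)
    sys.exit(1)
```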

You must periodically test restores. Any backup system is utterly useless if restoring from backup fails to work. I think most of us have horror stories along these lines. Your backup system must be able to restore single files or whole systems within the SLA you're implementing.
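
The restore test itself can be automated. The sketch below pulls one file back from backup into a scratch directory and compares checksums; the restore command is a placeholder for whatever your backup tool actually provides, and it assumes the live file hasn't changed since the last backup ran:

```python
# Sketch of an automated single-file restore test.
import hashlib
import pathlib
import shlex
import subprocess
import tempfile

def sha256(path):
    return hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()

def test_restore(live_file, restore_cmd_template):
    with tempfile.TemporaryDirectory() as tmp:
        restored = pathlib.Path(tmp) / "restored"
        # e.g. "my-backup-tool restore {src} {dst}"  (hypothetical tool)
        cmd = shlex.split(restore_cmd_template.format(src=live_file, dst=restored))
        subprocess.run(cmd, check=True)
        assert sha256(live_file) == sha256(restored), "restore mismatch!"
```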

You must purchase backup media on an ongoing basis. Whether you're just doing on-site tape backup or going whole hog with off-site cloud backup, make sure you have it in the budget to pay for the gigabytes (or terabytes!) of space you will need.


This has been a very brief summary of a portion of Chapter 26 of The Practice of System and Network Administration, Second Edition, which anyone who is or aspires to be a system administrator should own, read, and memorize.

I've glossed over a lot of things that don't necessarily apply to your particular situation or that don't make sense in a small environment such as the one you've described. Nevertheless it should be a reasonable description of the features that your "complete" backup system should have, as well as why they're necessary.


  1. Dropbox would be a risky way of doing backups. There's no SLA or QoS, and dumping that much data to their servers in an automated fashion is probably against their standard TOS. They specifically disclaim any liability in accessing your data - they may cut off access, destroy data, or go bankrupt at their own discretion and without warning.
  2. No backup procedure is "valid" until you've actually restored from it; that's the only way to be sure. Most backup software provides a "validate" feature, but this is worse than useless for most people: it only validates that something was written to the backup medium, not that what was written is actually useful for restoring an operational system.
  3. Relentlessly complete documentation ensures you'll be able to follow the restore procedures when disaster does strike - testing the documentation should be part of testing the restoration of your system. It also ensures that someone else can complete the procedures should you get hit by a bus (Murphy's Law and all that).
  4. Restoration is only useful if it can be accomplished in a meaningful time period. E.g., if it took a year to restore your data, that would be useless. Determine what time frames your situation requires for three levels of functionality: minimal functionality, daily operations, and complete restoration. Then test your proposed solution and see if it fits those time requirements (a rough sketch of such a timing test follows this list).
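
For point 4, the timing test can be as simple as wrapping the restore and comparing wall-clock time against your chosen targets. The target numbers below are made-up examples, not recommendations:

```python
# Illustrative timing test: run a test restore and compare the elapsed
# time against the recovery targets you've chosen for each level.
import subprocess
import time

TARGET_HOURS = {
    "minimal functionality": 4,
    "daily operations": 24,
    "complete": 72,
}

def timed_restore(restore_cmd):
    start = time.monotonic()
    subprocess.run(restore_cmd, check=True)
    hours = (time.monotonic() - start) / 3600
    for level, limit in TARGET_HOURS.items():
        status = "OK" if hours <= limit else "TOO SLOW"
        print(f"{level}: {hours:.1f}h taken, {limit}h allowed -> {status}")
```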

Naw. You're good for now. At least with the concepts...

  • Think about the state of your system at the time of your backups. Perhaps you don't want to back up a live database...
  • Or think about your hardware. Are you doing everything you can to make the machine as resilient as possible? For instance, I want restoring from a backup to be the LAST thing I have to do in an emergency situation.
  • Outages, large and small, can be reduced by using quality hardware, so make sure you're using RAID and server-class equipment, and look at a more local approach to data protection.
  • Think about the types of failures and situations you're protecting against.
  • I wouldn't necessarily use Dropbox, but the idea of off-site protection is correct.

My preferred, tried and true backup system is:

  1. Hourly snapshots of all databases, with one snapshot archived per day for two weeks and one archived per week for a year (a sketch of such a snapshot job follows this list).
  2. Disposable servers. That is, every server's build configuration is stored in git and deployed automatically (very similar to what you're describing with Puppet; our preferred tool is Chef, though). Essentially, a new server can be stood up from scratch using only the code you have in git, meaning any development hosts are built in the same fashion as your production servers.
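
For point 1, here's a rough sketch of what the hourly snapshot job might look like, assuming PostgreSQL with pg_dump on the PATH. The database name and destination path are placeholders, and the daily/weekly archiving would be handled by a companion expiry job:

```python
# Sketch of an hourly database snapshot job (assumptions noted above).
import datetime
import pathlib
import subprocess

SNAP_DIR = pathlib.Path("/backups/db")   # hypothetical destination
DB_NAME = "appdb"                        # hypothetical database

def take_snapshot():
    SNAP_DIR.mkdir(parents=True, exist_ok=True)
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M")
    out = SNAP_DIR / f"{DB_NAME}-{stamp}.sql.gz"
    with open(out, "wb") as f:
        # pg_dump gives a consistent snapshot without taking the DB offline
        dump = subprocess.Popen(["pg_dump", DB_NAME], stdout=subprocess.PIPE)
        subprocess.run(["gzip"], stdin=dump.stdout, stdout=f, check=True)
        dump.stdout.close()
        if dump.wait() != 0:
            out.unlink()                 # don't keep a broken snapshot
            raise RuntimeError("pg_dump failed")
```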

The Puppet master or Chef server in these cases can be a potential point of failure; again, automate rebuilding them as much as possible, and have scripts on hand that let existing nodes bootstrap to a new configuration-management host as quickly as possible in the event that the old box is knocked over. I've found it can sometimes take significantly longer to rebuild this sort of host from a backup than to stand a new one up from scratch (and restoring from backups can unintentionally reintroduce the same flaws or issues that caused it to go down in the first place).

In a different vein, if you have more than a couple of servers, it's well worth the investment to set up a central log server. If logs are housed (and backed up) in one place, you avoid the headache of logs piling up and eating disk space on every other host. Log data is gold, but if I have 20 API servers all serving traffic and I get hit with something like a DDoS, not having my logs aggregated means I'm looking for a needle in a haystack. If you're going to store your infrastructure logs (and you should!), then store them once, on one robust backup platform.
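
As a small illustration, an application can ship its logs to a central syslog host in a few lines of Python; "loghost.example.com" is a placeholder, and real deployments usually do the forwarding with rsyslog or syslog-ng instead:

```python
# Minimal sketch: shipping application logs to a central syslog host.
import logging
import logging.handlers

handler = logging.handlers.SysLogHandler(address=("loghost.example.com", 514))
handler.setFormatter(logging.Formatter("api-server: %(levelname)s %(message)s"))

log = logging.getLogger("api")
log.setLevel(logging.INFO)
log.addHandler(handler)

log.info("request served")   # lands on the central host, not local disk
```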

G'luck~!


RAID, & services like dropbox "back up" all your changes. Including the mistakes you'd want to recover from by using a backup.

This is why all us sysadmin types get very, very antsy about pointing out that things like RAID, or toytown cloud file-storage services that rely on mirroring changes to your files as they happen, are not backups. That's not to say these services aren't useful - they are - but they're not backups, because they don't really give you data integrity.

A backup should be a snapshot of how things were at the time the backup was taken, not a continually overwritten live log of all the good and bad things that happen to your data as they happen. There are cloud providers that will give you actual backups if you look, and they work differently from Dropbox/SkyDrive-type services.

When it comes down to it, it's your choice what kinds of risk you're willing to expose yourself to versus your budget for mitigating those risks. If you think something like Dropbox is good enough, that's up to you. But be clear about what it will and will not do for you - please don't kid yourself that it's a real backup.