Solution 1:

Run fire drills. Every couple of months it is a good idea to declare that the XYZ system is down, then actually go through the motions of bringing it back online on a new VM and so on. It keeps things honest and helps you catch mistakes.

Solution 2:

soapbox mode: ON

I would say it's as simple as this: backups that aren't tested regularly are worthless.

At my previous job we had a policy that every system (production, test, development, monitoring, etc.) should be test-restored every 6 months.

This was also the job of the most junior admin, which kept the documentation up to date. "Junior" was defined by how much work he/she had done on the specific system, so sometimes (quite often, actually) it was the "group manager" who did it.

We had special hardware dedicated to this (one Intel and one IBM/AIX box) that was low-spec for everything but disk space, since we did not need to run anything real on the restored host.

It was quite a lot of work the first couple of rounds, but it led us to streamline the restore process, which is the important part of a backup.

Solution 3:

Since you seem to be referring to the administrator not noticing that the backup job has broken, and not so much to a backup that ran but did not actually restore correctly, I would suggest building some monitoring scripts around the backups.

When building a home-grown backup solution, I would do something like this:

  • Build a script to back up your data.
  • Perform test restoration to ensure the script works correctly.
  • In the script, or via some other means, implement a way to track the status of the backups (success, failure, ran, did not run).
  • Have that tracking status monitored (email, database, something); a minimal sketch of this is shown after this list.
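
Here is a minimal sketch of what that could look like in Python, assuming a plain tar archive of a single directory and a local mail relay; the paths, addresses, and status-file location are placeholders, not a recommendation for any particular tool:

```python
#!/usr/bin/env python3
"""Backup-with-status sketch: the paths, recipient address, and status file
are placeholders, not anyone's real configuration."""

import smtplib
import subprocess
import time
from email.message import EmailMessage

SOURCE = "/var/www"                               # hypothetical directory to back up
DEST = "/backups/www-%s.tar.gz" % time.strftime("%Y%m%d")
STATUS_FILE = "/var/log/backup-status.log"        # where each run records its result
ALERT_TO = "ops@example.com"                      # placeholder address

def record_and_alert(status):
    """Append a timestamped status line and mail it, so failures are visible."""
    line = "%s backup %s -> %s" % (time.strftime("%Y-%m-%d %H:%M:%S"), status, DEST)
    with open(STATUS_FILE, "a") as f:
        f.write(line + "\n")
    msg = EmailMessage()
    msg["Subject"] = "backup %s" % status
    msg["From"] = "backup@example.com"
    msg["To"] = ALERT_TO
    msg.set_content(line)
    with smtplib.SMTP("localhost") as smtp:       # assumes a local mail relay
        smtp.send_message(msg)

def main():
    result = subprocess.run(["tar", "czf", DEST, SOURCE])
    record_and_alert("success" if result.returncode == 0 else "FAILURE")

if __name__ == "__main__":
    main()
```

The exact backup command matters much less than the fact that every run leaves a timestamped record and a notification behind, so "did not run" shows up as a missing entry for the day rather than as silence nobody notices.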

Once all of that is done, you should be fine. One extra thing to do would be to perform regular test restores -- if you have extra hardware to donate to the cause, that is.

Where I work we have a warm site. Once a month we randomly choose a system or database, go to our warm site, and perform a test restoration exercise on bare metal to ensure we can recover our data.
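
The "randomly choose a system" step is also easy to script. A purely illustrative Python sketch, with an invented inventory list and a per-month seed so everyone agrees on the pick without storing any state:

```python
import datetime
import random

# Hypothetical inventory; in practice this would come from a CMDB or config file.
SYSTEMS = ["web01", "db01", "mailstore", "fileserver", "erp-app"]

def pick_restore_candidate(seed):
    """Deterministically pick this month's restore-test target from the seed."""
    return random.Random(seed).choice(SYSTEMS)

if __name__ == "__main__":
    month = datetime.date.today().strftime("%Y-%m")   # e.g. "2024-05"
    print("This month's bare-metal restore test:", pick_restore_candidate(month))
```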

Honestly, if your data is very important to you, it would be in your best interest to invest in some software to manage your backups for you. There are hundreds of products out there for this, from the cheap and simple to the enterprise class.

If you are relying on a set of hand-written scripts running in the crontab for your company's backups, sooner or later you will likely get burned.

Solution 4:

We have 60%-size 'Reference' versions of our 'Production' systems. We use them for final testing of changes and restore 'Production' backups onto them -- this tests the backups and ensures both environments stay in step with each other.

Solution 5:

One approach is to script a "recovery" job to run periodically, for instance one that grabs a specific text file from the most recent backup and emails you its contents. If it's possible, this should -- at least sometimes -- be done using a different box than the one that created or backed up the data, just to ensure it will work if you should need to do so. The advantage is that you can be sure your encryption/decryption, compression, and storage mechanisms are all working.
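
As an illustration, here is a hedged Python sketch of such a recovery canary; it assumes nightly gzipped tar archives under /backups and a known text file present in every backup, and all the names and addresses are placeholders:

```python
#!/usr/bin/env python3
"""Restore-canary sketch: pull one known text file out of the newest backup
archive and mail its contents. Archive layout and addresses are assumptions."""

import glob
import os
import smtplib
import tarfile
from email.message import EmailMessage

BACKUP_GLOB = "/backups/www-*.tar.gz"     # hypothetical nightly archives
CANARY = "var/www/html/canary.txt"        # file expected in every backup (tar stores no leading slash)
ALERT_TO = "ops@example.com"              # placeholder address

def newest_backup():
    candidates = glob.glob(BACKUP_GLOB)
    if not candidates:
        raise RuntimeError("no backup archives found matching %s" % BACKUP_GLOB)
    return max(candidates, key=os.path.getmtime)

def read_canary(archive):
    with tarfile.open(archive, "r:gz") as tar:
        member = tar.extractfile(CANARY)  # raises KeyError if the canary is missing
        if member is None:
            raise RuntimeError("%s is not a regular file in %s" % (CANARY, archive))
        return member.read().decode("utf-8", errors="replace")

def mail(subject, body):
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = "restore-check@example.com"
    msg["To"] = ALERT_TO
    msg.set_content(body)
    with smtplib.SMTP("localhost") as smtp:   # assumes a local mail relay
        smtp.send_message(msg)

if __name__ == "__main__":
    archive = newest_backup()
    mail("restore canary from %s" % os.path.basename(archive), read_canary(archive))
```

Running it from a different box than the one that wrote the archives, via cron, also proves that the backup storage is reachable from somewhere other than the machine you may have just lost.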

This is a little more involved for specialized backups such as email and database servers, but performing some kind of small-scale recovery from a small DB or a brick-level mailbox backup and verifying the contents is certainly possible.
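
For the database case, one possible shape for that small-scale check, assuming PostgreSQL command-line tools, a custom-format dump, and a throwaway scratch database (all names invented), would be:

```python
import subprocess

# Hypothetical names: a custom-format pg_dump file and a throwaway scratch database.
DUMP = "/backups/appdb-latest.dump"
SCRATCH_DB = "restore_test"
CHECK_SQL = "SELECT count(*) FROM customers;"   # invented table; adjust to your schema

def restore_and_check():
    """Recreate the scratch database, load the dump, and run one sanity query."""
    subprocess.run(["dropdb", "--if-exists", SCRATCH_DB], check=True)
    subprocess.run(["createdb", SCRATCH_DB], check=True)
    subprocess.run(["pg_restore", "--dbname", SCRATCH_DB, DUMP], check=True)
    out = subprocess.run(["psql", "-d", SCRATCH_DB, "-t", "-A", "-c", CHECK_SQL],
                         check=True, capture_output=True, text=True)
    return out.stdout.strip()

if __name__ == "__main__":
    print("row count after test restore:", restore_and_check())
```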

This approach also shouldn't replace a periodic full restore to ensure you can recover data in the event of an emergency -- it just allows you to be a little more confident about the integrity of your day-to-day backup job.