How to do a restore test with limited testing resources?

How does a small organization with limited resources go about doing a restore test of its data backup system?

The exhortation to "Test your backups!" seems unrealistic given what a full-scale restore test would involve if it must not affect the mainline systems.

 

Assume the organization doesn't have tens of thousands of dollars worth of reserve server capacity just lying around to allocate for a temporary spin-up of a full test environment to verify the nightly backups are restorable.

Is there a way to justify purchasing all the mainline hardware a second time, just to do annual restore testing, when it would otherwise sit in storage, powered down and unused?

 

Other Server Fault discussions on media restore testing have suggested using a separate tape drive to confirm that the media is usable in another device.

For a small site with only a few servers and a single production tape drive, it seems hard to justify buying an additional LTO-7 tape drive for thousands of dollars, plus additional licensing for the backup software to go with it, just to use it for a once-per-year media restore / test environment verification process and then leave it on a shelf until next year's test.


Solution 1:

You test your backups primarily to test your restore procedures. When you are in a crisis and everybody else is panicking, you will be competent, confident and calm: you will know exactly what to do and roughly how long the restore will take, because by then restoring backups is a routine event.

The second thing you probably want to do is test data integrity: once you have restored your critical data, can production be resumed? Is anything corrupted or incomplete?

You can and probably should test both of those things one small piece at a time. Only once you have the basics down should you attempt restoring a whole datacenter.

If you make backups of file systems and network shares, for instance, a suitable test would be to restore a specific directory to an alternate location and compare file sizes, hashes and permissions with the original.
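As a minimal sketch of such a comparison, the following Python walks the original directory and checks each file against its restored counterpart. The function names `file_sha256` and `compare_trees` are made up for this illustration and are not part of any backup product. It only looks at files present in the original, and files that changed after the backup ran will naturally show up as differences.

```python
import hashlib
import os
import stat


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(original_root, restored_root):
    """Return (relative_path, problem) tuples; an empty list means no differences found."""
    problems = []
    for dirpath, _dirnames, filenames in os.walk(original_root):
        for name in filenames:
            original = os.path.join(dirpath, name)
            if not os.path.isfile(original):
                continue  # skip sockets, broken symlinks and similar entries
            relative = os.path.relpath(original, original_root)
            restored = os.path.join(restored_root, relative)
            if not os.path.isfile(restored):
                problems.append((relative, "missing in restore"))
                continue
            o_stat, r_stat = os.stat(original), os.stat(restored)
            if o_stat.st_size != r_stat.st_size:
                problems.append((relative, "size differs"))
            elif file_sha256(original) != file_sha256(restored):
                problems.append((relative, "hash differs"))
            # Compare only the permission bits, not the file type bits.
            if stat.S_IMODE(o_stat.st_mode) != stat.S_IMODE(r_stat.st_mode):
                problems.append((relative, "permissions differ"))
    return problems
```

Calling `compare_trees("/srv/share/projects", "/restore-test/projects")` (both paths are placeholders) would return a list you can review or attach to the test report. Ownership and ACLs are not checked here and would need platform-specific handling.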

The next time you need to clone a database for testing, restore a production database from backup instead.

Do a "bare-metal" OS restore on a VM if need be.


But backups and restores are just one aspect of a larger disaster recovery strategy and business continuity plan.

What will your business do if your current location is lost to a disaster (fire, flooding, hurricane etc.)? Can it continue to operate from other existing locations? Or, if yours is the only location, will the business simply go bankrupt, or will insurance money be used to rent emergency offices/containers?

That was the BCP strategy a couple of years ago at one company: a contract with HP, or maybe IBM at the time, to supply a datacenter in a container once a year for complete datacenter disaster recovery tests, and to keep one on standby in case of an actual disaster.

That company had one office facility, with only tapes off-site (or maybe a tape robot) and everything else in-house. The idea was that renting temporary furnished office space, getting internet connectivity, rerouting telephone numbers, and getting desktops and printers would be mostly commodity and easy to arrange, but IT slightly less so. The cost-benefit calculations for a twin datacenter were unfavourable.

So initially every 6 months, and later once a year, they did a complete BCP test on temporarily rented hardware: deploying VMware, restoring the backup server, and restoring VMs with AD domain controllers, mail servers, database & application servers and file shares.

A more contemporary BCP strategy could be cloud-based: keep an off-premises backup copy online and test your DR restore in the cloud as well. If you only need the VMs for a couple of days, even a fairly large number of them won't break the bank.

Solution 2:

To paraphrase an old adage:

disaster is certain, restore - not quite

In short, backup and restore tests are absolute necessities. To have a good backup and restore plan, I would like to stress the following points:

  • be clear in communicating to management that periodic restore tests are a true need. This is often the hardest part, as management tends to view anything without a direct, immediate benefit as superfluous. The sad reality is that their data is at risk, and they need to understand that periodic restores, albeit with an associated cost, are a worthwhile investment.
  • on your part, try very hard to avoid proprietary binary blobs for storing your backups: they can be hard to inspect and offer little or no possibility of partial recovery. Strongly prefer an open, inspectable file format (such as tar) or, even better, use rsync (or a similar tool) to have a filesystem-level backup of your data. With such tools you can very easily inspect your backup and see at a glance whether all (or most) of it is present and accessible.
  • for fast restores, try to have a binary image (via snapshot) of your critical virtual machines. This has the added advantage of being immediately inspectable by simply importing/launching it on any workstation equipped with compatible virtualization software (nowadays all major virtualization platforms have free trial versions which fit the bill quite well for this kind of "cheap" restoring)
  • for databases, use the appropriate dumping tools and restore the dump inside a virtual machine, then ask a single user to use the restored database and do a quick inspection to see if the application works and if recent data (i.e. yesterday's) is present
  • when your backup and restore procedure works, document it: when something goes wrong, you will have a very clear operational plan to follow, which decreases stress and increases the chances of success.
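To illustrate the point about inspectable formats, here is a small sketch that uses Python's standard `tarfile` module to read a tar backup end to end and report which paths you expected to find are absent. The function name `check_tar_backup` and the example paths are assumptions for this illustration. A truncated or otherwise unreadable archive should surface as an exception while the members are being read.

```python
import tarfile


def check_tar_backup(archive_path, expected_paths):
    """Read every regular file in the archive; return expected paths that are absent."""
    seen = set()
    # "r:*" lets tarfile detect the compression (none, gzip, bz2, xz) by itself.
    with tarfile.open(archive_path, "r:*") as archive:
        for member in archive:
            seen.add(member.name.rstrip("/"))
            if member.isfile():
                stream = archive.extractfile(member)
                # Read the whole member so damaged data is noticed now,
                # not during a real restore.
                while stream.read(1 << 20):
                    pass
    return sorted(set(expected_paths) - seen)
```

For example, `check_tar_backup("nightly.tar.gz", ["etc/hosts", "home/alice/report.odt"])` returns an empty list when both paths are in the archive. This only shows that the archive is readable and contains what you expect. It is not a substitute for an actual restore.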

For fast, cost-effective restores it is critical to make ample use of temporary virtual machines, run on cheap hardware (read: retired servers or workstations). If disk space is a problem, make extensive use of thin provisioning. If available RAM is the problem, restore only a small subset of VMs (even a single one) each time.
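One way to plan those subsets is sketched below: group the VMs into test batches whose combined RAM fits on the test host, placing the largest VMs first. The function name, the VM names and the RAM figures are all hypothetical, and the grouping is a simple first-fit heuristic rather than an optimal packing.

```python
def plan_restore_batches(vm_ram_gb, host_ram_gb):
    """Group VMs into restore-test batches whose total RAM fits the test host."""
    batches = []  # each entry is [remaining_ram_gb, [vm names]]
    # Largest VMs first; ties are broken by name so the plan is reproducible.
    for name, ram in sorted(vm_ram_gb.items(), key=lambda item: (-item[1], item[0])):
        if ram > host_ram_gb:
            raise ValueError(f"{name} needs {ram} GB, more than the test host has")
        for batch in batches:
            if batch[0] >= ram:
                batch[0] -= ram
                batch[1].append(name)
                break
        else:
            # No existing batch has room, so start a new one.
            batches.append([host_ram_gb - ram, [name]])
    return [names for _remaining, names in batches]


# Hypothetical inventory: VM name -> configured RAM in GB.
vms = {"dc01": 4, "mail01": 8, "db01": 12, "files01": 4}
print(plan_restore_batches(vms, host_ram_gb=16))
# prints [['db01', 'dc01'], ['mail01', 'files01']]
```

Each inner list is one restore session on the retired server. In practice you would leave some headroom for the hypervisor itself, for example by passing a `host_ram_gb` value lower than the physical RAM.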