Hot spare host vs cold spare host?

Sobrique explains how the need for manual intervention makes your proposed solution sub-optimal, and ewwhite talks about the probability of failure of various components. Both of those IMO make very good points and should be strongly considered.

There is however one issue that nobody seems to have commented on at all so far, which surprises me a little. You propose to:

make [the current hot spare host] a cold spare, take the hard drives and put them in the primary host and change the RAID from 1 to 1+1.

This doesn't protect you against anything the OS does on disk.

It only really protects you against disk failure, and moving from mirrors (RAID 1) to mirrors of mirrors (RAID 1+1) already greatly reduces the impact of that to begin with. You could get much the same result by increasing the number of disks in each mirror set (going from a two-disk RAID 1 to a four-disk RAID 1, for example), quite likely improving read performance during ordinary operations along the way.
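If you were to go the extra-mirror route with Linux software RAID, for instance, a four-disk mirror is just a couple of mdadm commands. A rough sketch only; the device and array names below are placeholders, not your actual layout:

    # Sketch only: device names are placeholders for your own disks.
    # Create a fresh 4-way mirror:
    mdadm --create /dev/md0 --level=1 --raid-devices=4 \
        /dev/sdb /dev/sdc /dev/sdd /dev/sde

    # Or grow an existing 2-disk mirror to 4 disks:
    mdadm --add /dev/md0 /dev/sdd /dev/sde
    mdadm --grow /dev/md0 --raid-devices=4

Either way, every disk in the set carries a full copy of the data, so reads can be spread across more spindles while writes still have to hit all of them.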

Well then, let's look at some ways this could fail.

  • Let's say you are installing system updates, and something causes the process to fail half-way; maybe the power and the UPS both fail, or maybe you have a freak accident and hit a crippling kernel bug (Linux is pretty reliable these days, but the risk is still there).
  • Maybe an update introduces a problem that you didn't catch during testing (you do test system updates, right?), requiring a failover to the secondary system while you fix the primary.
  • Maybe a bug in the file system code causes spurious, invalid writes to disk.
  • Maybe a fat-fingered (or even malicious) administrator does rm -rf ../* or rm -rf /* instead of rm -rf ./*.
  • Maybe a bug in your own software causes it to massively corrupt the database contents.
  • Maybe a virus manages to sneak in.

Maybe, maybe, maybe... (and I'm sure there are plenty more ways your proposed approach could fail.) However, in the end this boils down to your "the two sets are always in sync" "advantage". Sometimes you don't want them to be perfectly in sync.

Depending on what exactly has happened, that's when you want either a hot or cold standby that you can switch over to, or proper backups. Either way, RAID mirrors of mirrors (or plain RAID mirrors) don't help you if the failure mode involves much of anything other than a hardware storage device failure (disk crash). Something like ZFS' raidzN can likely do a little better in some regards, but not at all better in others.

To me, this would make your proposed approach a no-go from the beginning if the intent is any sort of disaster failover.


Yes, it's a bit old school. Modern hardware just doesn't fail that often. Focus either on making your applications more highly available (not always possible), or on the items needed to make your individual hosts more resilient...

For hosts:

  • Buy better hardware.
  • Ensure you have support contracts.
  • REGISTER your servers' support contracts (spare parts are stocked locally based on registration data!)
  • Use redundant power supplies, (hardware?) RAID, redundant fans.
  • If the server is not capable of accommodating the above redundant features, keep a spare chassis or components on hand to be able to self-repair in the event of failure.

In order of decreasing failure frequency, I see disks, RAM, power supplies, and fans most often... sometimes the system board or CPU. But those last two are where your support contract should kick in.


It's rather inefficient - not least because of the dependency on manual intervention to make the switch.

I have worked at places that run a hot DR site - literally, servers identical to the primary, ready to go instantly. However, the DR switchover is an automated process - we're not talking cabling, a bit of fiddling and a flick of a switch, but a process where pressing the button flips everything from one site to the other.
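To give a feel for what "the button" amounts to conceptually (this is not our actual tooling - every helper script named below is a hypothetical placeholder for whatever your environment uses):

    #!/bin/sh
    # Hypothetical one-button site failover; each helper is a placeholder.
    set -e
    ./fence_primary_site.sh         # make sure the primary site can no longer accept writes
    ./promote_standby_database.sh   # the standby database becomes the master
    ./start_services_at_dr_site.sh  # bring up the application tier at the DR site
    ./repoint_dns_or_vip.sh         # move the public entry point to the DR site
    ./notify_oncall.sh "DR failover complete"

The point is that all of the ordering and checking lives in the automation, not in a runbook a sleepy human has to follow at 4am.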

This approach is sickeningly expensive, but that's a business decision - acceptable risk vs. the money needed to deliver on the objective. As a rule, there's an exponential curve on recovery time objective - the nearer to zero it gets, the more it costs.

But that's what your question is really about: what is your recovery time objective, and what is the most effective way of achieving it? Waiting for a server to boot takes a few minutes. How long does it take someone to do the adjustments and 'recovery tasks' when it goes pop at 4am?

And how long is an acceptable outage?

I would suggest that if you want 'hot recovery' you should be thinking about clustering. You can do clustering fairly cheaply with good use of VMWare - 'failing over' to a VM, even from a physical host, means you're not running redundant hardware (well, N+1 rather than 2N).

If your RTO is long enough, then switch the box off. You may find that the RTO is sufficient that a cold rebuild from backup is ok.


The fact that it is old school doesn't necessarily make the use of a hot spare a bad idea.

Your main concern should be the rationale: what are the risks you run, and how does running a hot spare mitigate them? In my perception your hot spare only addresses hardware failure, which, although not uncommon, is neither the only operational risk you run nor the most likely one. The second concern is whether alternative strategies provide more risk reduction or significant savings.

Running a hot spare with multiple manual fail-over steps will take a long time and is likely to go wrong, but I've also seen automated failover with HA cluster suites turn into major cluster f*cks.

Another thing is that a hot or cold standby in the same location doesn't provide business continuity in the case of a local disaster.