For a server that can tolerate downtime, is it better to use a hot spare or a cold spare?
In other words, should I leave a spare disk on the shelf or set it up as a hot spare?
If the server has a little bit of downtime, it isn't the end of the world. It's not a web server or anything that needs to serve content 24/7. Are there any downsides (drive life/wear, etc.) to keeping the extra hard drive set up as a hot spare, or is it better to leave it packaged on a shelf somewhere if I don't require the immediate recovery time of a hot spare?
You can't test a cold spare for functionality; for all you know, that drive on the shelf doesn't work. Unless you're at capacity in your enclosure, use it as a hot spare.
It also means the array gets back to full redundancy without you having to go in person; useful over holiday weekends, etc.
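A hot spare also lets you sanity-check the disk on a schedule, which a shelf drive can't offer. Here is a minimal sketch, assuming a Linux host with smartmontools installed; the device path /dev/sdd is a hypothetical stand-in for your spare:

```python
#!/usr/bin/env python3
# Minimal sketch: poll SMART health on a hot-spare disk so you know it
# actually works before the array ever needs it. Assumes smartmontools
# is installed; /dev/sdd is a placeholder for your spare's device path.
import subprocess
import sys

SPARE_DEVICE = "/dev/sdd"  # hypothetical device path for the hot spare

def spare_is_healthy(device: str) -> bool:
    # `smartctl -H` prints the drive's overall health self-assessment;
    # on a healthy drive the output contains "PASSED".
    result = subprocess.run(
        ["smartctl", "-H", device],
        capture_output=True, text=True,
    )
    return "PASSED" in result.stdout

if __name__ == "__main__":
    if not spare_is_healthy(SPARE_DEVICE):
        print(f"WARNING: hot spare {SPARE_DEVICE} failed SMART health check")
        sys.exit(1)
    print(f"Hot spare {SPARE_DEVICE} reports healthy")
```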
When you say "a little bit of downtime", does that mean you can afford to just rebuild the server or restore from backup if you lose the RAID array itself, i.e., if multiple drives fail before a spare can rebuild?
Are the server and drives under a replacement warranty? What kind? (24x7x4, 9x5xNBD, or what?)
I'd choose it based on the following:
- You can restore from backup OR are willing to assume the risk of multiple drive failures AND you have a decent hardware warranty active = NO hot spare and NO cold spare at all (just replace the bad drive with the warranty service)
- COLD spare = I would use this option IF you have multiple servers using the same drive type and you want to save money by keeping a single cold spare drive on the shelf in case any of those servers has a bad HDD.
- HOT spare = I would use this option IF the server is critical and you can't afford having to restore from backup, OR you don't want the risk associated with multiple drive failures, OR you aren't diligent or don't have decent notification alerts to know when a drive goes bad (nothing worse than a drive going bad on Monday and you not knowing about it for two weeks, until you happen to go back into the data center); see the monitoring sketch after this list.
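On that last point, even a trivial cron job beats finding out two weeks later. A rough sketch for Linux software RAID, assuming md arrays and the standard /proc/mdstat format; how you deliver the alert (mail, pager, chat webhook) is up to you:

```python
#!/usr/bin/env python3
# Rough sketch: scan /proc/mdstat for degraded md arrays and complain
# loudly if one is found. Run it from cron; a nonzero exit makes cron
# mail the output to you.
import re
import sys

def degraded_arrays(mdstat_text: str) -> list[str]:
    # In /proc/mdstat, each array's status line contains something like
    # "[UU]" (all members up) or "[U_]" (one member failed/missing).
    # An underscore inside the brackets means the array is degraded.
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        m = re.match(r"^(md\d+)\s*:", line)
        if m:
            current = m.group(1)
        elif current and re.search(r"\[U*_+U*\]", line):
            degraded.append(current)
    return degraded

if __name__ == "__main__":
    with open("/proc/mdstat") as f:
        bad = degraded_arrays(f.read())
    if bad:
        print("DEGRADED arrays: " + ", ".join(bad))
        sys.exit(1)
    print("All md arrays healthy")
```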
A hot spare has its own dangers, as it leads to an automatic rebuild.
With the size of arrays in the TBs, and the amount of stale data, there is a real numerical chance the array finds another defect (an unrecoverable read error) during an automatic rebuild. This is aggravated by the long duration of rebuilds. A 20 hr rebuild? That's a long time to wait before you can do the Right Thing (TM).
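To put rough numbers on both risks, here is a back-of-the-envelope sketch; the URE spec of 1 per 1e14 bits, the 4 TB drive size, and the 60 MB/s rebuild rate are illustrative assumptions, not measurements:

```python
# Back-of-the-envelope numbers for a RAID 5 rebuild. All inputs are
# illustrative assumptions, not measured values.
URE_RATE = 1e-14        # unrecoverable read errors per bit read (consumer spec)
DRIVE_BYTES = 4e12      # 4 TB drives
SURVIVING_DRIVES = 3    # every surviving drive is read in full to rebuild
REBUILD_RATE = 60e6     # assumed sustained rebuild rate, bytes/second

# Chance of hitting at least one URE while reading all surviving drives:
bits_read = SURVIVING_DRIVES * DRIVE_BYTES * 8
p_at_least_one_ure = 1 - (1 - URE_RATE) ** bits_read

# How long the rebuild window stays open:
rebuild_hours = DRIVE_BYTES / REBUILD_RATE / 3600

print(f"P(>=1 URE during rebuild) ~ {p_at_least_one_ure:.0%}")  # ~62%
print(f"Rebuild duration ~ {rebuild_hours:.0f} hours")          # ~19 h
```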
Therefore, it is better not to have an automatic rebuild. In case of a drive failure, you want to verify the backup and the failover mechanisms before you initiate the rebuild.
To reduce the chance of a breakdown from a second drive failure in the window before the rebuild, you want an array that can survive two drive failures: RAID 6 (any two drives) or RAID 10 (two drives, as long as they are not in the same mirror pair); a quick comparison of the two is sketched below.
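For a sense of the trade-off between those two layouts, a quick sketch with illustrative numbers (the 8-drive array of 4 TB disks is an assumption, not a recommendation):

```python
# Quick comparison of the two layouts for N same-size drives.
# RAID 6 survives ANY two drive failures; RAID 10 always survives one,
# and survives a second only if it is not the first drive's mirror partner.
N = 8          # illustrative drive count
SIZE_TB = 4    # illustrative drive size

raid6_usable = (N - 2) * SIZE_TB
raid10_usable = (N // 2) * SIZE_TB

# With RAID 10, a random second failure kills the array only when it
# hits the surviving half of the already-degraded mirror pair:
p_raid10_dies_on_2nd = 1 / (N - 1)

print(f"RAID 6 usable: {raid6_usable} TB, tolerates any 2 failures")
print(f"RAID 10 usable: {raid10_usable} TB, "
      f"P(death on a random 2nd failure) = {p_raid10_dies_on_2nd:.1%}")
```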
Consider whether the cost of both may be justified. Having a hot spare ready to cut over ensures rapid recovery, and having a cold spare on top of that, ready to go, ensures a rapid return to redundancy.