Is it better practice to buy RAID disks individually vs. in bulk?
This may sound like an odd question, but it's generated some spirited discussion with some of my colleagues. Consider a moderately sized RAID array consisting of something like eight or twelve disks. When buying the initial batch of disks, or buying replacements to enlarge the array or refresh the hardware, there are two broad approaches one could take:
- Buy all the drives in one order from one vendor, and receive one large box containing all the disks.
- Order one disk apiece from a variety of vendors, and/or spread several single-disk orders out over a period of days or weeks.
There's some middle ground, obviously, but these are the main opposing mindsets. I've been genuinely curious which approach is more sensible in terms of reducing the risk of catastrophic failure of the array. (Let's define that as "25% of the disks fail within a time window equal to how long it takes to resilver the array once.") The logic being, if all the disks came from the same place, they might all have the same underlying defects waiting to strike. The same timebomb with the same initial countdown on the clock, if you will.
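For a rough sense of scale, here is a back-of-the-envelope sketch of that failure definition under the *opposite* assumption, i.e. that drive failures are completely independent. The array size, annualized failure rate, and resilver window below are placeholder numbers I picked for illustration, not measurements:

```python
# Probability that >= 25% of the disks fail within one resilver window,
# assuming failures are INDEPENDENT. (The whole "same timebomb" worry is
# that a shared batch defect breaks this independence assumption.)
from math import comb

n_disks     = 8      # size of the array (example value)
afr         = 0.02   # assumed annualized failure rate per disk (~2%)
window_days = 7      # assumed time to resilver the array once

p = afr * window_days / 365          # per-disk chance of failing in the window
k = -(-n_disks // 4)                 # ceil(25% of the disks)

p_catastrophe = sum(comb(n_disks, i) * p**i * (1 - p)**(n_disks - i)
                    for i in range(k, n_disks + 1))

print(f"per-disk probability within the window: {p:.6f}")
print(f"P(>= {k} of {n_disks} disks fail in one window): {p_catastrophe:.2e}")
```

Under independence that probability is vanishingly small, which is why the argument really hinges on whether disks from one batch fail independently at all.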
I've collected a couple of the more common pros and cons for each approach, but some of them feel like conjecture and gut instinct rather than hard evidence.
Buy all at once, pros
- Less time spent in research/ordering phase.
- Minimizes shipping cost if the vendor charges for it.
- Disks are pretty much guaranteed to have the same firmware version and the same "quirks" in their operational characteristics (temperature, vibration, etc.)
- Price increases or stock shortages are unlikely to stall the project midway.
- Each disk is on hand the moment it's needed for installation.
- Serial numbers are all known upfront, so disks can be installed in the enclosure in order of increasing serial number. Seems overly fussy, but some folks seem to value that. (I guess their management interface sorts the disks by serial number instead of hardware port order...?)
Buy all at once, cons
- All disks (probably) came from the same factory, made at the same time, of the same materials. They were stored in the same environment, and subject to the same potential abuses during transit. Any defect or damage present in one is likely present in all.
- If the new drives are being swapped one at a time into an existing array and each one needs to be resilvered individually, it could be weeks before the last disk from the order is installed and discovered to be faulty. The return/replacement window with the vendor may expire during this time.
- Can't take advantage of near-future price decreases that may occur during the project.
Buy individually, pros
- If one disk fails, it shares very little manufacturing/transit history with any of the other disks. If the failure was caused by something in manufacturing or transit, the root cause likely did not occur in any other disk.
- If a disk is dead on arrival or fails during the first hours of use, that will be detected shortly after the shipment arrives and the return process may go more smoothly.
Buy individually, cons
- Takes a significant amount of time to find enough vendors with agreeable prices. Order tracking, delivery failure, damaged item returns, and other issues can be time-consuming to resolve.
- Potentially higher shipping costs.
- A very real possibility exists that a new disk will be required but none will be on-hand, stalling the project.
- Imagined benefit. Regardless of the vendor or date purchased, all the disks came from the same place and are really the same. Manufacturing defects would have been detected by quality control and substandard disks would not have been sold. Shipping damage would have to be so egregious (and plainly visible to the naked eye) that damaged drives would be obvious upon unpacking.
If we're going simply by bullet point count, "buy in bulk" wins pretty clearly. But some of the pros are weak, and some of the cons are strong. Many of the bullet points simply state the logical inverse of some of the others. Some of these things may be absurd superstition. But if superstition does a better job at maintaining array integrity, I guess I'd be willing to go along with it.
Which group is most sensible here?
UPDATE: I have data relevant to this discussion. The last array I personally built (about four years ago) had eight disks. I ordered from one single vendor, but split the purchase into two orders of four disks each, about one month apart. One disk of the array failed within the first hours of running. It was from the first batch, and the return window for that order had closed in the time it took to spin everything up.
Four years later, the seven original disks plus one replacement are still running error-free. (knock on wood.)
In practice, people who buy from enterprise vendors (HPE, Dell, etc.) do not worry about this.
Drives sourced by these vendors are already spread across multiple manufacturers under the same part number.
An HP disk under a particular SKU may be HGST or Seagate or Western Digital.
(Same HP part number; different manufacturer, lot number, and firmware.)
You shouldn't try to outsmart the probability of batch failure, though. You're welcome to try if it gives you peace of mind, but it may not be worth the effort.
Good practices like clustering, replication and solid backups are the real protection for batch failures. Add hot and cold spares. Monitor your systems closely. Take advantage of smart filesystems like ZFS :)
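To make the "monitor your systems closely" part concrete, here is a minimal health-check sketch. It assumes a ZFS system with the `zpool` CLI on the PATH and leans on `zpool status -x`, which prints `all pools are healthy` when nothing is wrong; how you schedule it and where you send the alert is up to you:

```python
#!/usr/bin/env python3
# Minimal ZFS health-check sketch: run it from cron and make noise if
# `zpool status -x` reports anything other than healthy pools.
import subprocess
import sys

result = subprocess.run(["zpool", "status", "-x"],
                        capture_output=True, text=True)
report = result.stdout.strip()

if report == "all pools are healthy":
    sys.exit(0)                      # quiet exit, nothing to see

print("ZFS pool problem detected:\n")
print(report or result.stderr)       # per-pool error report from zpool
sys.exit(1)                          # non-zero exit so cron/monitoring notices
```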
And remember, hard drive failures aren't always mechanical...
With due deference to the answer from ewwhite, some sysadmins do deliberately order in batches. I would never, myself, order drives on an individual basis, but standard ops at the last place I worked in such a capacity was to order drives in batches. For a twelve-drive machine, SOP dictated that the drives be split into three batches, giving the machine a three-tier redundancy profile.
However, other small outfits I have consulted for have followed different protocols: some were not concerned with batches at all, while others split their batches across two or four arrays. The short answer is: do what feels appropriate for the level of service you need to achieve.
Side note: the last place I worked was certainly doing the right thing. An entire batch of drives in the app storage machine failed, and we discovered that this particular batch all shared the same fault. Had we not followed a batch protocol, we would have suffered a catastrophic loss of data.
Honest answer from someone who's spent a lot of time dealing with dying RAID arrays and difficult drives: don't have all your drives from the same batch if you can avoid it.
My experience only applies to spinning disks; SSDs have their own issues and benefits to consider when ordering in bulk.
Exactly how best to handle this depends mostly on how big the array is. If you're working with something like a six-drive array with two-drive redundancy, you can probably safely buy similar drives from three manufacturers and split the array that way.
If you're using an odd drive, or you're working with arrays that can't easily be partitioned like that, you can try other approaches: buy the same drive from different vendors, or, if you're buying in bulk, look through the shipment and try to separate the drives based on how likely they are to have been manufactured together.
If you're running a small enough array with the right underlying tech, it might even be worth your time to build it incrementally from heterogeneous disk supplies. Start with the minimum number of drives you can get away with and buy the next supply a month or two later, or when you fill the system. That also lets you get a feel for any issues there might be with the particular models you picked.
The reason behind this advice is a combination of two quirks of drives.
- MTBF is remarkably broken when you have a lot of drives with similar origins. In statistics we'd call this a sampling bias: because of the similarity of your samples, the averaging effects are less useful. If there's a fault with the batch, or even with the design itself (and it happens more often than you'd think), then drives from that batch will fail sooner than MTBF would suggest.
If the drives are spread out, you might get [50%, 90%, 120%, 200%] of MTBF, but if all the drives come from that 50% batch you've got a mess on your hands.
- RAID array rebuilds kill disks. No, really. If you get a drive failure and the array rebuilds, it's going to put extra load on the other drives while it scans the data off them. If you have a drive close to failure, the rebuild may well take it out, or it may already have a failed region that you just weren't aware of because that section hadn't been read recently.
If you've got a lot of drives from the same batch, the chances of this kind of cascade failure are much higher than if the drives come from different batches. You can mitigate this with regular patrol reads, scrubs, resilvering, or whatever the recommended practice is for the type of array you're using, but the downside is that it impacts performance and can take hours to complete. The sketch below gives a rough feel for why batch correlation matters here.
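As a rough illustration of both points, here is a small Monte Carlo sketch. Every number in it (array size, MTBF, rebuild time, the chance and severity of a shared batch defect) is a made-up assumption, and it ignores the extra stress the rebuild itself puts on the surviving drives; it only shows how a defect shared by a whole batch raises the odds of a second failure landing inside the rebuild window.

```python
# Monte Carlo sketch: chance that a second drive fails inside the rebuild
# window that follows the first failure, with and without a shared batch defect.
# All parameters are illustrative assumptions, not measured values.
import random

N_DRIVES      = 8          # array size
MTBF_HOURS    = 1_000_000  # nominal per-drive MTBF (exponential lifetime model)
REBUILD_HOURS = 24         # assumed time to rebuild/resilver after one failure
DEFECT_PROB   = 0.05       # chance that a given batch carries a latent defect
DEFECT_SCALE  = 0.05       # defective batch: lifetimes shrink to 5% of nominal
TRIALS        = 200_000

def trial(shared_batch: bool) -> bool:
    """True if the second-soonest failure lands within REBUILD_HOURS of the first."""
    if shared_batch:
        # One bulk order: a single coin flip decides whether ALL drives are defective.
        scale = DEFECT_SCALE if random.random() < DEFECT_PROB else 1.0
        scales = [scale] * N_DRIVES
    else:
        # Spread-out purchases: each drive gets its own independent coin flip.
        scales = [DEFECT_SCALE if random.random() < DEFECT_PROB else 1.0
                  for _ in range(N_DRIVES)]
    lifetimes = sorted(random.expovariate(1.0 / (MTBF_HOURS * s)) for s in scales)
    return (lifetimes[1] - lifetimes[0]) <= REBUILD_HOURS

for label, shared in (("one batch", True), ("spread out", False)):
    hits = sum(trial(shared) for _ in range(TRIALS))
    print(f"{label:>10}: P(2nd failure during rebuild) ~ {hits / TRIALS:.4%}")
```

The absolute numbers are meaningless; what matters is the ratio between the two cases, which is exactly what splitting purchases across batches is trying to improve.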
For some context on how wildly the longevity of drives varies, Backblaze publishes a regular drive failure stats report. I'm not affiliated with the company in any way, but they should know what they're talking about on the subject of drive reliability. An example is https://www.backblaze.com/blog/hard-drive-failure-rates-q1-2017/. Your sample set will likely be smaller, so outlying data can distort your own experience, but it's still a good reference.