What counts as a 'large' RAID 5 array?
Designing the reliability of a disk array:
- Find the URE rate of your drive (manufacturers don't like to talk about their drives failing, so you might have to dig to find it; it's usually 1/10^X, where X is commonly somewhere in the 12-18 range).
- Decide what is an acceptable risk rate for your storage needs†. Typically this is a <0.5% chance of failure, but it could be several percent for "scratch" storage, or <0.1% for critical data.
- Calculate the risk of hitting a URE during a rebuild (a quick sketch of this calculation follows this list):
1 - ( 1 - [Drive Size] x [URE Rate]) ^ [Data Drives‡] = [Risk]
For arrays with more than one disk of parity, or mirrors with more than a pair of disks, subtract the number of parity/mirror disks from the total when counting [Data Drives].
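To make the formula concrete, here's a minimal Python sketch of it. The function name and parameters are just mine for illustration, not from any library, and the drive size and URE rate have to use the same unit of data read (spec sheets usually quote the rate per bit read).

```python
def rebuild_risk(drive_size, ure_rate, data_drives):
    """Approximate chance of hitting at least one URE while rebuilding a degraded array.

    drive_size and ure_rate must use the same unit of data read.
    data_drives = total drives - parity/mirror drives (see ‡).
    """
    # drive_size * ure_rate approximates the expected number of UREs per drive,
    # so this is roughly the chance of reading one whole drive cleanly.
    clean_read_one_drive = 1 - (drive_size * ure_rate)
    # A rebuild has to read every surviving data drive cleanly.
    return 1 - clean_read_one_drive ** data_drives
```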
So I've got a set of four 1TB WD Green drives in an array. They have a URE rate of 1/10^14, and I use them as scratch storage:
1 - (1 - 1TB x 1/10^14 per byte) ^ 3 => ~3.3%
risk of hitting a URE while rebuilding the array after one drive dies. These are great for storing my junk, but I'm not putting critical data on there.
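Plugging my numbers into the sketch above reproduces that figure (assuming the 1/10^14 rate is taken per byte read and 1TB is counted as 2^40 bytes, which is roughly how the 3.3% falls out):

```python
# Four 1TB WD Green drives in RAID5: one dead, three data drives left to read.
risk = rebuild_risk(drive_size=2**40, ure_rate=1e-14, data_drives=3)
print(f"Rebuild risk: {risk:.1%}")  # ~3.3%
```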
†Determining acceptable failure risk is a long and complicated process. It can be summarized as Budget = Risk * Cost. So if a failure is going to cost $100 and has a 10% chance of happening, then you should have a budget of $10 to prevent it. This grossly simplifies the task of determining the risk, the costs of various failures, and the nature of potential prevention techniques - but you get the idea.
‡[Data Drives] = [Total Drives] - [Parity Drives]. A two disk mirror (RAID1) and RAID5 each have 1 parity drive. A three disk mirror (RAID1) and RAID6 each have 2 parity drives. It's possible to have more parity drives with RAID1 and/or custom schemes, but that's atypical.
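Just to spell out the † arithmetic (the numbers are the ones from the footnote):

```python
# Budget = Risk * Cost: cap what you spend on prevention at the expected cost of the failure.
failure_cost = 100.00    # what the failure would cost you
failure_chance = 0.10    # 10% chance of it happening
prevention_budget = failure_chance * failure_cost
print(f"Worth spending up to ${prevention_budget:.2f} to prevent it")  # $10.00
```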
This statistical equation does come with its caveats, however:
- The URE rate used is the advertised rate, and most drives rolling off the assembly line are actually better. You might get lucky and buy a drive that is orders of magnitude better than advertised; similarly, you could get a drive that dies of infant mortality.
- Some manufacturing lines have bad runs (where many disks in the run fail at the same time), so getting disks from different manufacturing batches helps to distribute the likelihood of simultaneous failure.
- Older disks are more likely to die under the stress of a rebuild.
- Environmental factors take a toll:
- Disks that are heat-cycled frequently (e.g. powered on/off regularly) are more likely to die.
- Vibration can cause all kinds of issues - see the video on YouTube of someone in IT yelling at a disk array.
- "There are three kinds of lies: lies, damned lies, and statistics" - Benjamin Disraeli
The reason that article exists is to draw attention to Unrecoverable Bit Error Rates on HDDs - specifically, your cheap 'home PC' disks, which typically have a factory spec of 1 / 10^14. That works out to roughly one URE per 12.5TB of data read, which, if you are doing a RAID-5 rebuild with 2TB disks, you hit quite quickly.
This means you should either:
- Use smaller RAID groups, and accept more wasted space.
- Use RAID-6 and accept the additional write penalty (about 50% higher than RAID-5).
- Buy more expensive disks - 'server grade' drives have a UBER spec of 1 / 10^16, which makes this largely a moot point (1.25PB between errors is a lot better than 12.5TB).
Generally I would suggest that RAID-6 is the way forward, but it'll cost you write performance - a quick comparison of the two URE specs is sketched below.
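To put rough numbers on that, here's a quick sketch of the rebuild risk for a hypothetical 6-disk RAID-5 of 2TB drives at both specs, using the same simple 1 - (1 - size x rate)^drives model as above (the rate is taken per bit read, which is where the 12.5TB and 1.25PB figures come from; the array size is just an example):

```python
# Rebuild risk for a hypothetical 6-disk RAID-5 of 2TB drives (5 surviving
# data drives must be read cleanly), comparing a consumer URE spec of 1/10^14
# against a server-grade 1/10^16. The specs are per bit read, so the drive
# size is expressed in bits.
drive_bits = 2e12 * 8  # 2TB drive in bits
data_drives = 5
for label, ure_rate in (("consumer 1/10^14", 1e-14), ("server-grade 1/10^16", 1e-16)):
    risk = 1 - (1 - drive_bits * ure_rate) ** data_drives
    print(f"{label}: {risk:.1%}")
# Roughly 58% vs 0.8% under this crude model - which is why RAID-5 rebuilds on
# big, cheap disks are considered risky, and why the better spec makes it moot.
```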