Formula to calculate probability of unrecoverable read error during RAID rebuild
I want to compare the reliability of different RAID systems using either consumer (URE/bit = 1e-14) or enterprise (URE/bit = 1e-15) drives. The formula for the probability that a rebuild runs into a URE (ignoring mechanical problems, which I will take into account later) is simple:
error_probability = 1 - (1 - per_bit_error_rate)^bits_read
It is important to remember that this is the probability of getting AT LEAST one URE, not exactly one.
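A minimal Python sketch of this formula (ure_probability is just a name I picked; it assumes independent bit errors and uses log1p/expm1 so the tiny per-bit rate is not lost to floating-point rounding):

    import math

    def ure_probability(per_bit_error_rate, bits_read):
        """P(at least one URE) = 1 - (1 - p)^n for n independently read bits."""
        # log1p/expm1 keep precision where (1 - 1e-14) ** 4.8e13 would lose it
        return -math.expm1(bits_read * math.log1p(-per_bit_error_rate))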
Let's suppose we want 6 TB of usable space. We can get it with any of the following (the numbers are reproduced in the sketch after this list):
RAID1 with 1+1 disks of 6 TB each. During rebuild we read back 1 disk of 6TB and the risk is: 1-(1-1e-14)^(6e12*8)=38% for consumer or 4.7% for enterprise drives.
RAID10 with 2+2 disks of 3 TB each. During rebuild we read back only 1 disk of 3TB (the one paired with the failed one!) and the risk is lower: 1-(1-1e-14)^(3e12*8)=21% for consumer or 2.4% for enterprise drives.
RAID5/RAID Z1 with 2+1 disks of 3TB each. During rebuild we read back 2 disks of 3TB each and the risk is: 1-(1-1e-14)^(2*3e12*8)=38% for consumer or 4.7% for enterprise drives.
RAID5/RAID Z1 with 3+1 disks of 2 TB each (often used by users of SOHO products like Synologys). During rebuild we read back 3 disks of 2TB each and the risk is: 1-(1-1e-14)^(3*2e12*8)=38% for consumer or 4.7% for enterprise drives.
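As a sanity check, the ure_probability helper from the sketch above reproduces the four figures (the scenario labels are mine):

    TB_BITS = 1e12 * 8  # bits per (decimal) terabyte

    scenarios = {
        "RAID1  1+1 x 6TB, read 6TB": 6 * TB_BITS,
        "RAID10 2+2 x 3TB, read 3TB": 3 * TB_BITS,
        "RAID5  2+1 x 3TB, read 6TB": 2 * 3 * TB_BITS,
        "RAID5  3+1 x 2TB, read 6TB": 3 * 2 * TB_BITS,
    }

    for name, bits in scenarios.items():
        print(f"{name}: consumer {ure_probability(1e-14, bits):.1%}, "
              f"enterprise {ure_probability(1e-15, bits):.1%}")
    # consumer 38.1% / 21.3% / 38.1% / 38.1%, enterprise 4.7% / 2.4% / 4.7% / 4.7%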
Calculating the error probability for single-disk-tolerant systems is easy; it is more difficult to calculate it for systems that tolerate multiple disk failures (RAID6/Z2, RAID Z3).
If only the first disk is used for the rebuild, and the second one is re-read from the beginning in case of a URE, then the error probability is the one calculated above squared (0.38^2 ≈ 14.5% for consumer RAID5 2+1, 0.21^2 ≈ 4.5% for consumer RAID10 2+2). However, I suppose (at least in ZFS, which has full checksums!) that the second parity/mirror disk is read only where needed, meaning that only a few sectors have to be re-read: how many UREs can realistically occur on the first disk? Not many, otherwise the error probability for single-disk-tolerant systems would skyrocket even beyond what I calculated.
If I'm right, a second parity disk lowers the risk to practically negligible values.
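Here is a rough sketch of the two models for a consumer RAID5 2+1 (the 4 KiB sector size and the sector-level re-read behaviour are assumptions on my part, not verified ZFS internals):

    import math

    def ure_probability(p, bits):                        # same helper as above
        return -math.expm1(bits * math.log1p(-p))

    bits_read = 2 * 3e12 * 8                             # RAID5 2+1 of 3TB disks

    # Pessimistic model: the second disk is re-read in full -> square the single-disk risk
    p_single = ure_probability(1e-14, bits_read)         # ~38%
    print(f"full re-read model:   {p_single ** 2:.1%}")  # ~14.5%

    # Optimistic model (my supposition): only the sectors that hit a URE are re-read
    sector_bits = 4096 * 8                               # assumed 4 KiB sectors
    expected_ures = 1e-14 * bits_read                    # ~0.48 UREs expected on the first pass
    p_double = ure_probability(1e-14, expected_ures * sector_bits)
    print(f"sector re-read model: {p_double:.2e}")       # ~1.6e-10, essentially negligible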
Aside from the question: it is important to keep in mind that manufacturers overstate the URE rate of consumer-class drives for marketing reasons (to sell more enterprise-class drives), so even consumer-class HDDs can be expected to achieve 1e-15 URE/bit read.
Some data: http://www.high-rely.com/hr_66/blog/why-raid-5-stops-working-in-2009-not/
The enterprise-drive values I gave above therefore realistically apply to consumer drives too, and real enterprise drives have even better reliability (URE/bit = 1e-16).
As for the probability of mechanical failures, it is roughly proportional to the number of disks and to the time required for the rebuild.
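For example, a simple exponential failure model shows this proportionality (the MTBF and rebuild-time figures below are placeholders, not measurements):

    import math

    def mechanical_failure_probability(surviving_disks, rebuild_hours, mtbf_hours):
        # P(at least one more disk fails during the rebuild window)
        # ≈ surviving_disks * rebuild_hours / mtbf_hours when that product is small,
        # i.e. proportional to both the disk count and the rebuild time.
        return -math.expm1(-surviving_disks * rebuild_hours / mtbf_hours)

    # Placeholder values: 3 surviving disks, a 24 h rebuild, 1,000,000 h MTBF
    print(f"{mechanical_failure_probability(3, 24, 1_000_000):.4%}")  # ~0.0072%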
This is the best answer I have found, with the probability theory worked out too:
http://evadman.blogspot.com/2010/08/raid-array-failure-probabilities.html?showComment=1337533818123#c7465506102422346169
There are a number of sites and articles that attempt to address this question.
This site has calculators for RAID 0, 5, 10/50/60 levels.
The Wikipedia article on RAID levels has sections on RAID 0 and RAID 1 failure rates.
RAID 0:
Reliability of a given RAID 0 set is equal to the average reliability of each disk divided by the number of disks in the set:
That is, reliability (as measured by mean time to failure (MTTF) or mean time between failures (MTBF)) is roughly inversely proportional to the number of members – so a set of two disks is roughly half as reliable as a single disk. If there were a probability of 5% that the disk would fail within three years, in a two disk array, that probability would be increased to {P}(at least one fails) = 1 - {P}(neither fails) = 1 - (1 - 0.05)^2 = 0.0975 = 9.75%.
RAID 1:
As a simplified example, consider a RAID 1 with two identical models of a disk drive, each with a 5% probability that the disk would fail within three years. Provided that the failures are statistically independent, then the probability of both disks failing during the three-year lifetime is 0.25%. Thus, the probability of losing all data is 0.25% over a three-year period if nothing is done to the array.
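A quick sketch that reproduces both Wikipedia figures (just the arithmetic from the quotes above):

    p_disk = 0.05   # per-disk probability of failing within three years

    # RAID 0 with two disks: the array is lost if ANY disk fails
    p_raid0 = 1 - (1 - p_disk) ** 2   # 9.75%

    # RAID 1 with two disks, assuming independent failures: data is lost only if BOTH fail
    p_raid1 = p_disk ** 2             # 0.25%

    print(f"RAID 0: {p_raid0:.2%}, RAID 1: {p_raid1:.2%}")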
I've also found several blog articles about this subject, including this one, which reminds us that the independent drives in a system (the I in RAID) may not be that independent after all:
The naïve theory is that if hard disk 1 has probability of failure 1/1000 and so does disk 2, then the probability of both failing is 1/1,000,000. That assumes failures are statistically independent, but they’re not. You can’t just multiply probabilities like that unless the failures are uncorrelated. Wrongly assuming independence is a common error in applying probability, maybe the most common error.
Joel Spolsky commented on this problem in the latest StackOverflow podcast. When a company builds a RAID, they may grab four or five disks that came off the assembly line together. If one of these disks has a slight flaw that causes it to fail after say 10,000 hours of use, it’s likely they all do. This is not just a theoretical possibility. Companies have observed batches of disks all failing around the same time.
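A toy illustration of that last point (the 50% conditional failure probability is a number I made up purely to show the effect of correlation):

    p_fail = 1 / 1000                                   # marginal probability that one disk fails

    # Assuming independence: multiply the marginals
    p_both_independent = p_fail * p_fail                # 1e-6

    # With correlated failures you need the conditional probability instead.
    # Suppose a disk from the same flawed batch fails with 50% probability
    # once its sibling has failed (made-up number):
    p_second_given_first = 0.5
    p_both_correlated = p_fail * p_second_given_first   # 5e-4, 500x the naive estimate

    print(p_both_independent, p_both_correlated)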