Where do you find your MTBF data?

Mean time between failures may be difficult to interpret, but there is a wealth of statistical methods you can use if you have some hard data.

Trouble is, nobody reports their MTBF numbers anymore. (Other than hard drive manufacturers, anyway.)

Where do you go to find MTBF data for components and servers?


Solution 1:

Why MTBF doesn't matter

The mean time between failures number isn't as important as the unrecoverable error rate. MTBF deals with complete failure of the part, i.e. the whole drive. But that number is meaningless when a single unrecoverable bit error will cause a RAID 5 panic and bring the hot spare into play.

While the MTBF for professional and consumer level drives has increased by an order of magnitude in recent years, the unrecoverable error rate has stayed relatively constant. For consumer SATA drives it is typically specified as one error per 10^14 bits read, which works out to roughly one bad bit per 12 TB read (source).

Why you should lose sleep over your RAID 5 array

So, that is only about six full passes of a brand spanking new 2 TB drive. How long does it take to read 12 TB of data? A lot less time than the MTBF of that drive.
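The arithmetic above is easy to sanity-check. A minimal sketch; the 100 MB/s sustained read rate is my own assumption for illustration, not a figure from this answer:

```python
# One unrecoverable read error (URE) per 10^14 bits, per typical consumer SATA specs.
BITS_PER_URE = 1e14

tb_per_ure = BITS_PER_URE / 8 / 1e12   # bits -> bytes -> decimal terabytes
passes = tb_per_ure / 2.0              # full read passes of a 2 TB drive

# Assumed sustained read rate of 100 MB/s (hypothetical).
hours_to_read = (BITS_PER_URE / 8) / 100e6 / 3600

print(f"~{tb_per_ure:.1f} TB read per expected URE")
print(f"~{passes:.1f} passes of a 2 TB drive")
print(f"~{hours_to_read:.0f} hours to read that much data")
```

Under those assumptions you hit the expected URE after roughly 12.5 TB, i.e. about a day and a half of continuous reading, versus an MTBF quoted in hundreds of thousands of hours.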

http://storagemojo.com/2008/02/18/latent-sector-errors-in-disk-drives/

What is more concerning is the chance of a second read failure on a RAID 5 array consisting of drives that large. With a seven-drive RAID 5 array of 1 TB disks, the probability of a second read failure while doing a RAID rebuild is roughly 50%.
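You can reproduce the order of magnitude of that figure from the URE rate alone. A rough sketch, assuming each bit read during the rebuild fails independently at 10^-14 and that the rebuild reads the six surviving 1 TB drives in full; this gives closer to 40% than 50%, since the published estimates count the data read somewhat differently, but the conclusion is the same:

```python
import math

URE_RATE = 1e-14        # unrecoverable errors per bit read (consumer SATA spec)
SURVIVING_DRIVES = 6    # a 7-drive RAID 5 array after one drive has failed
DRIVE_BYTES = 1e12      # 1 TB per drive

bits_read = SURVIVING_DRIVES * DRIVE_BYTES * 8

# P(at least one URE) = 1 - (1 - rate)^bits; log1p/expm1 keep this
# numerically stable for a tiny rate and a huge exponent.
p_fail = -math.expm1(bits_read * math.log1p(-URE_RATE))
print(f"P(URE during rebuild) = {p_fail:.0%}")
```

Either way, a rebuild of an array this size is more likely than not to be racing a coin flip, which is the point of the ZDNet piece linked below.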

http://blogs.zdnet.com/storage/?p=162

Solution 2:

It is a shame that people think MTBF figures don't apply to complex systems. The real problem (AFAIK) is that the manufacturers don't publish MTBF figures for their hardware modules. These are figures that should by all rights be available. Dell saying "Dell no longer lists specific MTBFs for their servers" is actually atrocious! They may as well say "Well, our stuff is really not reliable enough to be used anywhere an MTBF figure is required".

The reliability engineer (or guy wearing the RE's hat) is supposed to limit the scope of the availability study. This is often limited to the hardware modules.
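Once the scope is limited to hardware modules, the availability study itself is simple arithmetic over MTBF and MTTR. A minimal sketch; the 50,000-hour MTBF and 4-hour MTTR are hypothetical numbers, not vendor figures:

```python
MTBF_HOURS = 50_000   # hypothetical hardware-module MTBF
MTTR_HOURS = 4        # hypothetical mean time to repair

# Steady-state availability of a repairable module: uptime fraction.
availability = MTBF_HOURS / (MTBF_HOURS + MTTR_HOURS)
print(f"Availability: {availability:.5%}")
```

This is exactly why the vendor's MTBF number matters: without it, the left-hand side of that fraction is a guess.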

As for the classification of what constitutes a failure... well, that's why we perform an FMECA (failure mode, effects and criticality analysis).

Sure, systems are complex and failure modes include software failures, but that's often not the scope of the study. We want MTBF figures for the hardware. Ask your salesman to provide them; it is their technical responsibility to provide them to you. If they refuse or sidestep, go somewhere that has telecom-grade servers with mandated availability figures for the hardware.