Mean Time to Failure (MTTF): When disk manufacturers post this, how should you interpret their numbers?

Solution 1:

First off:

MTTF = Mean Time To Failure
MTTR = Mean Time To Repair
MTBF = Mean Time Between Failures = MTTF + MTTR

MTBF is often more or less equal to MTTF, since a repair may take an hour while the MTTF may be tens of thousands of hours. Often, though, MTBF isn't applicable at all: defective products don't get repaired but simply replaced, because repair costs more than replacement.

Calculating MTTF is a complex statistical exercise involving the failure odds of each and every individual part. And it's not the linear thing people sometimes presume: an MTTF of 1,000,000 hours doesn't mean that out of 1000 devices exactly one will fail after 1000 hours, or that 1,000,000 devices will produce a failure within the first hour.
Many electronic devices follow the "bathtub curve",

[figure: bathtub-shaped failure rate curve]

where there are many failures early on, then a long time with hardly any failures, and near the end of life the number of failures rises again. In hard disks there are also some mechanical parts which have a more linear failure curve; this slowly ramps up from day 1.

If the manufacturer quotes, for instance, 1,000,000 hours MTTF (most often in POH, Power-On Hours), it means that on average the drive should last more than 100 years. Some drives will last longer, some will fail earlier. So despite the 1,000,000 hours it's perfectly possible to have a failure after 1000 hours. I once had a drive fail within a week, and then you have to think back to the bathtub curve. The replacement drive has been spinning happily for more than 50,000 hours.
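
To put some numbers on that, here is a minimal sketch assuming the simplest possible model: a constant failure rate of 1/MTTF (which deliberately ignores the bathtub shape). The time spans and variable names are just examples of mine, not anything the manufacturer publishes.

```python
import math

MTTF = 1_000_000                      # hours, as quoted by the manufacturer
rate = 1.0 / MTTF                     # constant-failure-rate assumption

def p_fail_within(hours: float) -> float:
    """Probability a single drive fails within `hours`, exponential model."""
    return 1.0 - math.exp(-rate * hours)

print(f"fails within 1 week of 24/7 use: {p_fail_within(7 * 24):.5f}")    # ~0.00017
print(f"fails within 1000 hours:         {p_fail_within(1000):.5f}")      # ~0.00100
print(f"fails within 5 years of POH:     {p_fail_within(5 * 8760):.3f}")  # ~0.043
print(f"survives its full MTTF:          {1 - p_fail_within(MTTF):.3f}")  # e^-1, ~0.368
```

Note the last line: even under this idealized model, only about a third of drives would actually reach the quoted MTTF.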

Solution 2:

If a piece of equipment has an MTBF of 1,000,000 hours of usage, that doesn't mean any individual unit can be expected to last 1,000,000 hours. Rather, it means, roughly, that if 1,000,000 units which are within their rated service lifetime are each operated for one hour, or 100,000 units for ten hours each (still within rated lifetime), or 60,000,000 for one minute each, etc., there will be roughly one failure in the lot. Note that rated service lifetime is entirely orthogonal to MTBF. Consider the following two types of widgets:

  1. Every widget, regardless of age, has a 0.1% chance of failing every hour.
  2. Out of every billion widgets, all but one will operate for precisely 61 minutes and then die; that one will die after 30 minutes; the widgets have a specified service lifetime of 60 minutes.

The first type of widget would have an average lifetime of about 1,000 hours, and also an MTBF of about 1,000 hours. The second would have an average lifetime of 61 minutes, but an MTBF of 1,000,000,000 hours within its service lifetime. While it may seem odd to say the second device has an MTBF almost a billion times as long as its expected lifetime, the MTBF is hardly a meaningless figure.
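
If you want to check those figures, here is a quick sketch of the arithmetic, with the widget parameters taken straight from the two bullet points above (the variable names are mine):

```python
# Widget type 1: a 0.1% chance of failing in any given hour, regardless of age.
p_hourly = 0.001
mean_lifetime_1 = 1 / p_hourly        # expected lifetime: ~1,000 hours
mtbf_1 = 1 / p_hourly                 # MTBF: also ~1,000 hours

# Widget type 2: of every 10**9 widgets, one dies at 30 minutes and the rest
# at 61 minutes; the rated service lifetime is 60 minutes.
n = 10**9
mean_lifetime_2 = ((n - 1) * 61 + 30) / n / 60          # in hours: ~1.017 (i.e. ~61 min)
hours_within_rating = n * 1.0                           # ~1 rated hour per widget
failures_within_rating = 1                              # only the 30-minute unit
mtbf_2 = hours_within_rating / failures_within_rating   # 1,000,000,000 hours

print(f"type 1: mean lifetime ~{mean_lifetime_1:.0f} h, MTBF ~{mtbf_1:.0f} h")
print(f"type 2: mean lifetime ~{mean_lifetime_2 * 60:.0f} min, MTBF ~{mtbf_2:,.0f} h")
```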

Suppose one is going to conduct an experiment that requires that 1,000,000 devices all work perfectly for an hour, after which they will all be scrapped. If any device fails, the entire experiment will be ruined. Which would be more useful--a device which will last an average of 1,000 hours but has an MTBF of only 1,000 hours, or a device which would last at most 61 minutes, but would have only a one in a billion chance of failing to meet that mark?
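
And a follow-up sketch of that experiment under the same assumptions: the chance that all 1,000,000 devices make it through their one hour.

```python
n_devices = 1_000_000

# Device A: average lifetime ~1,000 hours, but a 0.1% chance of dying each hour.
p_all_survive_a = (1 - 0.001) ** n_devices   # effectively zero (underflows to 0.0)

# Device B: lasts at most 61 minutes, but only a one-in-a-billion chance of
# failing inside its rated hour.
p_all_survive_b = (1 - 1e-9) ** n_devices    # ~0.999

print(f"experiment succeeds with device A: {p_all_survive_a:.3e}")
print(f"experiment succeeds with device B: {p_all_survive_b:.4f}")
```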

Solution 3:

Adding to stevenvh's answer: well-known disk manufacturers all do a burn-in run of new devices, as do manufacturers of electronic components. In hard disks there are not only overall MTBF and MTTF figures but also individual failure statistics for the blocks of the disk. In other words: some parts of the spinning platter may fail while the majority still reads and writes fine. These so-called "bad sectors" can be detected and then mapped out by the firmware inside the drive.

All drives today contain additional sectors in reserve which can then be used in place of the defective sectors. This is simply a precaution by the manufacturer: if they didn't do this, they couldn't sell the disk at the advertised capacity. If they build in an additional x % of hidden sectors as a reserve, they increase the cost by somewhat less than x % but achieve a much higher overall production yield.
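
As a rough illustration of that yield argument, here is a small sketch using made-up numbers: an assumed per-sector defect probability at burn-in and a Poisson approximation for the number of defects per drive. The figures are purely illustrative, not real manufacturer data.

```python
import math

# Toy numbers: one million user-visible sectors and an assumed
# per-sector defect probability at burn-in.
SECTORS = 1_000_000
P_DEFECT = 2e-6                          # assumption, not a real figure
expected_defects = SECTORS * P_DEFECT    # ~2 defective sectors per drive on average

def yield_with_spares(spares: int) -> float:
    """Fraction of drives sellable at full capacity: the probability that the
    number of burn-in defects does not exceed the spare pool (Poisson
    approximation to the binomial defect count)."""
    lam = expected_defects
    return sum(math.exp(-lam) * lam**k / math.factorial(k) for k in range(spares + 1))

print(f"yield with no spare sectors: {yield_with_spares(0):.1%}")   # ~13.5%
print(f"yield with 10 spare sectors: {yield_with_spares(10):.1%}")  # ~100%
```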

Disks today keep a count of bad sectors, which can also be read out with appropriate software. This count and other disk health parameters (e.g. temperature) are called SMART values.
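
For example, on Linux you can read these values with smartctl from the smartmontools package (usually as root). A minimal sketch, assuming the common attribute names below; the exact names and columns vary between vendors and firmware:

```python
import subprocess

# Read raw SMART attributes with `smartctl -A`; the device path and the two
# attribute names picked out here are just common examples, not universal.
DEVICE = "/dev/sda"
out = subprocess.run(
    ["smartctl", "-A", DEVICE], capture_output=True, text=True, check=True
).stdout

for line in out.splitlines():
    fields = line.split()
    # Typical attribute rows have 10 columns, with the raw value last.
    if len(fields) >= 10 and fields[1] in ("Reallocated_Sector_Ct", "Temperature_Celsius"):
        print(f"{fields[1]}: raw value {fields[9]}")
```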

Now, once the manufacturer has run the burn-in test of the drive, and any sectors that failed (or were about to) have been remapped by the drive's internal firmware, the "Bad Sector Count" SMART parameter is reset to 0. Only then is the drive delivered to customers.

Usually, thanks to this burn-in process, the customer no longer sees the start of the bathtub curve mentioned earlier. With luck, all we see is the failure likelihood slowly increasing over time.

So when you look at the MTTF quoted by the manufacturer, you can disregard the start of the bathtub curve in any failure modeling you might want to do.
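
In other words, a toy post-burn-in model might combine a flat rate derived from the quoted MTTF with a wear-out term, and skip infant mortality entirely. A minimal sketch, with made-up wear-out parameters:

```python
import math

MTTF_HOURS = 1_000_000     # manufacturer's quoted figure
WEAROUT_ONSET = 40_000     # hours before wear-out starts to dominate (assumed)
WEAROUT_SCALE = 20_000     # how quickly wear-out ramps up (assumed)

def hazard(t_hours: float) -> float:
    """Failure rate at age t: a flat floor from the MTTF plus a wear-out ramp.
    The infant-mortality part of the bathtub is deliberately omitted, assuming
    the manufacturer's burn-in has screened it out."""
    base = 1.0 / MTTF_HOURS
    wearout = max(0.0, t_hours - WEAROUT_ONSET) / (WEAROUT_SCALE ** 2)
    return base + wearout

def survival(t_hours: float, step: float = 100.0) -> float:
    """Probability of surviving to age t, integrating the hazard numerically."""
    acc, t = 0.0, 0.0
    while t < t_hours:
        acc += hazard(t) * step
        t += step
    return math.exp(-acc)

for years in (1, 3, 5, 10):
    t = years * 8760
    print(f"{years:2d} years powered on: survival ~ {survival(t):.3f}")
```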