Mean Time Between Failures -- SSD

Solution 1:

Drive manufacturers specify the reliability of their products in terms of two related metrics: the annualized failure rate (AFR), which is the percentage of disk drives in a population that fail per year (with shorter tests scaled up to a yearly estimate); and the mean time to failure (MTTF).

The AFR of a new product is typically estimated from accelerated life and stress tests, or from field data on earlier products. The MTTF is then estimated as the number of power-on hours per year divided by the AFR. A common assumption for drives in servers is that they are powered on 100% of the time.

http://www.cs.cmu.edu/~bianca/fast/
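
To make that relationship concrete, here's a minimal Python sketch of the conversion (the ~8,766 power-on hours per year follow from the 24/7 assumption above; the function names are just for illustration):

    # Convert between AFR and MTTF, assuming drives are powered on 24/7.
    HOURS_PER_YEAR = 365.25 * 24  # ~8766 power-on hours per year

    def afr_to_mttf(afr):
        """MTTF in hours from an annualized failure rate (0.006 = 0.6%/yr)."""
        return HOURS_PER_YEAR / afr

    def mttf_to_afr(mttf_hours):
        """Annualized failure rate from an MTTF given in hours."""
        return HOURS_PER_YEAR / mttf_hours

    print(afr_to_mttf(0.006))      # ~1,461,000 hours
    print(mttf_to_afr(1_500_000))  # ~0.0058, i.e. ~0.58% per year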

An MTTF of 1.5 million hours sounds somewhat plausible.

That would roughly correspond to a test with 1000 drives running for 6 months and 3 drives failing.
The AFR would be (3 failures / 1000 drives) * (12 months / 6 months) = 0.6% per year, and the MTTF = 1 year / 0.6% ≈ 1,461,000 hours, or about 167 years.
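
The same arithmetic as a quick Python check (using the hypothetical test figures above):

    # Hypothetical test from above: 1000 drives, 6 months, 3 failures.
    drives, test_months, failures = 1000, 6, 3
    HOURS_PER_YEAR = 8766

    # Scale the observed failure fraction up to a full year.
    afr = (failures / drives) * (12 / test_months)  # 0.006 -> 0.6% per year
    mttf_hours = HOURS_PER_YEAR / afr               # ~1,461,000 hours
    mttf_years = 1 / afr                            # ~167 years

    print(f"AFR  = {afr:.1%} per year")
    print(f"MTTF = {mttf_hours:,.0f} hours (~{mttf_years:.0f} years)")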

A different way to look at that number: if you have 167 drives and leave them running for a year, the manufacturer claims that on average you'll see one drive fail.

But I expect that is simply the constant "random" mechanical/electronic failure rate.

Assuming that failure rates follow the bathtub curve, as mentioned in the comments, the manufacturer's marketing team can massage the reliability numbers a bit: for instance by not counting DOAs (dead-on-arrival units that passed quality control but fail when the end-user installs them), and by stretching the DOA definition to also exclude units in the early-failure spike. And because testing isn't performed for long enough, you won't see age effects either.

I think the warranty period is a better indication of how long a manufacturer really expects an SSD to last!
That definitely won't be measured in decades or centuries...


Related to the MTBF is the write endurance: NAND cells can only support a finite number of write cycles. A common metric for this is the total write capacity, usually given in TB. Alongside the other performance requirements, that is one big limiter.

To allow a more convenient comparison between different makes and differently sized drives, the write endurance is often converted to a daily write capacity expressed as a fraction of the drive capacity.

Assuming that a drive is rated to live as long as it's under warranty:
a 100 GB SSD may have a 3-year warranty and a rated write capacity of 50 TB:

        50 TB
---------------------  = 0.46 drive writes per day
3 * 365 days * 100 GB
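
The same calculation as a small Python sketch (the 100 GB / 50 TB / 3-year figures are the example values above):

    # Drive writes per day (DWPD) from rated endurance, capacity and warranty.
    capacity_gb = 100       # drive capacity
    endurance_tb = 50       # rated total write capacity over the warranty
    warranty_years = 3

    warranty_days = warranty_years * 365
    dwpd = (endurance_tb * 1000) / (warranty_days * capacity_gb)
    print(f"{dwpd:.2f} drive writes per day")  # ~0.46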

The higher that number, the better suited the disk is for write-intensive I/O.
At the moment (end of 2014) value server-line SSDs sit at 0.3-0.8 drive writes/day, mid-range drives are increasing steadily from 1-5, and high-end drives seem to skyrocket, with write endurance levels of up to 25 * the drive capacity per day for 3-5 years.

Some real-world tests show that the vendor claims can sometimes be massively exceeded, but running equipment way past the vendor limits isn't usually an option in an enterprise setting... Instead, buy drives correctly spec'd for your purposes.

Solution 2:

Unfortunately the MTBF isn't what most people think...

  • It is not how long an individual drive will last.

    Manufacturers expect their drives to last as long as the warranty; after that it really isn't their problem. Older spinning-platter hard drives will seize up after 10 or so years. Integrated circuits last an extremely long time, but other components (notably capacitors) wear out after a somewhat predictable number of cycles.

  • It is how many of these drives you would need running to expect, on average, 1 drive to fail every hour.

    As others have pointed out, manufacturers do various testing over a reasonable period of time and determine a failure rate. There's a fair amount of variance in these sorts of tests, and marketing often has "input" as to what the final number should be. Regardless, they make a best-effort guess as to how many drives would be needed to average one failure per hour.

    For situations with fewer drives you can infer a statistical probability of failure from the MTBF (see the sketch after this list), but keep in mind that failures in well-designed products should follow a "bathtub" curve - that is, higher failure rates when devices are initially put into service and after their warranty period has expired, with lower failure rates in between.
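
A minimal Python sketch of both readings - assuming a constant failure rate (an exponential lifetime model), which per the bathtub curve only holds for the flat middle of a drive's life:

    import math

    mtbf_hours = 1_500_000        # vendor-claimed MTBF from above
    HOURS_PER_YEAR = 8766

    # Fleet reading: expected failures per hour in a population of N drives.
    n_drives = 1_500_000
    print(n_drives / mtbf_hours)  # 1.0 failure per hour, on average

    # Single-drive reading: probability of failing within t hours,
    # P(failure by t) = 1 - exp(-t / MTBF).
    t = 5 * HOURS_PER_YEAR        # five years of 24/7 operation
    print(1 - math.exp(-t / mtbf_hours))  # ~0.029, i.e. ~2.9%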