How reliable are current 2 TByte consumer grade disk drives?

Most cheap SATA disk drives are rated with "1 non-recoverable read error per 10^14 bits read".

What does this mean?

10^14 bits is just 12.5 TByte. If I have a full 2 TByte disk and I copy it to a second disk, is there in fact a chance of roughly 1/6 that one of the files is corrupted?
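To make the arithmetic concrete, here is a rough sketch of the calculation. It assumes (my assumption, not something the datasheet says) that errors are independent per bit and occur at exactly the rated rate, and that "2 TByte" means 2 * 10^12 bytes. The function name is made up for illustration.

```python
import math

BITS_PER_RATED_ERROR = 1e14   # "1 non-recoverable read error per 10^14 bits read"
DISK_BYTES = 2e12             # 2 TByte, decimal (assumption)

def p_at_least_one_error(bytes_read, bits_per_error=BITS_PER_RATED_ERROR):
    """Probability of at least one unrecoverable read error, assuming
    independent per-bit errors at exactly the rated rate."""
    bits = bytes_read * 8
    # 1 - (1 - 1/N)^bits, written with log1p/expm1 to avoid rounding problems
    return -math.expm1(bits * math.log1p(-1.0 / bits_per_error))

expected_errors = DISK_BYTES * 8 / BITS_PER_RATED_ERROR
probability = p_at_least_one_error(DISK_BYTES)

print(f"expected errors per full read: {expected_errors:.2f}")
print(f"chance of at least one error:  {probability:.3f}")
```

Under these assumptions the expected number of errors per full read is 0.16 (roughly 1/6), and the chance of hitting at least one error comes out slightly lower, at about 0.15.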

If this happens, will the affected block be marked and reallocated? I think so, because if the read succeeded on retry, it would not be a non-recoverable read error.

However, I have been using many of these disk drives for a couple of years now and haven't noticed any increase in the bad block count. The RAID controller logs do not show any read problems either.

EDIT: The RAID controllers do a weekly patrol read of each disk, so that amounts to about 100 TByte per year. That is still less than 10^15 bits.
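For reference, the patrol read volume works out as follows. This is only a sketch: it assumes every weekly patrol read covers the full 2 TByte of a disk, and the variable names are mine.

```python
DISK_BYTES = 2e12              # 2 TByte per disk, decimal (assumption)
PATROL_READS_PER_YEAR = 52     # one full patrol read per week
BITS_PER_RATED_ERROR = 1e14    # rated: 1 error per 10^14 bits read

bytes_per_year = DISK_BYTES * PATROL_READS_PER_YEAR
bits_per_year = bytes_per_year * 8
# Errors per disk per year if the drive performed exactly at its rated limit
expected_errors_per_year = bits_per_year / BITS_PER_RATED_ERROR

print(f"{bytes_per_year / 1e12:.0f} TByte/year = {bits_per_year:.2e} bits/year")
print(f"expected errors per disk-year at the rated rate: {expected_errors_per_year:.2f}")
```

So each disk reads about 104 TByte (8.32 * 10^14 bits) per year from patrol reads alone, which is below 10^15 bits but several times 10^14.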

On the other hand, 4 out of 50 disks failed completely within 2 years, which increases the effective error rate.

I don't have enough data to make statistically significant statements, but in my case the actual error rate seems to be between 1 error per 10^14 bits and 1 error per 10^15 bits, which is consistent with the specification.


Solution 1:

The "1 read error in 10^14 bits" figure is just that: a statistical data point. It doesn't mean that you will see errors on any given disk, and it doesn't say that errors are spread evenly from the start of the disk's life to the end. It only means that the disk is not rated as highly as enterprise disks. In my experience (developing enterprise storage systems), neither consumer nor enterprise disks reach their full MTBF numbers, and I don't remember a big difference between them: there was some difference, but not a very big one.

When a read of a block fails, the disk puts the sector on a holding list. The next time the sector is written, the write is attempted and the sector is verified. If the verification succeeds, nothing further is done; if it fails, the sector is reallocated.
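The sequence above can be sketched as a toy model. This is not real firmware behavior, just an illustration of the steps as described; the class `ToyDrive` and all of its fields are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ToyDrive:
    """Toy model of the holding-list / reallocation steps (hypothetical)."""
    bad_media: set = field(default_factory=set)    # sectors with damaged media
    unreadable: set = field(default_factory=set)   # sectors whose contents cannot be read
    pending: set = field(default_factory=set)      # the "holding list"
    reallocated: set = field(default_factory=set)  # sectors remapped to spares

    def read(self, lba):
        if lba in self.unreadable:
            self.pending.add(lba)      # failed read: put the sector on the holding list
            return False
        return True

    def write(self, lba):
        self.unreadable.discard(lba)   # fresh data replaces the unreadable contents
        if lba in self.pending:
            self.pending.discard(lba)
            if lba in self.bad_media:  # verification after the write fails
                self.reallocated.add(lba)

drive = ToyDrive(bad_media={7}, unreadable={3, 7})
first_reads = (drive.read(3), drive.read(7))   # both fail and go on the holding list
pending_after_reads = set(drive.pending)
drive.write(3)   # verifies fine: nothing further is done
drive.write(7)   # verification fails: sector is reallocated
print(first_reads, pending_after_reads, drive.pending, drive.reallocated)
```

In this model sector 3 had a transient problem and simply leaves the holding list after a successful write, while sector 7 has damaged media and ends up reallocated.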

In many cases the disks and the RAID can correct bad sectors on the fly even before they get to be a big problem. There is a background media scan of the disk and the disk scrub of the RAID array and both of these work to protect the data. Enterprise storage arrays use finer grained checks to make sure that even slightly problematic disk sectors will be treated and fixed.

There are other issues with using consumer drives in a RAID array, and the lack of TLER (time-limited error recovery) is one of them. Without it, you may lose the whole disk over a single bad sector, because the disk stops responding until it manages to read the sector. TLER is the mechanism that keeps the RAID from declaring the disk failed when there is only a small media problem: with TLER enabled, the disk quickly gives up on the sector and lets the RAID handle the failure at its level.
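As a rough illustration of why the timeout matters, consider the sketch below. All numbers are made-up placeholders, not vendor specifications, and the function name is hypothetical; the only point is the comparison between how long the drive keeps retrying and how long the controller is willing to wait.

```python
def controller_keeps_disk(drive_recovery_s, controller_timeout_s):
    """True if the drive answers (even with an error) before the controller
    gives up on it and marks the whole disk as failed."""
    return drive_recovery_s <= controller_timeout_s

# Placeholder values for illustration only (assumptions, not real specs)
CONTROLLER_TIMEOUT_S = 8.0     # how long the RAID controller waits for a reply
RECOVERY_WITH_TLER_S = 7.0     # drive gives up quickly and reports the read error
RECOVERY_WITHOUT_TLER_S = 60.0 # drive keeps retrying the bad sector internally

with_tler = controller_keeps_disk(RECOVERY_WITH_TLER_S, CONTROLLER_TIMEOUT_S)
without_tler = controller_keeps_disk(RECOVERY_WITHOUT_TLER_S, CONTROLLER_TIMEOUT_S)

print(f"with TLER:    disk stays in the array = {with_tler}")
print(f"without TLER: disk stays in the array = {without_tler}")
```

In the first case the RAID sees a single failed read and can repair it from redundancy; in the second the whole disk is dropped for one bad sector.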

Solution 2:

You should be careful with RAID and consumer drives. Depending on your RAID controller, it may drop a disk that takes too long to respond because the disk lacks TLER.

What happens to an unrecoverable block is described here:

When a sector is found to be bad or unstable by the firmware of a disk controller, the disk controller remaps the logical sector to a different physical sector. In the normal operation of a hard drive, the detection and remapping of bad sectors should take place in a manner transparent to the rest of the system and in advance before data is lost. It should be remembered, however, that the damaging of the physical body of the hard drive does not solely affect one area of the data stored. Very often physical damages can interfere with parts of many different files.

As to your question about the 1/6 chance that a block is corrupted: that is true at the block level. However, operating systems and file systems have their own ways of dealing with bad blocks and recovering from them, so it is quite possible that the OS/FS recovers the bad block on its own without you noticing any corruption of files.