Is bit rot on hard drives a real problem? What can be done about it?

A friend and I have been talking about the problem of bit rot - bits on drives randomly flipping and corrupting data. It's incredibly rare, but with enough time it could be a problem, and it's impossible to detect.

The drive wouldn't consider it to be a bad sector, and backups would just think the file has changed. There's no checksum involved to validate integrity. Even in a RAID setup, the difference would be detected but there would be no way to know which mirror copy is correct.

Is this a real problem? And if so, what can be done about it? My friend is recommending ZFS as a solution, but I can't imagine flattening our file servers at work and putting Solaris and ZFS on them...


Solution 1:

First off: your file system may not have checksums, but your hard drive itself does. Every sector is stored together with ECC (error-correction) data, and the drive exposes the resulting error counters through S.M.A.R.T. Once too many bits flip, the error can't be corrected, of course. And if you're really unlucky, bits can change in such a way that the checksum still comes out valid; then the error won't even be detected. So nasty things can happen; but the claim that a single random bit flip will instantly corrupt your data is bogus.
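
To make the "really unlucky" case concrete, here's a small Python sketch (purely illustrative, not how drive ECC works) that brute-forces two different byte strings sharing the same CRC-32. It shows why a short checksum can, in principle, miss a multi-bit change; the codes drives actually use are far stronger than a 32-bit CRC:

```python
import os
import zlib

# Birthday search: with a 32-bit checksum, ~2**16 random inputs
# give roughly even odds that two distinct inputs collide.
seen = {}
while True:
    data = os.urandom(8)          # random 8-byte "sector"
    crc = zlib.crc32(data)
    if crc in seen and seen[crc] != data:
        print(f"collision: {seen[crc].hex()} and {data.hex()} "
              f"both have CRC-32 {crc:#010x}")
        break
    seen[crc] = data
```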

However, yes, when you put trillions of bits on a hard drive, they won't stay like that forever; that's a real problem! ZFS can do integrity checking every time data is read. This is similar to what your hard drive already does itself, but it's another safeguard: you sacrifice some space for checksums and gain resilience against data corruption.
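
Conceptually, a checksumming file system does something like the following sketch: keep a checksum alongside every block, verify it on each read, and fail loudly instead of silently returning rotten data. This is only a toy model, not ZFS's actual on-disk format:

```python
import hashlib

# Toy block store that keeps a checksum next to each block and
# verifies it on every read -- conceptually what a checksumming
# filesystem does (the real on-disk layout differs).
class ChecksummedStore:
    def __init__(self):
        self.blocks = {}   # block_id -> (data, sha256 digest)

    def write(self, block_id: int, data: bytes) -> None:
        self.blocks[block_id] = (data, hashlib.sha256(data).digest())

    def read(self, block_id: int) -> bytes:
        data, digest = self.blocks[block_id]
        if hashlib.sha256(data).digest() != digest:
            raise IOError(f"checksum mismatch in block {block_id}: bit rot detected")
        return data

store = ChecksummedStore()
store.write(0, b"important data")
# Simulate bit rot by flipping one bit behind the store's back:
data, digest = store.blocks[0]
store.blocks[0] = (bytes([data[0] ^ 0x01]) + data[1:], digest)
try:
    store.read(0)
except IOError as e:
    print(e)   # the corruption is caught on read instead of passed through
```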

When your file system does this, the probability of an error occurring without being detected becomes so low that you no longer have to worry about it, and you might decide that checksums built into the data storage format itself are unnecessary.

Either way: no, it's not impossible to detect.

But a file system, by itself, can never be a guarantee that every failure can be recovered from; it's not a silver bullet. You still must have backups and a plan/algorithm for what to do when an error has been detected.

Solution 2:

Yes, it is a problem, mainly as drive sizes go up. Most SATA drives have a URE (uncorrectable read error) rate of 1 in 10^14 bits. In other words, for roughly every 12 TB of data read, the vendor expects the drive to return a read failure (you can normally look this up on the drive spec sheets). The drive will continue to work just fine for all other parts of the platter. Enterprise FC & SCSI drives generally have a URE rate of 1 in 10^15 bits (roughly 120 TB), as do a small number of SATA drives, which helps.
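
The arithmetic behind the "one failure per ~12 TB" rule of thumb is straightforward; a quick back-of-the-envelope check in Python (using decimal terabytes and the vendor-quoted rate):

```python
# Expected unrecoverable read errors for a given amount of data read.
URE_RATE = 1e-14            # consumer SATA: 1 error per 1e14 bits read
BITS_PER_TB = 1e12 * 8      # bits in one decimal terabyte

for tb_read in (1, 12, 120):
    expected_errors = tb_read * BITS_PER_TB * URE_RATE
    print(f"read {tb_read:>3} TB -> ~{expected_errors:.2f} expected UREs")
# Reading 12 TB gives ~0.96 expected errors, hence "one per ~12 TB".
```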

I've never seen two disks stop rotating at the exact same time, but I have had a RAID 5 volume hit this issue (five years ago, with 5400 RPM consumer PATA drives). A drive fails, it's marked dead, and a rebuild starts onto the spare drive. The problem is that during the rebuild a second drive is unable to read one little block of data. Depending on who's doing the RAID, the entire volume might be dead or just that little block. Assuming only that one block is dead, reads of it will return an error, but if you write to it the drive will remap the sector to another location.
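
You can estimate how likely this scenario is: a RAID 5 rebuild has to read every surviving disk end to end, so the odds of hitting at least one URE along the way are surprisingly high. A rough sketch, assuming the vendor-quoted rate and independent errors:

```python
# Probability of hitting at least one URE while rebuilding a RAID 5
# array, i.e. while reading every surviving disk end to end.
def rebuild_ure_probability(disk_tb: float, surviving_disks: int,
                            ure_rate: float = 1e-14) -> float:
    bits_read = disk_tb * 8e12 * surviving_disks
    # Each bit is read cleanly with probability (1 - ure_rate):
    return 1 - (1 - ure_rate) ** bits_read

# e.g. a 5-drive array of 2 TB consumer disks (4 survivors to read):
print(f"{rebuild_ure_probability(2, 4):.0%}")   # roughly a coin flip
```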

There are multiple methods to protect against this. RAID 6 (or equivalent), which survives a double disk failure, is the best. Others include a URE-aware filesystem such as ZFS, and using smaller RAID groups so you statistically have a lower chance of hitting the URE drive limits (mirror large drives, RAID 5 smaller drives). Disk scrubbing and SMART also help, but they are not really protection by themselves; use them in addition to one of the above methods.

I manage close to 3000 spindles in arrays, and the arrays are constantly scrubbing the drives looking for latent UREs. I receive a fairly constant stream of them (every time the scrub finds one, it fixes it ahead of a drive failure and alerts me). If I were using RAID 5 instead of RAID 6 and one of the drives went completely dead, I'd be in trouble if the URE hit certain locations.
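
A scrub is conceptually just a background pass that reads every block while the redundancy is still intact, so latent errors can be repaired from the good copy before a drive dies. A minimal sketch over a mirrored pair, with hypothetical in-memory structures standing in for what array firmware actually does:

```python
import hashlib

# Conceptual scrub over a mirrored pair: read both copies of every
# block; when they disagree with the stored checksum, rewrite the bad
# copy from the good one (a rewrite also lets the drive remap the
# sector). Real arrays do this in firmware/driver code.
def scrub(mirror_a, mirror_b, checksums):
    repaired = 0
    for block_id, digest in checksums.items():
        a, b = mirror_a[block_id], mirror_b[block_id]
        ok_a = hashlib.sha256(a).digest() == digest
        ok_b = hashlib.sha256(b).digest() == digest
        if ok_a and not ok_b:
            mirror_b[block_id] = a
            repaired += 1
        elif ok_b and not ok_a:
            mirror_a[block_id] = b
            repaired += 1
        elif not (ok_a or ok_b):
            raise IOError(f"block {block_id}: both copies bad, restore from backup")
    return repaired

a = {0: b"data"}; b = {0: b"dxta"}                 # one rotten copy
sums = {0: hashlib.sha256(b"data").digest()}
print(scrub(a, b, sums), "block(s) repaired; b is now", b[0])
```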

Solution 3:

Hard drives do not generally encode data bits as single magnetic domains -- hard drive manufacturers have always been aware that magnetic domains can flip, and they build error detection and correction into their drives.

If a bit flips, the drive contains enough redundant data that it can and will be corrected the next time that sector is read. You can see this if you check the SMART stats on the drive, as the 'Correctable error rate'.
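
If you want to see these counters on a Linux box, smartmontools will dump them; a quick (hypothetical device name, requires root) Python wrapper:

```python
import subprocess

# Dump a drive's SMART attribute table via smartmontools and pick out
# the error/ECC-related counters. Attribute names such as
# "Raw_Read_Error_Rate" or "Hardware_ECC_Recovered" vary by vendor.
out = subprocess.run(["smartctl", "-A", "/dev/sda"],
                     capture_output=True, text=True).stdout
for line in out.splitlines():
    if "Error" in line or "ECC" in line:
        print(line)
```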

Depending on the details of the drive, it should even be able to recover from more than one flipped bit in a sector. There will be a limit to the number of flipped bits that can be silently corrected, and probably another limit to the number that can still be detected as an error (even when there is no longer enough reliable data to correct them).
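
This "correct up to N errors, detect up to M" behaviour is a standard property of error-correcting codes. Here's a self-contained demo with an extended Hamming(8,4) code, which corrects any single-bit error and detects (but cannot correct) any double-bit error. Real drives use much stronger codes (Reed-Solomon / LDPC over whole sectors), but the behaviour is analogous:

```python
# SECDED demo: single-error-correct, double-error-detect with an
# extended Hamming(8,4) code.

def encode(nibble):
    """Encode 4 data bits [d1, d2, d3, d4] into 8 code bits."""
    d1, d2, d3, d4 = nibble
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    word = [p1, p2, d1, p3, d2, d3, d4]    # Hamming positions 1..7
    p0 = 0
    for bit in word:
        p0 ^= bit                          # overall parity bit
    return [p0] + word                     # position 0, then 1..7

def decode(code):
    """Return (status, data bits after any correction)."""
    c = code[:]
    s = 0
    for pos in range(1, 8):                # syndrome = XOR of set positions
        if c[pos]:
            s ^= pos
    parity_ok = sum(c) % 2 == 0
    if s == 0 and parity_ok:
        status = "no error"
    elif not parity_ok:                    # odd parity => single error
        if s:
            c[s] ^= 1                      # correct the flipped bit
        else:
            c[0] ^= 1                      # the parity bit itself flipped
        status = "corrected single-bit error"
    else:                                  # syndrome set, parity even
        status = "double-bit error detected (uncorrectable)"
    return status, [c[3], c[5], c[6], c[7]]

word = encode([1, 0, 1, 1])
flipped = word[:]
flipped[5] ^= 1                            # one bit flips: silently fixed
print(decode(flipped))
flipped[2] ^= 1                            # a second flip: detected only
print(decode(flipped))
```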

This all adds up to the fact that hard drives can automatically correct most errors as they happen, and can reliably detect most of the rest. You would have to have a large number of bit errors in a single sector, all occurring before that sector was next read, and the errors would have to be such that the internal error-detection codes see the result as valid data, before you would ever have a silent failure. It's not impossible, and I'm sure that companies operating very large data centres do see it happen (or rather, it occurs and they don't see it happen), but it's certainly not as big a problem as you might think.