How does RAID detect a faulty HD?
I have been reading up on RAID levels for the past 3 days and weighing up the pros and cons of hardware and software RAID controllers. I understand that RAID is not a backup solution, and I'm perfectly fine with that, though one question still remains.
How does a RAID controller, at any level from RAID 1 to RAID 6, actually detect that a hard disk drive is failing? My research shows that most hard disk drive manufacturers use ECC in their drive designs, which is supposed to protect against 1-bit failures and, to an extent, 3-bit failures.
Thinking this through: say you have RAID 1 with two identical hard disk drives. Data is read from drive 0 and, at the same time, from drive 1, but drive 1 reports an ECC read failure to the RAID controller.
Now the big question: with hardware RAID, what would the RAID controller do? It has received a signal from the hard disk that the read failed. It could report the hard disk drive as faulty and in need of replacement.
Or does the RAID controller seek to a different hard disk drive for the data until it gets a successful read? (Yes, a drive can report a correct read and the data can still be corrupted, and RAID does not check parity or ECC on read.)
I asked a NetApp engineer who was giving us a talk this very question. His answer, more or less, was:
Nobody reads the checksums on reads. There's no point. Reading a checksum means you have to read the entire slice plus checksum, then compute the checksum to verify you have the correct data, plus the orthogonal checksum if you are running RAID-6 or whatever. It is a total performance killer because it breaks the ability to randomly seek to totally different sectors on different disks at the same time. Similarly, almost nobody reads both sides of a mirror in RAID-1, because if you only read one side you can alternate which side of the mirror you read from and get faster throughput. And if you suddenly have a mismatch, which disk do you take as correct and which do you take as broken? All modern RAID systems depend on the on-disk controllers to signal the RAID controller that they are in distress (through SMART or the like), at which point that disk is almost always kicked out of the array. Checksums are used for rebuilding arrays, not for read-verification.
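To make the engineer's point concrete, here is a minimal sketch of single-parity (RAID-5 style) XOR arithmetic in Python. It is a toy model, not how any particular controller is implemented: the function names (xor_blocks, make_stripe, read_block, stripe_is_consistent, rebuild_missing) are made up for illustration, and a "stripe" is simply a list of equal-length byte strings. It shows that a normal read touches one block only, that verifying a stripe means reading every block in it, and that parity can regenerate a block once you already know which one is missing.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks followed by one XOR parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def read_block(stripe, index):
    """Normal read path in this toy model: one block, parity untouched."""
    return stripe[index]


def stripe_is_consistent(stripe):
    """Scrub-style check: needs EVERY block; data XOR parity must be all zero."""
    return not any(xor_blocks(stripe))


def rebuild_missing(stripe, missing_index):
    """Regenerate the block at missing_index from all surviving blocks."""
    survivors = [blk for i, blk in enumerate(stripe) if i != missing_index]
    return xor_blocks(survivors)


stripe = make_stripe([b"AAAA", b"BBBB", b"CCCC"])

# A drive was kicked out, so we know block 1 is gone: parity brings it back.
print(rebuild_missing(stripe, 1) == b"BBBB")

# Silent corruption: the plain read returns bad data without complaint.
corrupted = list(stripe)
corrupted[0] = b"AAAX"
print(read_block(corrupted, 0))

# A full-stripe check notices the mismatch, but not which block is wrong.
print(stripe_is_consistent(stripe), stripe_is_consistent(corrupted))
```

Note that the consistency check only says "something in this stripe is wrong". With a single parity block there is no way to tell which block is the bad one, which is the same "which disk do you take as correct" problem described for mirrors.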
The answer to the question is going to depend greatly on the RAID controller manufacturer and how they implemented error/failed drive detection.
There are various methods by which a RAID implementation can assess the "health" of a disk (SMART, SCSI "Check Condition" and "Sense Key" messages), but I'm not aware of any published "standard" as to how RAID implementations should act on that information. The specific steps that each make and model of RAID controller firmware (or, for that matter, a software RAID implementation in an OS) takes are going to vary depending on the manufacturer's design.
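As an illustration of what a "Sense Key" message carries, here is a rough Python sketch that pulls the sense key, ASC and ASCQ out of a raw SCSI sense buffer (fixed format, response codes 0x70/0x71, and descriptor format, 0x72/0x73). The byte offsets and sense key names follow the SCSI specification as I understand it. The suggest_action function is purely hypothetical: it is one policy a RAID implementation might choose, not what any real controller does, and the sample buffer is hand-built rather than captured from a drive.

```python
# Subset of the SCSI sense keys relevant to disk health.
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0xB: "ABORTED COMMAND",
}


def parse_sense(sense):
    """Return (sense_key, asc, ascq) from a raw sense buffer."""
    if len(sense) < 4:
        raise ValueError("sense buffer too short")
    response_code = sense[0] & 0x7F
    if response_code in (0x70, 0x71):  # fixed format
        if len(sense) < 14:
            raise ValueError("fixed-format sense buffer too short")
        return sense[2] & 0x0F, sense[12], sense[13]
    if response_code in (0x72, 0x73):  # descriptor format
        return sense[1] & 0x0F, sense[2], sense[3]
    raise ValueError("unknown sense response code")


def suggest_action(sense_key, asc):
    """HYPOTHETICAL policy, for illustration only; real firmware differs."""
    if sense_key == 0x3 and asc == 0x11:  # unrecovered read error
        return "serve the read from redundancy and count a media error"
    if sense_key == 0x4:
        return "mark the drive failed"
    if sense_key == 0x1:
        return "log it; the drive corrected the error itself"
    return "no array-level action"


# Hand-built fixed-format sample: MEDIUM ERROR, ASC 0x11 / ASCQ 0x00.
sample = bytearray(18)
sample[0] = 0x70   # response code: current error, fixed format
sample[2] = 0x03   # sense key in the low nibble
sample[7] = 0x0A   # additional sense length
sample[12] = 0x11  # ASC
sample[13] = 0x00  # ASCQ

key, asc, ascq = parse_sense(bytes(sample))
print(SENSE_KEYS.get(key, "OTHER"), hex(asc), hex(ascq))
print(suggest_action(key, asc))
```

The parsing half is mechanical. The policy half is exactly the part that is not standardized, so two controllers can see the same sense data and react differently.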
All hard disk drives use error correcting codes (ECC) today. At the data densities we're working at, bit errors are just a fact of life. Unrecoverable read errors are what matter to a RAID controller. At the level you're interested in, you'd have to have the design specs on both the RAID controller and the drive firmware to really understand how media errors would be reported up the device stack to the OS, and ultimately the user.
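To get a feel for why unrecoverable read errors matter at array scale, here is a back-of-the-envelope sketch in Python. It rests on two assumptions that are mine, not from any datasheet quoted here: an unrecoverable read error rate of one per 10^14 bits read (check your own drive's spec sheet for the real figure) and statistically independent errors, which real drives do not strictly obey. The name probability_of_ure is made up for this example.

```python
import math

# ASSUMPTION: illustrative rate of one unrecoverable error per 1e14 bits read.
ASSUMED_URE_RATE_PER_BIT = 1e-14


def probability_of_ure(bytes_read, bit_error_rate=ASSUMED_URE_RATE_PER_BIT):
    """Chance of at least one unrecoverable read error, assuming independence.

    Computes 1 - (1 - p)**n using log1p/expm1 so that a tiny p does not
    vanish in floating-point rounding.
    """
    bits = bytes_read * 8
    return -math.expm1(bits * math.log1p(-bit_error_rate))


# Reading a full drive end to end, as a rebuild has to do.
for terabytes in (1, 4, 12):
    chance = probability_of_ure(terabytes * 10**12)
    print(f"{terabytes} TB read: {chance:.1%} chance of at least one URE")
```

Under those assumptions, a rebuild that has to read every surviving drive in full has a noticeable chance of hitting at least one unrecoverable sector, and that chance grows with drive size.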