Why is mdadm unable to deal with an "almost failed" disk?

Multiple times in my career now I've come across mdadm RAID sets (RAID 1+0, 5, 6, etc.) in various environments (e.g. CentOS/Debian boxes, Synology/QNAP NASes) which appear to be simply unable to handle a failing disk: a disk that is not totally dead, but has tens of thousands of bad sectors and can barely handle I/O. It isn't completely dead, it's still kind of working, and the kernel log is typically full of UNC errors.

Sometimes SMART will identify the disk as failing; other times there are no symptoms other than slow I/O.

The slow I/O actually causes the entire system to freeze up. Connecting via ssh takes forever, the web GUI (if it is a NAS) usually stops working, and running commands over ssh takes forever as well. That is, until I disconnect the disk or purposely "fail" it out of the array; then things go back to "normal", or as normal as they can be with a degraded array.
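By "fail" I mean manually marking the member as faulty and removing it with mdadm, roughly like this (the array and device names are just placeholders):

# mdadm --manage /dev/md0 --fail /dev/sdc1
# mdadm --manage /dev/md0 --remove /dev/sdc1
# mdadm --detail /dev/md0

The last command is only there to confirm the array is now running degraded; the moment the flaky member is gone, the box becomes responsive again.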

I'm just wondering, if a disk is taking so long to read from or write to, why not just knock it out of the array, drop a message in the log and keep going? It seems that making the whole system grind to a halt because one disk is kinda screwy totally nullifies one of the main benefits of using RAID (fault tolerance: the ability to keep running if a disk fails). I can understand that in a single-disk scenario (e.g. your system has a single SATA disk connected and it is unable to execute reads/writes properly) this is catastrophic, but in a RAID set (especially the fault-tolerant "personalities") it seems not only annoying but also contrary to common sense.

Is there a very good reason the default behavior of mdadm is to basically cripple the box until someone remotes in and fixes it manually?


In general, a RAID, depending on the chosen RAID level, provides a different balance among the key goals: data redundancy, availability, performance and capacity.

Based on the specific requirements, it is the responsibility of the storage owner to decide which balance of these factors is the right one for the given purpose, in order to create a reliable solution.

The job of the chosen RAID solution (in this case we are talking about the software mdadm) is to ensure data protection first and foremost. With that in mind, it becomes clear that it is not the job of the RAID solution to weigh business continuity more heavily than data integrity.

To put it in other words: the job of mdadm is to avoid a failed RAID. As long as a "weirdly behaving disk" is not completely broken, it still contributes to the RAID.

So why not just knock a weirdly behaving disk out of the array, drop a message in the log and keep going? Because doing so would increase the risk of losing data.

I mean, you are right: at that moment, from a business perspective, it seems the better solution to just continue. In reality, however, the message which has been dropped into the log may remain unnoticed, so the degraded state of the RAID goes undetected. Some time later another disk in the same RAID eventually fails, and as a result the data stored on the now-failed RAID is gone.
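Whether that log message gets noticed at all is, again, up to the storage owner. mdadm itself offers a monitor mode that can send mail on events such as a degraded array; a minimal sketch (the mail address is only an example):

# mdadm --monitor --scan --daemonise --mail admin@example.com

or the equivalent MAILADDR line in /etc/mdadm/mdadm.conf. Without something like this watching the array, "drop a message in the log and keep going" silently turns a redundant array into a non-redundant one.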


In addition to that: it is hard to define exactly what a "weirdly behaving disk" is. Put the other way around: what is still acceptable operating behavior for a single disk operated within a disk array?

Some of us may answer "the disk shows some errors". Others may answer "as long as the errors can be corrected, all is fine". Others may answer "as long as the disk answers all commands within a given time, all is fine". Others say "when the disk temperature differs by more than 5°C from the average temperature of all disks within the same array". Another answer could be "as long as a scrub reveals no errors", or "as long as SMART does not show errors".

This is neither a long nor a complete list.

The point is that the definition of "acceptable behavior of a disk" is a matter of interpretation, and therefore also the responsibility of the storage owner, and not something that mdadm is supposed to decide on its own.
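To illustrate how owner-specific such a policy is: one admin might run a small cron script that flags a member disk once a couple of SMART attributes cross thresholds that the admin personally chose. A rough sketch; the device list and the limit of 10 sectors are arbitrary examples, not recommendations:

#!/bin/sh
# Flag any disk whose reallocated or pending sector count exceeds
# a locally chosen limit (both the limit and the device list are
# examples of a site-specific policy, nothing more).
LIMIT=10
for dev in /dev/sd[a-d]; do
    realloc=$(smartctl -A "$dev" | awk '/Reallocated_Sector_Ct/ {print $10}')
    pending=$(smartctl -A "$dev" | awk '/Current_Pending_Sector/ {print $10}')
    if [ "${realloc:-0}" -gt "$LIMIT" ] || [ "${pending:-0}" -gt "$LIMIT" ]; then
        echo "$dev exceeds the local bad-sector policy (realloc=$realloc, pending=$pending)"
    fi
done

Another admin would pick entirely different attributes or thresholds, which is exactly the point.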


The key issue is that a failing SATA disk drive can sometimes freeze the entire bus for the duration of its internal error recovery procedure. For this reason, in RAID arrays one should use only TLER-enabled drives (and preferably enterprise-grade models).

SAS drives suffer less from this issue, but are not absolutely free from it either.


In addition to what was said, I want to add my two cents, and this one is an important consideration.

What does a drive do when a sector is slow to read?

Presumably, drives that are designed to operate alone, e.g. typical "desktop" drives, assume there is no other way to retrieve the data stored in that bad sector. They will try to retrieve the data at all costs, retrying again and again for an extended period of time. Of course, they will also mark that sector as failing, so they will remap it the next time you write to it, but you must actually write to it for that to happen. Until you rewrite it, they will choke every time you read from that place. In a RAID setting this means that, as far as the RAID is concerned, the drive still works and there is no reason to kick it out, but for the application the array will slow down to a crawl.

On the other hand, "enterprise" drives, especially "branded" ones, often assume they are always used in a RAID setting. A "brand" controller, seeing a "branded" drive, might even notify its firmware of the RAID's presence. So the drive will give up early and report an I/O error, even if it would have been possible to make several more attempts and read the sector. The controller then has the chance to reply faster, mirroring the read request to a sibling drive (and kicking the bad one out of the array). When you pull out and thoroughly explore/test that kicked drive, you find no apparent problems: it was just slowed down for a moment, and that was enough to stop using it, according to the controller's logic.

I speculate this may be the only difference between "desktop" drives and "branded"/"enterprise" NL-SAS and SATA ones. This is probably why you pay three times more when you buy an "HPE" drive which was actually made by Toshiba, compared to buying the "Toshiba"-branded one.


However, some drives do support a generic way to control this. It is called SCT ERC, which stands for SMART Command Transport Error Recovery Control. This is how it looks in smartctl:

Unsupported:

# smartctl --all /dev/sda
=== START OF READ SMART DATA SECTION ===
SCT capabilities:              (0x3037) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

Supported:

=== START OF READ SMART DATA SECTION ===
...
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

If you are lucky, you can control this feature with smartctl. You may retrieve or set two timeouts: how long to keep retrying reads and how long to keep retrying writes:

# smartctl -l scterc /dev/sda
SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

# smartctl -l scterc /dev/sde
SCT Error Recovery Control:
           Read: Disabled
          Write: Disabled

# smartctl -l scterc /dev/sdd
Warning: device does not support SCT Error Recovery Control command

To set the values:

# smartctl -l scterc,120,60 /dev/sde

This means 120 tenths of a second (12 seconds) to retry reads and 60 tenths of a second (6 seconds) to retry writes. Zero means "retry until you die". Different drives have different default settings for this.

So, if you use a "RAID edition" drive alone, you'd better set the ERC timeouts to zero, or you may lose data. On the other hand, if you use drives in a RAID, you'd better set some reasonably low non-zero value.
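In practice that looks something like this (device names are just examples; note also that on many drives the scterc setting is volatile, so it has to be re-applied after every power cycle, e.g. from a boot or udev script):

For a drive that is a RAID member, give up quickly and let the array take over:

# smartctl -l scterc,70,70 /dev/sdb

For a "RAID edition" drive used on its own, let it keep trying as long as it wants:

# smartctl -l scterc,0,0 /dev/sdb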

Source by amarao @ Habrahabr, in Russian.

P.S. And a note about SAS: use sdparm, which supports more controls of this kind.
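On a SAS drive the equivalent knobs live in the Read-Write Error Recovery mode page, which sdparm can display; a sketch (the device name is a placeholder):

# sdparm --all --long /dev/sdb

Look for the "Read write error recovery" mode page in the output; the recovery time limit field there plays the same role as SCT ERC, and on drives that allow it the value can be changed with sdparm --set.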


I've had situations where a disk has failed to work, but has taken out the controller in some way.

Historically this was possible with PATA, where the master and slave drives were on the same cable, and one drive failing could interfere with access to the other, still-good drive. Removing the bad drive could re-enable access to the remaining drive, or it might need a power cycle, but the RAID could come up degraded and then recovery could start.

SATA is less vulnerable to this, but it's still possible for the controller to be affected. In my experience with software RAID, more of the gory innards are exposed that a fancier dedicated RAID controller would hide.

I've not experienced this with SAS or NVMe, but SAS often means hardware RAID controllers that have more disk-handling smarts internally.