How can a single disk in a hardware SATA RAID-10 array bring the entire array to a screeching halt?

I hate to say "don't use SATA" in critical production environments, but I've seen this situation quite often. SATA drives are not generally meant for the duty cycle you describe, although you did spec drives specifically rated for 24x7 operation. My experience has been that SATA drives can fail in unpredictable ways, often affecting the entire storage array, even when using RAID 1+0 as you've done. Sometimes the drives fail in a manner that can stall the entire bus. One thing to check is whether you're using SAS expanders in your setup; that can make a difference in how the remaining disks are affected by a drive failure.

But it may have made more sense to go with midline/nearline (7,200 RPM) SAS drives instead of SATA. There's a small price premium over SATA, but the drives operate and fail more predictably: the error correction and reporting in the SAS interface/protocol are more robust than SATA's. So even with drives whose mechanics are the same, the SAS protocol difference may have prevented the pain you experienced during your drive failure.


How can a single disk bring down the array? The answer is that it shouldn't, but it depends on what is actually causing the outage. If the disk died in a well-behaved way, it shouldn't take the array down. But it's possible that it's failing in an "edge case" way that the controller can't handle.

Are you naive to think this shouldn't happen? No, I don't think so. A hardware RAID card like that should have handled most issues.

How to prevent it? You can't anticipate weird edge cases like this; that's part of being a sysadmin. But you can work on recovery procedures so they don't impact your business. The only ways to try to fix this right now are to try another hardware RAID card (probably not what you want to do) or to swap your drives for SAS drives and see whether they behave more robustly. You can also contact the vendor of the RAID card, tell them what happened, and see what they say; they are, after all, a company that's supposed to specialize in knowing the ins and outs of wonky drive electronics. They may have more technical advice on how the drives work, as well as on reliability, if you can get to the right people to talk to.

Have you missed something? If you want to verify that the drive is having an edge-case failure, pull it from the array. The array will be degraded, but you shouldn't see any more of the weird slowdowns and errors (aside from the degraded-array status). You say it seems to be working fine right now, but if it's throwing disk read errors, you should replace the drive while you can. High-capacity drives can sometimes have UREs (unrecoverable read errors) that don't show up until another drive has failed (the best reason not to run RAID 5, as a side note). And if you're seeing edge-case behavior from that one drive, you don't want corrupted data migrating to the other drives in the array.
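If you want something concrete to look at before pulling the drive, here's a minimal sketch that flags drives whose reallocated/pending-sector counters are non-zero. The /dev/sda..sdh names and the attribute list are assumptions, not your exact setup; it needs root and smartmontools, and behind a 3ware card you'd use smartctl's -d 3ware,N addressing instead of plain /dev/sdX:

```python
#!/usr/bin/env python3
"""Flag drives whose SMART counters suggest pending or unreadable sectors."""
import subprocess

DEVICES = [f"/dev/sd{c}" for c in "abcdefgh"]   # assumption: adjust to your members
WATCH = {"Reallocated_Sector_Ct", "Current_Pending_Sector",
         "Offline_Uncorrectable", "Reported_Uncorrect"}

for dev in DEVICES:
    out = subprocess.run(["smartctl", "-A", dev],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        fields = line.split()
        # Attribute rows look like: ID# NAME FLAG VALUE WORST THRESH TYPE ... RAW_VALUE
        if len(fields) >= 10 and fields[1] in WATCH and fields[-1].isdigit():
            if int(fields[-1]) > 0:
                print(f"{dev}: {fields[1]} raw={fields[-1]} -- candidate for replacement")
```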


I'm not an expert, but I'm going to take a wild shot in the dark on the basis of my experience with RAID controllers and storage arrays.

Disks fail in many different ways. Unfortunately, disks can fail, or be faulty, in ways that seriously affect their performance but that the RAID controller doesn't recognize as a failure.

If a disk fails in an obvious way, any RAID controller software should be pretty good at detecting the lack of response from the disk, removing it from the pool, and firing off notifications. My guess as to what's happening here, though, is that the disk is suffering an unusual failure that, for some reason, isn't triggering a failure on the controller side. So when the controller flushes a write to, or reads from, the affected disk, the operation takes a long time to come back, which in turn hangs the whole I/O operation and therefore the array. For whatever reason, this isn't enough for the RAID controller to say "ah, failed disk", probably because the data does come back eventually.
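If you want to put numbers on that "taking a long time to come back", a rough sketch like the one below reads /proc/diskstats twice and reports average milliseconds per completed I/O per disk. The sdX names are an assumption; note that many hardware RAID cards only expose the logical unit to the OS, in which case you'll see the stall on the unit but not which member caused it:

```python
#!/usr/bin/env python3
"""Spot a slow member (or a stalling unit) via average I/O latency."""
import time

DISKS = {f"sd{c}" for c in "abcdefgh"}  # assumed device names; adjust to your system

def snapshot():
    stats = {}
    with open("/proc/diskstats") as f:
        for line in f:
            p = line.split()
            if p[2] in DISKS:
                ios = int(p[3]) + int(p[7])    # reads + writes completed
                ms = int(p[6]) + int(p[10])    # ms spent reading + writing
                stats[p[2]] = (ios, ms)
    return stats

before = snapshot()
time.sleep(10)                                  # sample window
after = snapshot()

for name in sorted(DISKS & before.keys() & after.keys()):
    d_ios = after[name][0] - before[name][0]
    d_ms = after[name][1] - before[name][1]
    if d_ios:
        print(f"{name}: {d_ms / d_ios:.1f} ms per I/O over {d_ios} I/Os")
    else:
        print(f"{name}: idle during sample")
```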

My advice would be to replace the failed disk immediately. After that, I'd take a look at the configuration of your RAID card (it's 3ware; I thought they were pretty good) and find out what it considers a failed disk to be.
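For the 3ware side, something along these lines is how I'd poll it. The /c0 controller name, the /dev/twa0 device node, and the 8-port assumption are all guesses about your setup, so adjust before use; it also needs root, tw_cli, and smartmontools installed:

```python
#!/usr/bin/env python3
"""Poll a 3ware controller's view of the array plus SMART behind each port."""
import subprocess

def run(cmd):
    return subprocess.run(cmd, capture_output=True, text=True).stdout

# Controller view: unit state (OK / DEGRADED / REBUILDING) and per-port drive status.
print(run(["tw_cli", "/c0", "show"]))

# SMART view of each physical drive, addressed through the controller.
for port in range(8):                       # assumed 8-port card
    print(f"--- port {port} ---")
    print(run(["smartctl", "-H", "-A", "-d", f"3ware,{port}", "/dev/twa0"]))
```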

P.S. Nice idea, importing SMART data into Cacti.


Just a guess: the hard disks are configured to retry on read errors rather than report an error promptly. While this is desirable behaviour in a desktop setting, it is counterproductive in a RAID, where the controller should simply reconstruct any sector that fails to read from the other disks and rewrite it, so the drive can remap it.
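For what it's worth, on drives that support SCT ERC you can read (and often cap) that retry timeout with smartctl. A rough sketch, assuming the drives support SCT ERC and show up as /dev/sdX (add -d 3ware,N if they sit behind the controller), and bearing in mind that many drives forget the setting after a power cycle, so it has to be reapplied at boot:

```python
#!/usr/bin/env python3
"""Check (and optionally cap) each drive's error-recovery timeout via SCT ERC."""
import subprocess

DEVICES = [f"/dev/sd{c}" for c in "abcdefgh"]   # assumption: adjust to your members
SET_TIMEOUT = False                              # flip to True to apply a 7-second cap

for dev in DEVICES:
    # Report the current read/write error-recovery timeouts (if supported).
    print(subprocess.run(["smartctl", "-l", "scterc", dev],
                         capture_output=True, text=True).stdout)
    if SET_TIMEOUT:
        # 70 = 7.0 seconds (units of 100 ms), for both reads and writes.
        subprocess.run(["smartctl", "-l", "scterc,70,70", dev])
```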


My shot in the dark:

  • Drive 7 is failing; it has windows during which it's not available.

  • Drive 8 has some "lighter" errors too, which are corrected by retrying.

  • RAID 10 is usually "a RAID 0 of several RAID 1 pairs"; are drives 7 and 8 members of the same pair?

If so, it seems you hit the "shouldn't happen" case of a two-disk failure on the same pair, which is almost the only thing that can kill a RAID 10. Unfortunately, it can happen if all your drives are from the same shipping lot, since they're then slightly more likely to die around the same time.
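If you want to check the same-lot theory, dumping model, firmware, and serial per member is usually enough to eyeball it (near-consecutive serials and identical firmware are the usual hint). A quick sketch, again assuming smartmontools and OS-visible /dev/sdX names (or -d 3ware,N through the controller):

```python
#!/usr/bin/env python3
"""List model, firmware, and serial for each array member."""
import re
import subprocess

DEVICES = [f"/dev/sd{c}" for c in "abcdefgh"]   # assumption: adjust to your setup

for dev in DEVICES:
    out = subprocess.run(["smartctl", "-i", dev],
                         capture_output=True, text=True).stdout
    info = {}
    for key in ("Device Model", "Serial Number", "Firmware Version"):
        m = re.search(rf"^{key}:\s*(.+)$", out, re.MULTILINE)
        info[key] = m.group(1).strip() if m else "?"
    print(f"{dev}: {info['Device Model']}  fw={info['Firmware Version']}  sn={info['Serial Number']}")
```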

My guess is that during a drive 7 failure window, the controller redirected all reads to drive 8, so each error retry there added big delays, triggering an avalanche of frozen tasks and killing performance for a while.

You're lucky that drive 8 doesn't seem to be dead yet, so you should be able to recover without data loss.

I'd start by replacing both drives, and don't forget to check the cabling. A loose connection could cause this, and if the cables aren't routed firmly, it's more likely to happen on adjacent drives. Also, some multiport cards have several two-port connectors; if drives 7 and 8 are on the same one, it might be the source of your trouble.