Can a RAID 4 disk setup crash if only 1 hard disk fails? [closed]

I am a web developer. I don't have much experience with hardware, which is why I use managed servers.

This morning, one of the drives in our setup failed, and the entire site went down with it. I asked my web host what happened, and he replied that the hard disk failed in such a way that the RAID controller couldn't work properly. The array was set up as RAID 4.

Have you ever seen that before? Is that possible?

Thanks for any help. I need to know whether my web host is being honest with me.


Solution 1:

More likely than not, your provider is using hard drives that are not meant to be used in a RAID array. Normal consumer SATA drives fall into this category.

The likely problem is that the drive started experiencing Uncorrectable Read Errors (UREs). When this happens on a consumer drive, the drive sits there and retries the read operation (usually for 30-60 seconds) before it gives up, and the RAID controller waits those same 30-60 seconds for the drive to report the error. So a simple request for a few sectors can easily cause the server to grind to a halt while the failing drive grinds through those read-retry operations.
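To make the failure mode concrete, here is a rough sketch (Python, purely illustrative; the latency figures are assumptions, not measurements) of why one drive stuck in error recovery stalls every read that touches it:

```python
# Hypothetical model, not a real controller: a striped read only completes
# when the slowest member drive answers.

# Illustrative per-drive "time to answer a read" in seconds (assumed values).
HEALTHY_DRIVE_LATENCY = 0.01     # a good read returns almost instantly
CONSUMER_RETRY_WINDOW = 45.0     # consumer drive grinding on a bad sector
TLER_RETRY_WINDOW = 7.0          # TLER/ERC drive gives up and reports quickly

def array_read(member_latencies):
    """The read is gated by the slowest drive it depends on."""
    return max(member_latencies)

# A stripe that touches the failing consumer drive:
print(array_read([HEALTHY_DRIVE_LATENCY, HEALTHY_DRIVE_LATENCY, CONSUMER_RETRY_WINDOW]))
# -> ~45 s per affected request

# The same stripe if the drive had a short error-recovery limit:
print(array_read([HEALTHY_DRIVE_LATENCY, HEALTHY_DRIVE_LATENCY, TLER_RETRY_WINDOW]))
# -> ~7 s, and the controller gets an error it can act on
```

With stalls of tens of seconds on a busy web server, requests pile up fast enough that the whole site appears to be down even though the array is technically still online.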

Drives that are meant for RAID arrays support Time-Limited Error Recovery (TLER; other vendors call the same feature ERC or CCTL) on SATA drives. TLER reports failures back to the controller quickly, so that the controller can respond to them intelligently (mostly intelligently; hopefully). SCSI and SAS drives work somewhat differently: the SCSI command set allows the controller to specify various recovery-effort limits on the drive (MODE SELECT: RW ERR RECOVERY). A RAID controller should set the drives to fail quickly; it can then test whether a drive thinks it is working properly with the TUR (Test Unit Ready) command and fail the drive out of the array if there is a check condition.
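If you want to see what your own SATA drives do, smartmontools exposes the SCT Error Recovery Control timers. A minimal sketch, assuming smartctl is installed, you run it as root, and the drive actually supports SCT ERC (many cheap consumer drives don't):

```python
# Check (and optionally set) SCT Error Recovery Control via smartctl.
# Assumptions: smartmontools installed, root privileges, drive supports SCT ERC.

import subprocess

def get_erc(device):
    """Print the drive's current read/write error-recovery timeouts."""
    subprocess.run(["smartctl", "-l", "scterc", device], check=True)

def set_erc(device, deciseconds=70):
    """Ask the drive to give up after deciseconds/10 seconds (70 = 7.0 s),
    which is the TLER-like behaviour a RAID controller wants."""
    subprocess.run(
        ["smartctl", "-l", f"scterc,{deciseconds},{deciseconds}", device],
        check=True,
    )

if __name__ == "__main__":
    get_erc("/dev/sda")       # check first
    # set_erc("/dev/sda")     # uncomment to request 7 s read/write recovery limits
```

The values are in tenths of a second; 70 (7 seconds) is a common choice for drives sitting behind a RAID controller. On many drives the setting does not persist across a power cycle, so it has to be reapplied at boot.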

Solution 2:

Yes, this is possible, even in scenarios where you would think the array should have survived the failure.

Some possibilities as to why an array fails:

  • More drives failed than could be sustained by the RAID mode (a worst-case calculator is sketched after this list). For example:
    • RAID 0 (striping) cannot survive any drive failure.
    • RAID 1 can survive failures of all but 1 drive.
    • RAID 4/5 can survive 1 drive failure.
    • RAID 6 can survive 2 drive failures.
    • RAID 10 can survive the failure of up to 50% of the drives, depending on which drives fail.
  • A bug in the RAID software or controller firmware.
  • User error.
    • Someone pulled too many drives.
    • Someone pulled a drive and never replaced it, and another drive subsequently failed.
    • The array was not monitored, allowing more drives to fail than could be survived.
  • Cheap controllers paired with consumer-grade drives are commonly known to fail even in otherwise survivable scenarios.
    • A consumer-level drive will retry a bad sector almost indefinitely until it gets a good read, and a cheap controller will wait almost indefinitely for such a drive to return a result. The wait can be so long that the operating system gives up. Then, on reboot, the drives don't respond quickly enough to the controller and the array is assumed to have failed.
    • On the other hand, an enterprise-level drive will give up quickly, allowing the controller to pull the data from another drive. Likewise, a good controller will mark a drive that takes too long to respond as failed and move on.
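For reference, the survivability rules from the list above can be written down as a tiny worst-case calculator (Python, illustrative only; it ignores hot spares and assumes a healthy, fully populated array):

```python
def max_survivable_failures(level, n_drives):
    """Worst-case number of drive failures an array can absorb and still run."""
    if level == 0:
        return 0                      # striping: any failure kills the array
    if level == 1:
        return n_drives - 1           # mirror: one surviving copy is enough
    if level in (4, 5):
        return 1                      # single parity (dedicated or distributed)
    if level == 6:
        return 2                      # dual parity
    if level == 10:
        return 1                      # worst case: both halves of one mirror;
                                      # best case is up to half the drives
    raise ValueError(f"unknown RAID level: {level}")

# The asker's RAID 4 array should, on paper, have survived one failed drive:
print(max_survivable_failures(4, n_drives=4))   # -> 1
```

On paper, RAID 4 tolerates a single failed drive, which is exactly why one disk taking the whole site down points at one of the other causes above (drive behaviour, controller quality, lack of monitoring) rather than at the RAID level itself.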