RAID-5: Two disks failed simultaneously?

We have a Dell PowerEdge T410 server running CentOS, with a RAID-5 array containing 5 Seagate Barracuda 3 TB SATA disks. Yesterday the system crashed (I don't know how exactly and I don't have any logs).

Upon booting up into the RAID controller BIOS, I saw that out of the 5 disks, disk 1 was labeled as "missing," and disk 3 was labeled as "degraded." I forced disk 3 back up, and replaced disk 1 with a new hard drive (of the same size). The BIOS detected this and began rebuilding disk 1 - however it got stuck at 1%. The spinning progress indicator did not budge all night; totally frozen.

What are my options here? Is there any way to attempt rebuilding, besides using some professional data recovery service? How could two hard drives fail simultaneously like that? Seems overly coincidental. Is it possible that disk 1 failed, and as a result disk 3 "went out of sync?" If so, is there any utility I can use to get it back "in sync?"


Solution 1:

I see you have already accepted a bad answer, so forgive my heretical opinion (which has already saved arrays like this multiple times).

Your second "failed" disk probably has only a minor problem, maybe a single bad block. That is most likely why your controller's crude RAID 5 resync choked on it.

You could make a sector-level copy of it with a low-level disk cloning tool (gddrescue, for example, is very useful here) and use the clone as your new disk 3. In that case your array survives with only minor data corruption.
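
A minimal sketch of such a clone, assuming the failing original appears as /dev/sdc and a blank disk of at least the same size as /dev/sdd (hypothetical device names; verify them with lsblk or smartctl before copying anything):

    # On many distros the gddrescue package installs the binary as plain "ddrescue".
    # First pass: copy everything that reads cleanly, skipping the slow scraping of bad areas
    ddrescue -f -n /dev/sdc /dev/sdd rescue.map

    # Second pass: return only to the bad areas recorded in the map file and retry them
    ddrescue -f -r3 /dev/sdc /dev/sdd rescue.map

Only the sectors that never read back are lost; the clone then takes the original's place as disk 3 in the array.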

Unfortunately it is probably too late by now, because the essence of the orthodox answer in this case is: "multiple failures in a RAID 5, here comes the apocalypse!"

If you want a really good, redundant RAID, use Linux software RAID. Its superblock layout, for example, is public and documented... Again, I am sorry for this second heretical opinion.
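
For example (a sketch, with hypothetical member partitions /dev/sdb1 through /dev/sdf1), you can read that metadata and force a marginally out-of-sync member back into the array yourself:

    # Print the md superblock of one member: RAID level, UUID, and the event counter
    # that shows how far out of sync it is
    mdadm --examine /dev/sdb1

    # Force-assemble the array from members whose event counters nearly match
    mdadm --assemble --force /dev/md0 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1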

Solution 2:

You have a double disk failure. This means your data is gone, and you will have to restore from backup. This is why RAID 5 is not supposed to be used on large disks. Set up your RAID so it can always withstand two disk failures, especially with large, slow disks.
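
Rough arithmetic behind that rule of thumb, assuming the common consumer-SATA spec of one unrecoverable read error (URE) per 1e14 bits read: rebuilding this 4+1 array means reading the four surviving 3 TB disks end to end.

    # Back-of-the-envelope chance of hitting at least one URE during the rebuild
    awk 'BEGIN {
        bits_read = 4 * 3e12 * 8            # four surviving 3 TB disks, in bits
        p = 1 - exp(-bits_read * 1e-14)     # Poisson approximation, 1 URE per 1e14 bits
        printf "P(rebuild hits a URE) ~ %.0f%%\n", p * 100
    }'

That works out to roughly 60%, so a RAID 5 rebuild on disks this size is more likely than not to trip over a read error before it finishes.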

Solution 3:

Your options are:

  1. Restoring from backups.
    • You do have backups, don't you? RAID is not a backup.

  2. Professional data recovery
    • It's possible, though very expensive and not guaranteed, that a professional recovery service will be able to recover your data.

  3. Accepting your data loss and learning from the experience.
    • As noted in the comments, large SATA disks are not recommended for a RAID 5 configuration because the chance of hitting a second failure (often an unrecoverable read error) during the long rebuild is high enough to take down the whole array.
      • If it must be parity RAID, RAID 6 is better, and next time use a hot spare as well (see the sketch after this list).
      • SAS disks are better for a variety of reasons, including higher reliability, better resilience, and lower rates of unrecoverable read errors (UREs).
    • As noted above, RAID is not a backup. If the data matters, make sure it's backed up, and that your backups are restore-tested.
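
To illustrate the RAID 6 plus hot spare suggestion above, here is a minimal sketch using Linux software RAID (hypothetical device names; a hardware controller would be configured to the same effect through its own BIOS or CLI):

    # RAID 6 over five members plus one hot spare (six disks total, hypothetical names)
    mdadm --create /dev/md0 --level=6 --raid-devices=5 --spare-devices=1 \
          /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1

    # Record the array layout so it assembles consistently at boot (path used on CentOS)
    mdadm --detail --scan >> /etc/mdadm.conf

With RAID 6 the array survives two failed members, and the hot spare starts rebuilding immediately instead of waiting for someone to notice.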

Solution 4:

Simultaneous failure is possible, even probable, for the reasons others have given. The other possibility is that one of the disks had failed some time earlier, and you weren't actively checking it.

Make sure your monitoring will promptly pick up a RAID volume running in degraded mode. Perhaps you had no other way to find out this time, but it's never good to have to learn these things from the controller BIOS.
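
As a sketch of what that monitoring can look like with Linux software RAID (the mail address is a placeholder; a hardware controller like the one in this Dell needs the vendor's management tools to raise the same alerts):

    # Quick manual check: a degraded array shows up as [U_UUU] or similar in mdstat
    cat /proc/mdstat
    mdadm --detail /dev/md0

    # Run the md monitor as a daemon and mail an alert as soon as an array degrades
    mdadm --monitor --scan --daemonise --mail=admin@example.com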