How do I troubleshoot my RAID array?

Solution 1:

Since the failed drive (marked with (F)) no longer shows up in the output of cat /proc/mdstat, the server must have been rebooted after the array became degraded.

You can get more detail with mdadm --detail /dev/md0. That will probably tell you which other drive belongs in the array.
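As a rough sketch of that first look: the helper name degraded_arrays is my own invention, and the awk pattern assumes the usual [UU] / [_U] status strings in /proc/mdstat. It prints every array that is missing a member.

```shell
#!/bin/sh
# Print the name of every md array whose status string contains "_",
# i.e. an array that is missing at least one member.
# Reads /proc/mdstat by default; pass a file name to check saved output.
degraded_arrays() {
    awk '
        /^md[^ ]* :/      { name = $1 }     # e.g. "md0 : active raid1 sdc6[1]"
        /\[[U_]*_[U_]*\]/ { print name }    # e.g. "... blocks [2/1] [_U]"
    ' "${1:-/proc/mdstat}"
}

# Typical use (mdadm needs root):
#   degraded_arrays
#   mdadm --detail /dev/md0
```

An underscore in the status string stands for a missing member, so [_U] on a two-disk mirror means the first slot is empty.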

To respond to your edit:

I would analyze /dev/sdb first. Use smartctl -a /dev/sdb to check (especially) the reallocated sector count and the error log. Run a self-test with smartctl -t long /dev/sdb. You can also scan the surface with badblocks.
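Put together, those checks could look something like this. It is only a sketch: the run and check_disk helpers and the DRY_RUN variable are names I made up so you can preview the commands before running them as root.

```shell
#!/bin/sh
# Run the command, or only print it when DRY_RUN is set (handy for review).
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Health check for one disk; the device defaults to /dev/sdb as in this thread.
check_disk() {
    disk="${1:-/dev/sdb}"
    run smartctl -a "$disk"            # attributes (incl. Reallocated_Sector_Ct) and error log
    run smartctl -t long "$disk"       # starts the extended self-test; it runs in the background
    run smartctl -l selftest "$disk"   # self-test log; look again once the test has finished
    run badblocks -sv "$disk"          # read-only surface scan (non-destructive by default)
}

# Preview first, then run for real as root:
#   DRY_RUN=1 check_disk /dev/sdb
#   check_disk /dev/sdb
```

The long self-test can take hours on a large disk, and a badblocks scan running at the same time will slow it down, so you may prefer to run them one after the other.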

Then:

  • If you replace /dev/sdb, copy the partition table from /dev/sdc. If the disks use MBR rather than GPT, you can use sfdisk -d /dev/sdc | sfdisk /dev/sdb. If they are GPT, you can use gdisk to save the partition table to a file and then load it onto the new disk. The load option is tucked away in the recovery and transformation menu.
  • Something general to consider: if your (new) drive has 4k sectors, make sure the partitions are 4k aligned.
  • If you're going to re-add your existing /dev/sdb, you may want to run mdadm --zero-superblock on each of its existing partitions first.
  • Then you can run mdadm --manage /dev/md0 --add /dev/sdb6, and do the same for md1 with sdb7.
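The steps in this list can be sketched as a few shell helpers. The function names and the DRY_RUN switch are hypothetical, and the device names are the ones from this thread (sdc healthy, sdb being replaced), so adjust them to your system. The sfdisk path only applies to MBR disks.

```shell
#!/bin/sh
# Run the command, or only print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Copy an MBR (non-GPT) partition table from the healthy disk to the new one.
# DESTRUCTIVE for the target disk: double-check which disk is which.
copy_mbr_table() {
    good="$1"
    new="$2"
    if [ -n "${DRY_RUN:-}" ]; then
        echo "sfdisk -d $good | sfdisk $new"
    else
        sfdisk -d "$good" | sfdisk "$new"
    fi
}

# A partition is 4k-aligned when its start, counted in 512-byte sectors,
# is a multiple of 8 (8 * 512 = 4096).
is_4k_aligned() {
    [ $(( $1 % 8 )) -eq 0 ]
}

# Clear old md metadata from a partition, then add it to the array.
# On a brand-new disk --zero-superblock just reports that there is nothing to clear.
readd_partition() {
    array="$1"
    part="$2"
    run mdadm --zero-superblock "$part"
    run mdadm --manage "$array" --add "$part"
}

# Example session (as root):
#   copy_mbr_table /dev/sdc /dev/sdb
#   is_4k_aligned "$(cat /sys/block/sdb/sdb6/start)" || echo "sdb6 is not 4k-aligned"
#   readd_partition /dev/md0 /dev/sdb6
#   readd_partition /dev/md1 /dev/sdb7
#   cat /proc/mdstat    # shows the rebuild progress
```

Running everything once with DRY_RUN=1 is a cheap way to confirm you have not swapped sdb and sdc.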

Needless to say, some of these commands will wipe out your data if you mix up your drives. So be certain which disk is sdb and which is sdc.

Edit: about bad blocks: If any software-level tool sees bad blocks, the drive is busted. Normally, disks hide them by transparently reallocating them on write. Google for 'hard drive sector reallocation'. Your smartctl -a output should show reallocated sectors for sdb. So yes, your sdb has been kicked out of the array and you need to replace it.

Edit: about the smartctl -a output. There are two things in there that are of primary importance:

  • It shows 60 reallocated sectors. Even though the normalized value is still 99 and would only officially count as 'bad' once it dropped to 36 (it counts down), you shouldn't trust a disk that has started reallocating sectors. The raw value is the one that matters, especially if it keeps changing. You can even configure smartd to monitor it for you.
  • The error log shows entries at age 42372 hours. You can tell that was recent by comparing it with attribute 9 (in your case), Power_On_Hours. Harmless things can cause SMART error log entries, such as invalid ATA commands, but in this case, because you have a degraded array, it's likely they are related.
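If you want to pull those two numbers out of the attribute table yourself, something like the following should do. The helper names are mine, and I'm assuming the standard smartctl -A table layout, where the raw value is the tenth column.

```shell
#!/bin/sh
# Both filters read `smartctl -A` output on stdin.

# Raw value of attribute 5 (Reallocated_Sector_Ct).
reallocated_sectors() {
    awk '$2 == "Reallocated_Sector_Ct" { print $10 }'
}

# Raw value of attribute 9 (power-on hours).
power_on_hours() {
    awk '$1 == "9" { print $10 }'
}

# Hours between an error-log entry and now:
#   $1 = current power-on hours, $2 = "power-on lifetime" of the logged error.
hours_since_error() {
    echo $(( $1 - $2 ))
}

# Example (as root):
#   smartctl -A /dev/sdb | reallocated_sectors
#   hours_since_error "$(smartctl -A /dev/sdb | power_on_hours)" 42372
```

A small difference in the last command means the logged errors happened recently, which fits with the drive having just dropped out of the array. Some drives report power-on time in a different format, in which case the subtraction will not work as written.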

As for determining which physical disk it is in your system: running dmesg | grep -i sdb, for example, will help. You probably have three disks in your system, and sdb is the one on your second SATA port, which may be labeled 1 or 2 depending on whether the numbering is zero-based or one-based.
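Another way to be sure you pull the right disk, which I'm adding as a suggestion of my own, is to match the serial number against the label on the drive. A minimal sketch, with a made-up helper name:

```shell
#!/bin/sh
# Extract the serial number from `smartctl -i` output given on stdin.
disk_serial() {
    awk -F': *' '/^Serial Number/ { print $2 }'
}

# Example (as root):
#   dmesg | grep -i sdb                      # kernel messages mentioning the disk
#   ls -l /dev/disk/by-id/ | grep 'sdb$'     # persistent names, usually model + serial
#   smartctl -i /dev/sdb | disk_serial       # compare with the sticker on the drive
```

The serial is printed on the drive label, so this works even when the port numbering on the board is ambiguous.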

Because you likely boot from sda, you can just replace sdb and perform the operations I outlined above. If your boot drive is the broken one, you'd better hope that you have:

  • GRUB installed on the other disk(s) as well.
  • A server that can actually boot from another disk.

The other day, a Dell server refused to boot from sdb while there was a blank sda in it. That took some convincing and improvising.
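To avoid that situation, you can put the boot loader on every disk ahead of time. This sketch assumes a BIOS/MBR setup with GRUB 2 where the command is called grub-install (some distributions name it grub2-install); the helper names are hypothetical.

```shell
#!/bin/sh
# Run the command, or only print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Install the GRUB boot code on every disk given, so any of them can boot.
install_grub_on() {
    for disk in "$@"; do
        run grub-install "$disk"
    done
}

# Example (as root), once the replacement disk has been partitioned:
#   install_grub_on /dev/sda /dev/sdb
```

Whether the machine will then actually boot from the second disk still depends on the BIOS or controller, as the Dell story shows.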

Sometimes you need to translate names like ata1.01 to real device names. For example, failing disks will produce kernel errors saying 'ATA exception on ata1.01' or words to that effect. Read this answer for that. (I configured our central logging system to warn me of those kernel errors, because they are a reliable indication of pending disk failure.)
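Independently of the linked answer, here is one way I'd sketch the translation. It assumes a reasonably recent kernel where the sysfs path of a SATA disk contains its ataN port; the function names are my own. The number after the dot in ata1.01 identifies the device on that port.

```shell
#!/bin/sh
# Print the libata port name (e.g. "ata2") found in a sysfs device path.
ata_port_of_path() {
    printf '%s\n' "$1" | sed -n 's|.*/\(ata[0-9][0-9]*\)/.*|\1|p'
}

# Print each sdX device together with the ATA port it hangs off.
list_ata_ports() {
    for dev in /sys/block/sd*; do
        [ -e "$dev" ] || continue
        printf '%s %s\n' "${dev##*/}" "$(ata_port_of_path "$(readlink -f "$dev")")"
    done
}

# Example:
#   list_ata_ports
```

Disks that are not attached through libata (USB, SAS and so on) will show up with an empty port column.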