Simple mdadm RAID 1 not activating spare

Solution 1:

The following command simply chucks the drive into the array without actually doing anything with it, i.e. it becomes a member of the array but is not active in it. By default, this turns it into a spare:

sudo mdadm /dev/md0 --add /dev/sdb1
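
If you want to confirm that the drive really landed as a spare before going further, either of these should show it marked as such (the device names here are just the examples from above):

sudo mdadm --detail /dev/md0
cat /proc/mdstat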

If you have a spare, you can bring it into service by forcing the array's active drive count to grow. With 3 drives and 2 expected to be active, you would need to increase the active count to 3.

mdadm --grow /dev/md0 --raid-devices=3

The RAID driver will notice that you are "short" a drive and look for a spare. Finding the spare, it will integrate it into the array as an active drive. Open a spare terminal and let this rather crude command line run in it to keep tabs on the re-sync progress. Be sure to type it as one line or use the line-continuation (\) character; once the rebuild finishes, just press Ctrl-C in the terminal.

while true; do sleep 60; clear; sudo mdadm --detail /dev/md0; echo; cat /proc/mdstat; done
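
If it is available on your system, watch(1) does the same job a bit more neatly; this is only an alternative to the loop above, nothing the resync actually requires:

watch -n 60 'sudo mdadm --detail /dev/md0; echo; cat /proc/mdstat'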

Your array will now have two active drives that are in sync, but because there are not 3 drives, it will not be 100% clean. Remove the failed drive, then resize the array. Note that the --grow flag is a bit of a misnomer - it can mean either grow or shrink:

sudo mdadm /dev/md0 --fail /dev/{failed drive}
sudo mdadm /dev/md0 --remove /dev/{failed drive}
sudo mdadm --grow /dev/md0 --raid-devices=2
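
Afterwards it is worth confirming that the array reports two active devices and a clean state, and, if your distribution keeps an /etc/mdadm/mdadm.conf, that its ARRAY line still matches the new layout; mdadm can print a suitable line for you to compare against:

sudo mdadm --detail /dev/md0
sudo mdadm --detail --scan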

With regard to errors, a link problem with the drive (i.e. the PATA/SATA port, cable, or drive connector) is not enough to trigger a failover to a hot spare, as the kernel will typically switch to using the other "good" drive while it resets the link to the "bad" one. I know this because I run a 3-drive array, 2 hot, 1 spare, and one of the drives recently decided to barf up a bit in the logs. When I tested all the drives in the array, all 3 passed the "long" version of the SMART test, so it isn't a problem with the platters, mechanical components, or the onboard controller - which leaves a flaky cable or a bad SATA port. Perhaps this is what you are seeing. Try switching the drive to a different motherboard port, or using a different cable, and see if it improves.
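
If you want to run the same check yourself, the "long" test can be started and read back with smartctl from the smartmontools package; it runs in the background and can take several hours, and /dev/sdX below is just a placeholder for your own drive:

sudo smartctl -t long /dev/sdX
sudo smartctl -l selftest /dev/sdX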


A follow-up: I completed my expansion of the mirror to 3 drives, failed and removed the flaky drive from the md array, hot-swapped the cable for a new one (the motherboard supports this) and re-added the drive. Upon re-add, it immediately started a re-sync of the drive. So far, not a single error has appeared in the log despite the drive being heavily used. So, yes, drive cables can go flaky.
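
For reference, the re-add itself was nothing special; it is essentially the same --add shown at the top (mdadm also has a --re-add option for a drive that was recently a member of the array), and the device name below is again just an example:

sudo mdadm /dev/md0 --re-add /dev/sdb1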

Solution 2:

I've had exactly the same problem, and in my case I found that the active RAID disk suffered from read errors during synchronization. As a result, the new disk was never successfully synchronized and was kept marked as a spare.

You might want to check /var/log/messages and other system logs for errors (a quick way to do this is shown after the test output below). It might also be a good idea to check your disk's SMART status:
1) Run the short test:

"smartctl -t short /dev/sda"

2) Display the test results:

"smartctl -l selftest /dev/sda"

In my case this returned something like this:

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
  1  Extended offline    Completed: read failure        90%             7564            27134728
  2  Short offline       Completed: read failure        90%             7467          1408449701
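
For the system-log check mentioned above, grepping recent kernel messages for the affected device is usually enough to surface the read errors; the device name is again just an example:

dmesg | grep -i sda
grep -i sda /var/log/messages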

I had to boot a live distro and manually copy the data from the defective disk to the new (currently "spare") one.
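
If you end up having to do the same copy, GNU ddrescue is one reasonable tool for it, since it keeps going past read errors and records what it could not recover in a map file; the device names and map file below are purely illustrative, so check them against your own layout before running anything:

sudo ddrescue -f -n /dev/sda1 /dev/sdb1 rescue.map
sudo ddrescue -f -r3 /dev/sda1 /dev/sdb1 rescue.map

The first pass copies the readable areas quickly; the second retries the sectors that failed.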