mdadm: drive replacement shows up as spare and refuses to sync

Prelude

I had the following devices in my /dev/md0 RAID 6: /dev/sd[abcdef]

The following drives were also present, unrelated to the RAID: /dev/sd[gh]

The following devices belonged to a card reader that was connected at the time, again unrelated to the RAID: /dev/sd[ijkl]
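For anyone reproducing the layout, a quick way to see which block devices an array currently holds is something like this (a small sketch using standard lsblk/mdadm options, not output from my system):

lsblk -o NAME,SIZE,TYPE,MOUNTPOINT       # list every block device, card-reader slots included
mdadm --detail /dev/md0 | grep '/dev/'   # show the device lines for /dev/md0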

Analysis

sdf's SATA cable went bad (you could say it was unplugged while in use), and sdf was subsequently rejected from the /dev/md0 array. I replaced the cable and the drive came back, now as /dev/sdm. Please do not challenge my diagnosis; there is no problem with the drive.
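As an aside for anyone in the same boat: one way to confirm that a drive reappearing under a new name is the same physical disk is a check along these lines (a sketch, assuming smartmontools is installed; the serial-number comparison is a suggestion, not something the fix depends on):

smartctl -i /dev/sdm | grep -i serial    # compare the serial number against the old /dev/sdf
mdadm --examine /dev/sdm                 # its superblock should still carry md0's Array UUID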

mdadm --detail /dev/md0 showed sdf(F), i.e. that sdf was marked faulty. So I used mdadm --manage /dev/md0 --remove faulty to remove the faulty drive.
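For reference, the generic shape of that fail-and-remove step is roughly the following (a sketch; the explicit device path is only usable while the node still exists, which in my case it did not):

mdadm --manage /dev/md0 --fail /dev/sdf      # mark the member faulty, if the kernel has not already
mdadm --manage /dev/md0 --remove /dev/sdf    # then drop it from the array
mdadm --manage /dev/md0 --remove detached    # variant for members whose device node has vanished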

Now mdadm --detail /dev/md0 showed "removed" in the space where sdf used to be.

root@galaxy:~# mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Wed Jul 30 13:17:25 2014
     Raid Level : raid6
     Array Size : 15627548672 (14903.59 GiB 16002.61 GB)
  Used Dev Size : 3906887168 (3725.90 GiB 4000.65 GB)
   Raid Devices : 6
  Total Devices : 5
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Tue Mar 17 21:16:14 2015
          State : active, degraded
 Active Devices : 5
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : eclipse:0
           UUID : cc7dac66:f6ac1117:ca755769:0e59d5c5
         Events : 67205

    Number   Major   Minor   RaidDevice State
       0       8        0        0      active sync   /dev/sda
       1       8       32        1      active sync   /dev/sdc
       4       0        0        4      removed
       3       8       48        3      active sync   /dev/sdd
       4       8       64        4      active sync   /dev/sde
       5       8       16        5      active sync   /dev/sdb

For some reason the RaidDevice number of the "removed" slot now matches that of an active device. Anyway, let's try adding the previous device (now known as /dev/sdm), since that was the original intent:

root@galaxy:~# mdadm --add /dev/md0 /dev/sdm
mdadm: added /dev/sdm
root@galaxy:~# mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Wed Jul 30 13:17:25 2014
     Raid Level : raid6
     Array Size : 15627548672 (14903.59 GiB 16002.61 GB)
  Used Dev Size : 3906887168 (3725.90 GiB 4000.65 GB)
   Raid Devices : 6
  Total Devices : 6
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Tue Mar 17 21:19:30 2015
          State : active, degraded
 Active Devices : 5
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 512K

           Name : eclipse:0
           UUID : cc7dac66:f6ac1117:ca755769:0e59d5c5
         Events : 67623

    Number   Major   Minor   RaidDevice State
       0       8        0        0      active sync   /dev/sda
       1       8       32        1      active sync   /dev/sdc
       4       0        0        4      removed
       3       8       48        3      active sync   /dev/sdd
       4       8       64        4      active sync   /dev/sde
       5       8       16        5      active sync   /dev/sdb

       6       8      192        -      spare   /dev/sdm

As you can see, the device shows up as a spare and refuses to sync with the rest of the array:

root@galaxy:~# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdm[6](S) sdb[5] sda[0] sde[4] sdd[3] sdc[1]
      15627548672 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/5] [UU_UUU]
      bitmap: 17/30 pages [68KB], 65536KB chunk

unused devices: <none>

I have also tried using mdadm --zero-superblock /dev/sdm before adding, with the same result.
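For completeness, that retry looked roughly like this (same device name as above; it made no difference to the outcome):

mdadm --manage /dev/md0 --remove /dev/sdm    # take the non-syncing spare back out
mdadm --zero-superblock /dev/sdm             # wipe the old member metadata from the disk
mdadm --manage /dev/md0 --add /dev/sdm       # add it again as if it were a brand-new drive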

The reason I am using RAID 6 is to provide high availability. I will not accept stopping /dev/md0 and re-assembling it with --assume-clean or similar workarounds; this needs to be resolved online, otherwise I don't see the point of using mdadm.


Solution 1:

After hours of Googling and some extremely wise help from JyZyXEL in the #linux-raid Freenode channel, we have a solution! There was not a single interruption to the RAID array during this process - exactly what I needed and expected from mdadm.

For some (currently unknown) reason, the array's sync state had become frozen. The winning command to figure this out is cat /sys/block/md0/md/sync_action:

root@galaxy:~# cat /sys/block/md0/md/sync_action
frozen

Simply put, that is why the array was not using the available spare. All my hair is gone, for the want of a simple cat command!
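For future reference, checking this across every md array on the box is a one-liner (a minimal sketch; the paths follow the standard md sysfs layout):

for a in /sys/block/md*/md/sync_action; do printf '%s: ' "$a"; cat "$a"; done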

So, just unfreeze the array:

root@galaxy:~# echo idle > /sys/block/md0/md/sync_action

And you're away!

root@galaxy:~# cat /sys/block/md0/md/sync_action
recover
root@galaxy:~# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdm[6] sdb[5] sda[0] sde[4] sdd[3] sdc[1]
      15627548672 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/5] [UU_UUU]
      [>....................]  recovery =  0.0% (129664/3906887168) finish=4016.8min speed=16208K/sec
      bitmap: 17/30 pages [68KB], 65536KB chunk

unused devices: <none>
root@galaxy:~# mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Wed Jul 30 13:17:25 2014
     Raid Level : raid6
     Array Size : 15627548672 (14903.59 GiB 16002.61 GB)
  Used Dev Size : 3906887168 (3725.90 GiB 4000.65 GB)
   Raid Devices : 6
  Total Devices : 6
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Tue Mar 17 22:05:30 2015
          State : active, degraded, recovering
 Active Devices : 5
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 512K

 Rebuild Status : 0% complete

           Name : eclipse:0
           UUID : cc7dac66:f6ac1117:ca755769:0e59d5c5
         Events : 73562

    Number   Major   Minor   RaidDevice State
       0       8        0        0      active sync   /dev/sda
       1       8       32        1      active sync   /dev/sdc
       6       8      192        2      spare rebuilding   /dev/sdm
       3       8       48        3      active sync   /dev/sdd
       4       8       64        4      active sync   /dev/sde
       5       8       16        5      active sync   /dev/sdb

Bliss :-)
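If you want to keep an eye on the rebuild without re-running commands by hand, something along these lines works (a convenience sketch, not part of the fix itself):

watch -n 60 cat /proc/mdstat                        # refresh the recovery progress every minute
mdadm --wait /dev/md0 && echo 'recovery finished'   # or block until the recovery completes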