Linux software RAID 10 hung after one drive failed; mdadm will not let me force-remove the faulty device

I have a Linux software RAID 10 setup consisting of 5 RAID 1 arrays (two drives per mirrored pair) with a RAID 0 striped across all 5 of them. To check that none of the drives were going to fail quickly under load, I ran badblocks across the RAID 0 in destructive read/write mode.

Badblocks command:

    badblocks -b 4096 -c 98304 -p 0 -w -s /dev/md13
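
For reference, the nested layout was built along these lines; the RAID 1 member partitions below are placeholders, only the md device names and the 64K chunk size come from the mdadm output further down:

    # Five two-disk mirrors (member partition names here are placeholders)
    mdadm --create /dev/md7 --level=1 --raid-devices=2 /dev/sdX1 /dev/sdY1
    # ... and likewise for /dev/md8 through /dev/md11 ...

    # A RAID 0 striped across the five mirrors, 64K chunks
    mdadm --create /dev/md13 --level=0 --chunk=64 --raid-devices=5 \
        /dev/md7 /dev/md8 /dev/md9 /dev/md10 /dev/md11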

One of the drives failed, and instead of badblocks carrying on happily, it hung. Running a sync command also hangs. I would assume this isn't standard behavior for a RAID 1 device: if one of the drives fails, writes to the virtual device that the two drives make up should still succeed without a problem.
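
For what it's worth, these are the standard places I have been looking while it is hung; nothing here is specific to my setup:

    # Array state as the kernel sees it
    cat /proc/mdstat

    # Recent kernel messages (look for ata/scsi errors and md write failures)
    dmesg | tail -n 50

    # Processes stuck in uninterruptible sleep (state D), e.g. badblocks and sync
    ps axo pid,stat,comm | awk '$2 ~ /D/'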

So I proceeded to force-fail the drive and try to remove it. I can mark the drive as faulty without any problem (however, the I/O operations are still hung), but I cannot remove the device from the array entirely; mdadm says it is busy. My assumption is that if I can kick it out of the array entirely the I/O will continue, but that is just an assumption, and I do think I am dealing with a bug of sorts.
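
These are roughly the commands I am running; the exact error wording is from memory, so treat it as approximate:

    # Marking the member faulty succeeds
    mdadm --manage /dev/md8 --fail /dev/sdn1

    # Removing it does not; mdadm reports (approximately):
    mdadm --manage /dev/md8 --remove /dev/sdn1
    #   mdadm: hot remove failed for /dev/sdn1: Device or resource busy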

What is going on here exactly? Am I in an unrecoverable spot due to a bug?

The system is running kernel 2.6.18, so it isn't exactly new, but I would think that, given how long software RAID has been around, issues like these would not happen.

Any insight is greatly appreciated.

mdadm --detail /dev/md13

    /dev/md13:
            Version : 00.90.03
      Creation Time : Thu Jan 21 14:21:57 2010
         Raid Level : raid0
         Array Size : 2441919360 (2328.80 GiB 2500.53 GB)
       Raid Devices : 5
      Total Devices : 5
    Preferred Minor : 13
        Persistence : Superblock is persistent

        Update Time : Thu Jan 21 14:21:57 2010
              State : clean
     Active Devices : 5
    Working Devices : 5
     Failed Devices : 0
      Spare Devices : 0

         Chunk Size : 64K

               UUID : cfabfaee:06cf0cb2:22929c7b:7b037984
             Events : 0.3

        Number   Major   Minor   RaidDevice State
           0       9        7        0      active sync   /dev/md7
           1       9        8        1      active sync   /dev/md8
           2       9        9        2      active sync   /dev/md9
           3       9       10        3      active sync   /dev/md10
           4       9       11        4      active sync   /dev/md11

The output for the failing RAID 1 array:

    /dev/md8:
            Version : 00.90.03
      Creation Time : Thu Jan 21 14:20:47 2010
         Raid Level : raid1
         Array Size : 488383936 (465.76 GiB 500.11 GB)
        Device Size : 488383936 (465.76 GiB 500.11 GB)
       Raid Devices : 2
      Total Devices : 2
    Preferred Minor : 8
        Persistence : Superblock is persistent

        Update Time : Mon Jan 25 04:52:25 2010
              State : active, degraded
     Active Devices : 1
    Working Devices : 1
     Failed Devices : 1
      Spare Devices : 0

               UUID : 2865aefa:ab6358d8:8f82caf4:1663e806
             Events : 0.11

        Number   Major   Minor   RaidDevice State
           0      65       17        0      active sync   /dev/sdr1
           1       8      209        1      faulty        /dev/sdn1

Sorry, maybe I didn't understand the problem well, and a cat /proc/mdstat would be helpful, but as far as I can see you shot yourself in the foot by destroying the data on the RAID 0 and therefore on the underlying RAID 1 arrays. That is, if you want to test RAID reliability, you should mark a single drive (a disk) as failed, not destroy logical blocks that map onto all of the underlying RAID 1 disks, if I understood the problem correctly (let me know).
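
If the goal is only to see how the array copes with a dying disk, something along these lines exercises the failure and rebuild path without wiping the whole stripe set (device names taken from your output, adapt as needed):

    # Simulate a failure on one mirror member and watch the array degrade
    mdadm /dev/md8 --fail /dev/sdn1
    cat /proc/mdstat

    # Kick the "failed" member out, then add it back to trigger a resync
    mdadm /dev/md8 --remove /dev/sdn1
    mdadm /dev/md8 --add /dev/sdn1

    # Follow the rebuild progress
    watch cat /proc/mdstat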