RAID 5 is broken after replacing a disk

My server mailed me that one of my disks had failed to read a block, so I decided to replace it before it failed completely. I added a new disk and replaced the failing one:

sudo mdadm --manage /dev/md0 --add /dev/sdg1
sudo mdadm --manage /dev/md0 --replace /dev/sdb1 --with /dev/sdg1
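
(The progress of the copy and the per-device states can be watched with the commands below; /dev/md0 is just my array name here.)

sudo mdadm --detail /dev/md0
cat /proc/mdstat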

After the sync I wanted to remove the failed /dev/sdb1, so I removed it from the array with:

sudo mdadm --manage /dev/md0 --remove /dev/sdb1
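
(To match the device name to the physical drive before pulling it, the serial number can be looked up, e.g. via the by-id symlinks, or with smartctl if smartmontools is installed:)

ls -l /dev/disk/by-id/ | grep sdb
sudo smartctl -i /dev/sdb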

But when I went to remove the disk from the case, I first pulled two other disks by mistake and put them back in immediately. After that I checked whether my RAID was still working, and it wasn't. I rebooted in the hope it would heal itself; in the past that was never a problem, but I had also never replaced a disk before.

Since that didn't work, I looked into what to do and tried to re-add the disk, but that didn't help, and assembling didn't work either:

sudo mdadm --assemble --scan

This detects only 2 disks, so I tried to give it the names of the disks explicitly:

sudo mdadm -v -A /dev/md0 /dev/sda1 /dev/sdf1 /dev/sdc1 /dev/sdd1

but it tells me all the disks are busy:

sudo mdadm -v -A /dev/md0 /dev/sda1 /dev/sdf1 /dev/sdc1 /dev/sdd1 
mdadm: looking for devices for /dev/md0
mdadm: /dev/sda1 is busy - skipping
mdadm: /dev/sdf1 is busy - skipping
mdadm: /dev/sdc1 is busy - skipping
mdadm: /dev/sdd1 is busy - skipping

(sdg1 became sdf1 after the restart.)
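
(I suspect the 'is busy' messages just mean the half-assembled, inactive md0 is still claiming the member devices; stopping it first should release them, which is effectively what the solution below ends up doing:)

sudo mdadm --stop /dev/md0
sudo mdadm -v -A /dev/md0 /dev/sda1 /dev/sdf1 /dev/sdc1 /dev/sdd1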

mdstat seems to detect the disks correctly (I inserted sdb1 again in the hope it would help, and tried with and without it):

cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md0 : inactive sdd1[3](S) sdb1[1](S) sdc1[2](S) sda1[0](S) sdf1[4](S)
      14650670080 blocks super 1.2
       
unused devices: <none>

If I query the disks, only /dev/sda1 and /dev/sdf1 show the same Array State, AA..:

sudo mdadm --query --examine /dev/sda1 
/dev/sda1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 7c3e9d4e:6bad2afa:85cd55b4:43e43f56
           Name : lianli:0  (local to host lianli)
  Creation Time : Sat Oct 29 18:52:27 2016
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 5860268032 (2794.39 GiB 3000.46 GB)
     Array Size : 8790402048 (8383.18 GiB 9001.37 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262056 sectors, after=0 sectors
          State : clean
    Device UUID : 3e912563:b10b74d0:a49faf2d:e14db558

Internal Bitmap : 8 sectors from superblock
    Update Time : Sat Jan  9 10:06:33 2021
  Bad Block Log : 512 entries available at offset 72 sectors
       Checksum : c7d96490 - correct
         Events : 303045

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 0
   Array State : AA.. ('A' == active, '.' == missing, 'R' == replacing)
sudo mdadm --query --examine /dev/sdd1 
/dev/sdd1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 7c3e9d4e:6bad2afa:85cd55b4:43e43f56
           Name : lianli:0  (local to host lianli)
  Creation Time : Sat Oct 29 18:52:27 2016
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 5860268032 (2794.39 GiB 3000.46 GB)
     Array Size : 8790402048 (8383.18 GiB 9001.37 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262056 sectors, after=0 sectors
          State : clean
    Device UUID : bf303286:5889dc0c:a6a1824a:4fe1ae03

Internal Bitmap : 8 sectors from superblock
    Update Time : Sat Jan  9 10:05:58 2021
  Bad Block Log : 512 entries available at offset 72 sectors
       Checksum : ef1f16fd - correct
         Events : 303036

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 3
   Array State : AA.A ('A' == active, '.' == missing, 'R' == replacing)
sudo mdadm --query --examine /dev/sdc1 
/dev/sdc1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 7c3e9d4e:6bad2afa:85cd55b4:43e43f56
           Name : lianli:0  (local to host lianli)
  Creation Time : Sat Oct 29 18:52:27 2016
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 5860268032 (2794.39 GiB 3000.46 GB)
     Array Size : 8790402048 (8383.18 GiB 9001.37 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262056 sectors, after=0 sectors
          State : clean
    Device UUID : b29aba8f:f92c2b65:d155a3a8:40f41859

Internal Bitmap : 8 sectors from superblock
    Update Time : Sat Jan  9 10:04:33 2021
  Bad Block Log : 512 entries available at offset 72 sectors
       Checksum : 47feb45 - correct
         Events : 303013

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 2
   Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)
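
(To compare the members at a glance, the relevant fields can be pulled out with a one-liner like this, using the device names from my setup:)

sudo mdadm --examine /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sdf1 | grep -E '^/dev|Events|Array State'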

I will keep on trying, but for now I have run out of ideas. It is also the first time I have replaced a disk in the RAID. Hopefully someone can help me.

At least I also have a backup, but I don't want to wipe the hard drives only to find out that the backup doesn't work either...

Update: After passing all the disks to assemble, I got:

sudo mdadm -v -A /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sdf1
mdadm: looking for devices for /dev/md0
mdadm: /dev/sda1 is identified as a member of /dev/md0, slot 0.
mdadm: /dev/sdb1 is identified as a member of /dev/md0, slot 1.
mdadm: /dev/sdc1 is identified as a member of /dev/md0, slot 2.
mdadm: /dev/sdd1 is identified as a member of /dev/md0, slot 3.
mdadm: /dev/sdf1 is identified as a member of /dev/md0, slot 1.
mdadm: added /dev/sdf1 to /dev/md0 as 1
mdadm: added /dev/sdc1 to /dev/md0 as 2 (possibly out of date)
mdadm: added /dev/sdd1 to /dev/md0 as 3 (possibly out of date)
mdadm: added /dev/sda1 to /dev/md0 as 0
mdadm: /dev/md0 assembled from 2 drives - not enough to start the array.


Solution 1:

I found a solution:

After more research, and with the 'possibly out of date' information I got in verbose mode (sudo mdadm -v -A /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sdf1), I found this page: https://raid.wiki.kernel.org/index.php/RAID_Recovery

In the section 'Trying to assemble using --force', they describe using --force if the event count difference is lower than 50. Mine was much lower, so I tried it: the RAID array assembled again, and one of the disks is still detected as out of date, but I hope it will be able to sync it using the information from the others. It could be that I lost some data, but I learned that if I remove the wrong disk from the array, I should wait until the array is in sync again...
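
For reference, the event counts in the --examine output above were 303045 (sda1), 303036 (sdd1) and 303013 (sdc1), so the biggest difference was 303045 - 303013 = 32, well below 50.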

The commands I used to get my RAID working again:

sudo mdadm --stop /dev/md0
sudo mdadm -v -A --force /dev/md0 /dev/sda1 /dev/sdc1 /dev/sdd1 /dev/sdf1
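
(Whether all members actually made it back in, and their current state, can then be checked with:)

sudo mdadm --detail /dev/md0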

Update: One drive was apparently not added, so --force only brought in one of the stale drives to get the array back into a workable state. The device with the biggest event count difference had to be added afterwards with --re-add:

sudo mdadm --manage /dev/md0 --re-add /dev/sdc1
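
Since the array has an internal write-intent bitmap (the 'Internal Bitmap' line in the --examine output), --re-add should only have to resync the blocks that changed. The resync progress can be watched with:

watch cat /proc/mdstat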

Now my array is back in sync and I can try to remove the faulty hard drive again.