RAID 5 is broken after replacing a disk
My server mailed me that one of my disks had failed to read a block, so I decided to replace it before it failed completely. I added a new disk and replaced the failing one:
sudo mdadm --manage /dev/md0 --add /dev/sdg1
sudo mdadm --manage /dev/md0 --replace /dev/sdb1 --with /dev/sdg1
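(The progress of the replace can be followed in /proc/mdstat, for example:)
watch cat /proc/mdstat    # shows the rebuild/replace progress of md0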
After the sync I wanted to remove the failed /dev/sdb1, and removed it from the array with:
sudo mdadm --manage /dev/md0 --remove /dev/sdb1
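(Before pulling a disk from the case it is worth matching the device name to the physical drive's serial number; a quick sketch, assuming the smartmontools package is installed:)
ls -l /dev/disk/by-id/ | grep sdb    # link names contain the model/serial of the drive
sudo smartctl -i /dev/sdb            # prints the serial number, among other identity info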
But when I wanted to remove the disk from the case, I first pulled two other disks by mistake and put them back immediately. After that I checked whether my RAID was still working, and it wasn't. I tried to reboot, hoping it would heal itself. In the past this was never a problem, but I had also never replaced a disk before.
Since this did not work, I looked into what to do and tried to re-add the disk, but this did not help; assembling did not work either:
sudo mdadm --assemble --scan
This detects only 2 disks, so I tried to pass it the names of the disks explicitly:
sudo mdadm -v -A /dev/md0 /dev/sda1 /dev/sdf1 /dev/sdc1 /dev/sdd1
but it tells me all the disks are busy:
sudo mdadm -v -A /dev/md0 /dev/sda1 /dev/sdf1 /dev/sdc1 /dev/sdd1
mdadm: looking for devices for /dev/md0
mdadm: /dev/sda1 is busy - skipping
mdadm: /dev/sdf1 is busy - skipping
mdadm: /dev/sdc1 is busy - skipping
mdadm: /dev/sdd1 is busy - skipping
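(The members are most likely busy because an inactive /dev/md0, left over from the scan, still holds them; stopping that array first should release the devices before retrying the assemble. A sketch with the device names from my setup:)
sudo mdadm --stop /dev/md0    # release the members held by the inactive array
sudo mdadm -v -A /dev/md0 /dev/sda1 /dev/sdf1 /dev/sdc1 /dev/sdd1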
(sdg1 became sdf1 after the restart.)
mdstat seems to detect the disks correctly (I inserted sdb1 again in the hope that it would help, and tried with and without it):
cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : inactive sdd1[3](S) sdb1[1](S) sdc1[2](S) sda1[0](S) sdf1[4](S)
14650670080 blocks super 1.2
unused devices: <none>
If I query the disks, only /dev/sda1 and /dev/sdf1 show me the same Array State AA..:
sudo mdadm --query --examine /dev/sda1
/dev/sda1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : 7c3e9d4e:6bad2afa:85cd55b4:43e43f56
Name : lianli:0 (local to host lianli)
Creation Time : Sat Oct 29 18:52:27 2016
Raid Level : raid5
Raid Devices : 4
Avail Dev Size : 5860268032 (2794.39 GiB 3000.46 GB)
Array Size : 8790402048 (8383.18 GiB 9001.37 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
Unused Space : before=262056 sectors, after=0 sectors
State : clean
Device UUID : 3e912563:b10b74d0:a49faf2d:e14db558
Internal Bitmap : 8 sectors from superblock
Update Time : Sat Jan 9 10:06:33 2021
Bad Block Log : 512 entries available at offset 72 sectors
Checksum : c7d96490 - correct
Events : 303045
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 0
Array State : AA.. ('A' == active, '.' == missing, 'R' == replacing)
sudo mdadm --query --examine /dev/sdd1
/dev/sdd1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : 7c3e9d4e:6bad2afa:85cd55b4:43e43f56
Name : lianli:0 (local to host lianli)
Creation Time : Sat Oct 29 18:52:27 2016
Raid Level : raid5
Raid Devices : 4
Avail Dev Size : 5860268032 (2794.39 GiB 3000.46 GB)
Array Size : 8790402048 (8383.18 GiB 9001.37 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
Unused Space : before=262056 sectors, after=0 sectors
State : clean
Device UUID : bf303286:5889dc0c:a6a1824a:4fe1ae03
Internal Bitmap : 8 sectors from superblock
Update Time : Sat Jan 9 10:05:58 2021
Bad Block Log : 512 entries available at offset 72 sectors
Checksum : ef1f16fd - correct
Events : 303036
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 3
Array State : AA.A ('A' == active, '.' == missing, 'R' == replacing)
sudo mdadm --query --examine /dev/sdc1
/dev/sdc1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : 7c3e9d4e:6bad2afa:85cd55b4:43e43f56
Name : lianli:0 (local to host lianli)
Creation Time : Sat Oct 29 18:52:27 2016
Raid Level : raid5
Raid Devices : 4
Avail Dev Size : 5860268032 (2794.39 GiB 3000.46 GB)
Array Size : 8790402048 (8383.18 GiB 9001.37 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
Unused Space : before=262056 sectors, after=0 sectors
State : clean
Device UUID : b29aba8f:f92c2b65:d155a3a8:40f41859
Internal Bitmap : 8 sectors from superblock
Update Time : Sat Jan 9 10:04:33 2021
Bad Block Log : 512 entries available at offset 72 sectors
Checksum : 47feb45 - correct
Events : 303013
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 2
Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)
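(To compare the members quickly, the interesting fields can be pulled out of --examine for all disks at once; the device names are from my setup:)
sudo mdadm --examine /dev/sd[abcdf]1 | grep -E '^/dev/|Events|Update Time|Array State'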
I will keep trying, but currently I have run out of ideas; it is also the first time I have replaced a disk in the RAID. Hopefully someone can help me.
At least I also have a backup, but I don't want to reset the hard drives only to find out that the backup does not work either...
Update: After passing all the disks to the assemble command I got:
sudo mdadm -v -A /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sdf1
mdadm: looking for devices for /dev/md0
mdadm: /dev/sda1 is identified as a member of /dev/md0, slot 0.
mdadm: /dev/sdb1 is identified as a member of /dev/md0, slot 1.
mdadm: /dev/sdc1 is identified as a member of /dev/md0, slot 2.
mdadm: /dev/sdd1 is identified as a member of /dev/md0, slot 3.
mdadm: /dev/sdf1 is identified as a member of /dev/md0, slot 1.
mdadm: added /dev/sdf1 to /dev/md0 as 1
mdadm: added /dev/sdc1 to /dev/md0 as 2 (possibly out of date)
mdadm: added /dev/sdd1 to /dev/md0 as 3 (possibly out of date)
mdadm: added /dev/sda1 to /dev/md0 as 0
mdadm: /dev/md0 assembled from 2 drives - not enough to start the array.
Solution 1:
I found a solution:
After more research, and with the 'possibly out of date' hint I got in verbose mode (sudo mdadm -v -A /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sdf1), I found this page: https://raid.wiki.kernel.org/index.php/RAID_Recovery
In the section 'Trying to assemble using --force', they describe using --force if the event count difference is lower than 50. Mine was much lower, so I tried it; the RAID array came up again, and it still detects one of the disks as out of date, but I hope it will be able to sync it from the information on the other disks. It could be that I lost some data, but I learned that if I ever remove the wrong disk from the array, I should wait until the array is back in sync...
The commands I used to get my RAID working again:
sudo mdadm --stop /dev/md0
sudo mdadm -v -A --force /dev/md0 /dev/sda1 /dev/sdc1 /dev/sdd1 /dev/sdf1
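(Afterwards the state of the array can be checked with, for example:)
cat /proc/mdstat              # shows whether md0 is active and syncing
sudo mdadm --detail /dev/md0  # shows the state of each member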
Update:
One drive was apparently not added by the forced assemble; the force only brought one drive back to get the array into a workable state. The device with the biggest event count difference had to be added afterwards with --re-add:
sudo mdadm --manage /dev/md0 --re-add /dev/sdc1
Now my array is back in sync and I can try to remove the faulty hard drive again.
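(Since /dev/sdb1 is no longer part of the array after the forced assemble, what remains is mainly to make sure it is not picked up again and then to pull the right physical disk, after double-checking its serial number. A sketch, assuming /dev/sdb1 really is the disk to discard:)
sudo mdadm --zero-superblock /dev/sdb1    # wipe the md superblock only once the array no longer needs this disk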