How to clear ZFS DEGRADED status in repaired pool
I got my first drive failure after a couple of years of maintaining this zpool, so I ran zpool replace to swap in one of my spares. Resilvering the array took 60 hours (as shown below), but it completed with zero errors.
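For reference, the replace was roughly of this form (a sketch reconstructed from the device IDs in the status output below, failed disk first and then the spare that took over):

zpool replace sbn ata-ST4000DM005-2DP166_ZDH1TNCF ata-ST4000DM005-2DP166_ZDH1TW8L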
The problem is that it is still showing DEGRADED status. The output is:
# zpool status
  pool: sbn
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid. Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: resilvered 1.07T in 60h9m with 0 errors on Fri Aug 7 01:15:41 2020
config:

        NAME                                    STATE     READ WRITE CKSUM
        sbn                                     DEGRADED     0     0     0
          raidz2-0                              DEGRADED     0     0     0
            ata-ST4000DM005-2DP166_ZDH1TP9H     ONLINE       0     0     0
            ata-ST4000DM005-2DP166_ZDH1TM7G     ONLINE       0     0     0
            ata-ST4000DM005-2DP166_ZDH1TLHP     ONLINE       0     0     0
            ata-ST4000DM005-2DP166_ZDH1TL8F     ONLINE       0     0     0
            ata-ST4000DM005-2DP166_ZDH1TNT8     ONLINE       0     0     0
            spare-5                             UNAVAIL      0     0     0
              15983766503331633058              UNAVAIL      0     0     0  was /dev/disk/by-id/ata-ST4000DM005-2DP166_ZDH1TNCF-part1
              ata-ST4000DM005-2DP166_ZDH1TW8L   ONLINE       0     0     0
            ata-ST4000DM005-2DP166_ZDH1TW63     ONLINE       0     0     0
            ata-ST4000DM005-2DP166_ZDH1TM4R     ONLINE       0     0     0
            ata-ST4000DM005-2DP166_ZDH1TLSG     ONLINE       0     0     0
            ata-ST4000DM005-2DP166_ZDH1TMAM     ONLINE       0     0     0
        spares
          ata-ST4000DM005-2DP166_ZDH1TW8L       INUSE     currently in use
          ata-ST4000DM005-2DP166_ZDH1TM17       AVAIL

errors: No known data errors
I can't find any documentation that explains the spare-5 structure, which showed up after I did the replace. The dead drive now shows up as 15983766503331633058, and the pool still remembers the original failed disk ID as ata-ST4000DM005-2DP166_ZDH1TNCF.
How do I clean this up so it is running with 10 clean drives again with one available spare?
Solution 1:
You would need to run the following command:
zpool clear sbn
This clears all errors associated with the virtual devices in the pool and resets any data error counts associated with the pool.
Source: https://docs.oracle.com/cd/E36784_01/html/E36835/gbbvf.html
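If you only want to reset the error counters on a single device rather than the whole pool, zpool clear also accepts a device argument, for example (using one of the pool's disks from the status output above purely as an illustration):

zpool clear sbn ata-ST4000DM005-2DP166_ZDH1TP9H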
Solution 2:
After some time I found the answer: the failed drive needs to be detached from the pool. In this specific case I ran:
zpool detach sbn ata-ST4000DM005-2DP166_ZDH1TNCF
Note that the drive ID is taken from the "was" field in the zpool status output above. Once this is done, zpool status comes back clean and the pool is marked state: ONLINE.
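Detaching the failed disk promotes the in-use spare ata-ST4000DM005-2DP166_ZDH1TW8L to a permanent member of raidz2-0 and removes it from the spares list, leaving ata-ST4000DM005-2DP166_ZDH1TM17 as the one available spare. If you later install a replacement disk and want a second hot spare again, it can be added back with something like the following (the disk ID here is only a placeholder for the new drive):

zpool add sbn spare ata-ST4000DM005-2DP166_XXXXXXXX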
Hopefully this helps someone in a similar situation.