DEGRADED zfs pool vs FAULTED
My backup NAS (Arch-based) reports a degraded pool. It also reports a degraded disk as "repairing". I'm confused by this. Presuming that faulted is worse than degraded, should I be worried?
zpool status -v:
pool: zdata
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://zfsonlinux.org/msg/ZFS-8000-8A
scan: scrub in progress since Mon Dec 16 11:35:37 2019
1.80T scanned at 438M/s, 996G issued at 73.7M/s, 2.22T total
1.21M repaired, 43.86% done, 0 days 04:55:13 to go
config:
NAME                              STATE     READ WRITE CKSUM
zdata                             DEGRADED     0     0     0
  wwn-0x50014ee0019b83a6-part1    ONLINE       0     0     0
  wwn-0x50014ee057084591-part1    ONLINE       0     0     0
  wwn-0x50014ee0ac59cb99-part1    DEGRADED   224     0   454  too many errors (repairing)
  wwn-0x50014ee2b3f6d328-part1    ONLINE       0     0     0
logs
  wwn-0x50000f0056424431-part5    ONLINE       0     0     0
cache
  wwn-0x50000f0056424431-part4    ONLINE       0     0     0
errors: Permanent errors have been detected in the following files:
zdata/backup:<0x86697>
Also, the failing disk is reported with much less space allocated than the others. zpool iostat -v:
                                   capacity     operations     bandwidth
pool                              alloc   free   read  write   read  write
------------------------------    -----  -----  -----  -----  -----  -----
zdata                             2.22T  1.41T     33     34  31.3M  78.9K
  wwn-0x50014ee0019b83a6-part1     711G   217G     11      8  10.8M  18.0K
  wwn-0x50014ee057084591-part1     711G   217G     10     11  9.73M  24.6K
  wwn-0x50014ee0ac59cb99-part1     103G   825G      0     10      0  29.1K
  wwn-0x50014ee2b3f6d328-part1     744G   184G     11      2  10.7M  4.49K
logs                                  -      -      -      -      -      -
  wwn-0x50000f0056424431-part5       4K   112M      0      0      0      0
cache                                 -      -      -      -      -      -
  wwn-0x50000f0056424431-part4    94.9M  30.9G      0      1      0   128K
------------------------------    -----  -----  -----  -----  -----  -----
[EDIT] As the hard disk kept reporting errors I decided to replace it with a spare one. First I issued an add-spare command for the new disk, which was then included in the pool; after that I issued a replace command to swap the degraded disk for the spare (roughly the commands shown after the status output below). It might not have improved things, as the pool now reads:
pool: zdata
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Sun Dec 22 10:20:20 2019
36.5G scanned at 33.2M/s, 27.4G issued at 24.9M/s, 2.21T total
0B resilvered, 1.21% done, 1 days 01:35:59 to go
config:
NAME                              STATE     READ WRITE CKSUM
zdata                             DEGRADED     0     0     0
  wwn-0x50014ee0019b83a6-part1    ONLINE       0     0     0
  wwn-0x50014ee057084591-part1    ONLINE       0     0     0
  spare-2                         DEGRADED     0     0     0
    wwn-0x50014ee0ac59cb99-part1  DEGRADED     0     0     0  too many errors
    wwn-0x50014ee25ea101ef        ONLINE       0     0     0
  wwn-0x50014ee2b3f6d328-part1    ONLINE       0     0     0
logs
  wwn-0x50000f0056424431-part5    ONLINE       0     0     0
cache
  wwn-0x50000f0056424431-part4    ONLINE       0     0     0
spares
  wwn-0x50014ee25ea101ef          INUSE     currently in use
errors: No known data errors
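For reference, the spare/replace step described above was done with roughly these commands (reconstructed from memory and from the device names in the output, so treat the exact syntax as a sketch):
zpool add zdata spare wwn-0x50014ee25ea101ef
zpool replace zdata wwn-0x50014ee0ac59cb99-part1 wwn-0x50014ee25ea101ef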
What worries me is that the "to go" estimate keeps going up(!). In the time it took me to write this it already reads 1 days 05:40:10. I assume the pool is lost for good if another disk, the controller, or the power fails before the resilver completes.
[EDIT] The new drive finished resilvering after 4 hours or so; ZFS's estimate was apparently not very accurate. After detaching the faulty drive I am now in the situation where the new drive shows only 103G used of a 1TB disk, just like the DEGRADED drive did. How do I get this to the full 1TB?
Solution 1:
Generally speaking, a DEGRADED disk is in better shape than a FAULTED one.
From the zpool man page (slightly reformatted):
DEGRADED: The number of checksum errors exceeds acceptable levels and the device is degraded as an indication that something may be wrong. ZFS continues to use the device as necessary
FAULTED: The number of I/O errors exceeds acceptable levels and the device is faulted to prevent further use of the device.
In your specific case, a scrub discovered many read and checksum errors on one disk and ZFS started repairing the affected disk. Meanwhile, ZED (the ZFS event daemon) noticed the burst of checksum errors and degraded the disk to avoid using/stressing it further.
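If you want to see the events ZED acted on, you can inspect the ZFS event log and the daemon's own log; the zfs-zed service name below is an assumption for a systemd-based distro such as Arch:
zpool events -v | less    # kernel/ZED event history, including the checksum errors
journalctl -u zfs-zed     # ZED log (service name may differ on your install)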
When the scrub finishes, I suggest you zpool clear your pool and run another zpool scrub. If the second scrub finds no errors you can continue using the pool, but considering how many errors the current scrub turned up, I would replace the disk as soon as possible.
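Concretely, that sequence would look something like this (pool name taken from your output):
zpool clear zdata        # reset the error counters and device fault state
zpool scrub zdata        # start a fresh scrub
zpool status -v zdata    # watch its progress and the error counters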
If you have valid reason to believe the disk itself is not at fault, you should analyze the dmesg and smartctl --all output to pin down the root cause. Case in point: I once had a disk which was itself fine, but which produced many real errors due to a noisy PSU/cable.
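For example (the /dev/disk/by-id path below is an assumption derived from the wwn name in your zpool status; point smartctl at whatever device node the disk actually has):
dmesg | grep -iE 'ata|error'                             # look for link resets, timeouts, transport errors
smartctl --all /dev/disk/by-id/wwn-0x50014ee0ac59cb99    # check reallocated/pending sectors and CRC error counts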
Anyway, the golden rule always applies: be sure to have an up-to-date backup of your pool data.