DEGRADED zfs pool vs FAULTED

My backup NAS (Arch-based) reports a degraded pool. It also reports a degraded disk as "repairing". I'm confused by this. Presuming that faulted is worse than degraded, should I be worried?

zpool status -v:

  pool: zdata
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub in progress since Mon Dec 16 11:35:37 2019
        1.80T scanned at 438M/s, 996G issued at 73.7M/s, 2.22T total
        1.21M repaired, 43.86% done, 0 days 04:55:13 to go
config:

        NAME                            STATE     READ WRITE CKSUM
        zdata                           DEGRADED     0     0     0
          wwn-0x50014ee0019b83a6-part1  ONLINE       0     0     0
          wwn-0x50014ee057084591-part1  ONLINE       0     0     0
          wwn-0x50014ee0ac59cb99-part1  DEGRADED   224     0   454  too many errors  (repairing)
          wwn-0x50014ee2b3f6d328-part1  ONLINE       0     0     0
        logs
          wwn-0x50000f0056424431-part5  ONLINE       0     0     0
        cache
          wwn-0x50000f0056424431-part4  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        zdata/backup:<0x86697>

Also, the failing disk is reported as much smaller. zpool iostat -v:

                                  capacity     operations     bandwidth
pool                            alloc   free   read  write   read  write
------------------------------  -----  -----  -----  -----  -----  -----
zdata                           2.22T  1.41T     33     34  31.3M  78.9K
  wwn-0x50014ee0019b83a6-part1   711G   217G     11      8  10.8M  18.0K
  wwn-0x50014ee057084591-part1   711G   217G     10     11  9.73M  24.6K
  wwn-0x50014ee0ac59cb99-part1   103G   825G      0     10      0  29.1K
  wwn-0x50014ee2b3f6d328-part1   744G   184G     11      2  10.7M  4.49K
logs                                -      -      -      -      -      -
  wwn-0x50000f0056424431-part5     4K   112M      0      0      0      0
cache                               -      -      -      -      -      -
  wwn-0x50000f0056424431-part4  94.9M  30.9G      0      1      0   128K
------------------------------  -----  -----  -----  -----  -----  -----

[EDIT] As the hard disk kept reporting errors, I decided to replace it with a spare one. First I issued an add spare command for the new disk, which was then included in the pool. After that I issued a replace command to replace the degraded disk with the spare. It might not have improved things, as the pool now reads:

  pool: zdata
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Dec 22 10:20:20 2019
        36.5G scanned at 33.2M/s, 27.4G issued at 24.9M/s, 2.21T total
        0B resilvered, 1.21% done, 1 days 01:35:59 to go
config:

        NAME                              STATE     READ WRITE CKSUM
        zdata                             DEGRADED     0     0     0
          wwn-0x50014ee0019b83a6-part1    ONLINE       0     0     0
          wwn-0x50014ee057084591-part1    ONLINE       0     0     0
          spare-2                         DEGRADED     0     0     0
            wwn-0x50014ee0ac59cb99-part1  DEGRADED     0     0     0  too many errors
            wwn-0x50014ee25ea101ef        ONLINE       0     0     0
          wwn-0x50014ee2b3f6d328-part1    ONLINE       0     0     0
        logs
          wwn-0x50000f0056424431-part5    ONLINE       0     0     0
        cache
          wwn-0x50000f0056424431-part4    ONLINE       0     0     0
        spares
          wwn-0x50014ee25ea101ef          INUSE     currently in use

errors: No known data errors

What worries me is that the "to go" estimate keeps going up(!). In the time it took me to write this, it has risen to 1 days 05:40:10. I assume the pool is lost forever if another disk, the controller, or the power fails.

[EDIT] The new drive was resilvered after 4 hours or so, so ZFS's estimate was apparently not very accurate. After detaching the faulty drive, the new drive shows only 103G used of a 1TB disk, just like the DEGRADED drive did. How do I get this to the full 1TB?


Solution 1:

Generally speaking, a DEGRADED disk is in better shape than a FAULTED one.

From the zpool man page (slightly reformatted):

DEGRADED: The number of checksum errors exceeds acceptable levels and the device is degraded as an indication that something may be wrong. ZFS continues to use the device as necessary.

FAULTED: The number of I/O errors exceeds acceptable levels and the device is faulted to prevent further use of the device.

In your specific case, a scrub discovered many read and checksum errors on one disk, and ZFS started to repair the affected disk. Meanwhile, ZED (the ZFS event daemon) noticed the burst of checksum errors and degraded the disk to avoid using/stressing it.
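
To see at a glance which devices have accumulated errors, something like the following could be used. It is only a rough sketch: flag_errors is a hypothetical helper name, and it assumes the column layout (NAME STATE READ WRITE CKSUM) shown in your zpool status output.

```shell
# Hypothetical helper: read `zpool status` output on stdin and print every
# device line whose READ, WRITE or CKSUM counter is not zero.
# Assumes the NAME STATE READ WRITE CKSUM column layout shown in the question.
flag_errors() {
    awk 'NF >= 5 &&
         $2 ~ /^(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)$/ &&
         ($3 != 0 || $4 != 0 || $5 != 0) {
             print $1, $2, "read=" $3, "write=" $4, "cksum=" $5
         }'
}

# Usage: zpool status zdata | flag_errors
```

With the first status output from the question, only the wwn-0x50014ee0ac59cb99-part1 line should be printed.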

When the scrub finishes, I suggest you run zpool clear on your pool and then start another zpool scrub. If the second scrub finds no errors, you can continue using the pool but, considering how many errors you got in the current scrub, I would replace the disk as soon as possible.
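
As a minimal sketch of that sequence, assuming the pool name zdata from your output and a root shell (rescrub is just a hypothetical wrapper name, the three zpool commands are the real part):

```shell
# Hypothetical wrapper around the clear / scrub / status sequence.
rescrub() {
    pool="$1"
    zpool clear "$pool" &&      # reset the error counters on the pool's devices
    zpool scrub "$pool" &&      # start a new scrub (the command returns immediately)
    zpool status -v "$pool"     # show scrub progress and any remaining errors
}

# Usage, once the current scrub has finished: rescrub zdata
```

Re-run zpool status -v zdata after the second scrub completes and check that the READ/WRITE/CKSUM counters are still at 0.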

If you have a valid reason to believe the disk itself is not at fault, you should analyze the dmesg and smartctl --all output to nail down the root cause of the errors. Case in point: I had a disk which was itself fine, but which produced many real errors due to a noisy PSU/cable.
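
A rough sketch of that check follows. check_disk is a hypothetical helper name, and the grep pattern is only my guess at kernel messages that tend to accompany cable or controller trouble, so adjust it to what your dmesg actually shows. The device path in the usage line is assumed from the wwn ID in your status output.

```shell
# Hypothetical helper: show suspicious kernel messages, then the SMART report.
check_disk() {
    dev="$1"
    # Kernel log lines hinting at link or I/O problems (pattern is an assumption)
    dmesg | grep -iE 'i/o error|hard resetting link|link is slow|failed command' | tail -n 20
    # Full SMART report for the device (needs smartmontools and root)
    smartctl --all "$dev"
}

# Usage: check_disk /dev/disk/by-id/wwn-0x50014ee0ac59cb99
```

Link resets in dmesg together with a clean SMART report would point to the cable, controller or PSU rather than the disk itself.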

Anyway, the golden rule always applies: be sure to have an up-to-date backup of your pool data.