ZFS pool status unstable

I've been running a ZFS pool on Ubuntu problem-free for years. Currently on 20.04.

Since around the beginning of this year I've had to replace 2 out of 4 disks, and even then the brand new disks started showing errors.

I started scrubbing it weekly and things were kind of stable: 20-50 read and/or write errors would appear on some disks and the scrub would fix them.

A few days ago, however, a disk was faulted for too many errors. Then a second one went degraded. Running a scrub made things worse.

I triggered a scrub today, then realized the disks may be too hot, so I shut down the PC to adjust the fans. After starting it again, zpool status shows this:

 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Jun 19 18:44:07 2021
    1.51T scanned at 2.74G/s, 1.29T issued at 2.35G/s, 3.04T total
    2.76G resilvered, 42.42% done, 0 days 00:12:44 to go
config:

    NAME                                           STATE     READ WRITE CKSUM
    ztank                                          DEGRADED     0     0     0
      mirror-0                                     DEGRADED     0     0     0
        ata-ST2000LM003_HN-M201RAD_S34RJ9AFB25570  DEGRADED     0     0     0  too many errors
        ata-ST2000LM003_HN-M201RAD_S362J9EGB75740  ONLINE       0     0     0  (resilvering)
      mirror-1                                     ONLINE       0     0     0
        ata-ST2000DM008-2FR102_ZFL3P2SZ            ONLINE       0     0     0
        ata-TOSHIBA_HDWL120_807APRBUT              ONLINE       0     0     0  (resilvering)
    logs
      zfs_slog                                     ONLINE       0     0     0
    cache
      zfs_l2arc                                    ONLINE       0     0     0

errors: No known data errors

I'm really shocked. What's going on?


Solution 1:

Well, it looks like you answered it yourself: the disks were too hot, so they started failing. See if you can recover from that degraded state.
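
One possible way to attempt that recovery, as a rough sketch only: it assumes the resilver shown in the question has already finished and the cooling is sorted out. The pool name ztank comes from the question's output; the ZPOOL variable and the recover_pool function name are made up here so the steps can be previewed before running them for real.

```shell
#!/bin/sh
# Pool name taken from the zpool status output in the question.
POOL="ztank"
# ZPOOL is a hypothetical override: set ZPOOL="echo zpool" to preview the commands.
ZPOOL="${ZPOOL:-zpool}"

# Run only after the resilver has completed.
recover_pool() {
    $ZPOOL status -v "$POOL"   # confirm the resilver is done and list any data errors
    $ZPOOL clear "$POOL"       # reset the error counters that caused "too many errors"
    $ZPOOL scrub "$POOL"       # re-read every block and verify it against its checksum
}
```

Call recover_pool as root, then keep an eye on zpool status while the scrub runs. If the error counters start climbing again right away, the disks (or cables) are probably still unhappy and clearing won't help.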

Also, check your RAM. Do a full memtest. If that comes back OK, check the SATA cables too. Check all SMART stats and run a long self-test (--test=long) on every disk via smartctl. And never overheat your HDDs.
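
A minimal sketch of the SMART checks, assuming smartmontools is installed and using the by-id disk names from the zpool status output in the question. The DRY_RUN switch and the function names are just illustrative helpers, not part of smartctl.

```shell
#!/bin/sh
# Disk IDs copied from the zpool status output in the question.
DISKS="ata-ST2000LM003_HN-M201RAD_S34RJ9AFB25570
ata-ST2000LM003_HN-M201RAD_S362J9EGB75740
ata-ST2000DM008-2FR102_ZFL3P2SZ
ata-TOSHIBA_HDWL120_807APRBUT"

# Hypothetical helper: with DRY_RUN=1, print the command instead of running it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Start a long (extended) self-test on every disk.
# The test runs inside the drive, so the command returns immediately.
start_long_tests() {
    for d in $DISKS; do
        run smartctl -t long "/dev/disk/by-id/$d"
    done
}

# Print all SMART info per disk: attributes (temperature,
# reallocated/pending sectors) plus the self-test log.
show_smart_reports() {
    for d in $DISKS; do
        run smartctl -a "/dev/disk/by-id/$d"
    done
}
```

Run start_long_tests as root, wait for the estimated duration that smartctl prints (several hours is normal for 2 TB disks), then run show_smart_reports and look at the temperature, reallocated and pending sector counts, and the self-test results.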