ZFS pool status unstable
I've been running a ZFS pool on Ubuntu problem-free for years, currently on 20.04.
Since around the beginning of this year I've had to replace 2 out of 4 disks, and even then the brand-new disks started showing errors.
I started scrubbing the pool weekly and things were fairly stable: 20-50 read and/or write errors would appear on some disks and the scrub would fix them.
A few days ago, however, one disk was faulted for too many errors, then a second one became degraded. Running a scrub made things worse.
I triggered a scrub today, then realized the disks might be too hot, so I shut down the PC to adjust the fans. After starting it again, zpool status shows this:
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Jun 19 18:44:07 2021
        1.51T scanned at 2.74G/s, 1.29T issued at 2.35G/s, 3.04T total
        2.76G resilvered, 42.42% done, 0 days 00:12:44 to go
config:
        NAME                                           STATE     READ WRITE CKSUM
        ztank                                          DEGRADED     0     0     0
          mirror-0                                     DEGRADED     0     0     0
            ata-ST2000LM003_HN-M201RAD_S34RJ9AFB25570  DEGRADED     0     0     0  too many errors
            ata-ST2000LM003_HN-M201RAD_S362J9EGB75740  ONLINE       0     0     0  (resilvering)
          mirror-1                                     ONLINE       0     0     0
            ata-ST2000DM008-2FR102_ZFL3P2SZ            ONLINE       0     0     0
            ata-TOSHIBA_HDWL120_807APRBUT              ONLINE       0     0     0  (resilvering)
        logs
          zfs_slog                                     ONLINE       0     0     0
        cache
          zfs_l2arc                                    ONLINE       0     0     0
errors: No known data errors
I'm really shocked. What's going on?
Solution 1:
Well, it looks like you answered your own question: the disks were too hot, so they started failing. See if the pool can recover from that degraded state.
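To keep an eye on that recovery, something like the following might help. It is only a rough sketch: the function name list_unhealthy is made up for this example, and the awk filter assumes the usual NAME/STATE/READ/WRITE/CKSUM column layout shown in the question. It reads zpool status output and prints every pool, vdev, or disk whose state is not ONLINE.

```shell
# list_unhealthy (hypothetical helper): read `zpool status` output on stdin
# and print the name and state of every config entry that is not ONLINE.
list_unhealthy() {
  awk '
    $1 == "config:" { in_config = 1; next }   # device table starts here
    $1 == "errors:" { in_config = 0 }         # device table ends here
    # Rows with counters have at least 5 fields; skip the header row.
    in_config && NF >= 5 && $1 != "NAME" && $2 != "ONLINE" { print $1, $2 }
  '
}

# Usage (needs the real pool, so shown as comments):
#   zpool status ztank | list_unhealthy
#   zpool clear ztank    # after the resilver finishes: reset error counters
#   zpool scrub ztank    # then re-verify all data on the pool
```

If the list is still non-empty after the resilver, a clear followed by a fresh scrub shows whether the errors come back or were a one-off caused by the heat.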
Also, check your RAM with a full memtest. If the RAM is OK, check the SATA cables too. Check the SMART stats of all disks and run a long self-test (test=long) on each of them via smartctl. And never let your HDDs overheat.
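The SMART part could be scripted roughly like this. Treat it as a sketch, not a tested procedure: smart_checks and the RUN variable are names invented for this example. By default the function only prints the smartctl commands so you can review them; with RUN=1 (and root privileges) it actually runs them.

```shell
# smart_checks DEVICE... (hypothetical helper): for each disk, print the
# smartctl commands for a health check, the attribute table, and a long
# self-test. Set RUN=1 to execute them instead of printing them.
smart_checks() {
  for dev in "$@"; do
    for args in "-H" "-A" "-t long"; do
      if [ "${RUN:-0}" = 1 ]; then
        # $args is deliberately unquoted so "-t long" splits into two words
        smartctl $args "$dev"
      else
        echo "smartctl $args $dev"
      fi
    done
  done
}

# Usage, with one of the disks from the question:
#   smart_checks /dev/disk/by-id/ata-ST2000LM003_HN-M201RAD_S34RJ9AFB25570
#   RUN=1 smart_checks /dev/disk/by-id/ata-ST2000LM003_HN-M201RAD_S34RJ9AFB25570
```

The long test runs in the background on the drive itself and can take hours on a 2 TB disk; smartctl -l selftest on the same device shows the result afterwards. On many drives the temperature shows up in the -A attribute table, which is worth watching while the resilver and the tests are running.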