Why is ZFS not doing anything with my disk's duff sector?

I was under the impression that if an I/O error occurs during a read from a ZFS pool, two things will happen:

  1. The failure will be recorded in either the READ or CKSUM statistic of the relevant device, propagating upwards toward the pool level.
  2. Either:
    • Redundant data will be used to reconstruct the requested block, return the requested block to the caller and, if the duff drive is still functional, rewrite the block to it, OR
    • An I/O error will be returned if redundant data is not available to correct for the read error.

It appears that one of the disks in my mirror setup has developed a bad sector. That by itself is not alarming; such things happen, and that's exactly why I have redundancy (a two-way mirror, to be exact). Every time I scrub the pool or read through the files in a particular directory (I haven't bothered yet to determine exactly which file is at fault), the following pops up in dmesg, obviously with varying timestamps:

Nov  1 09:54:26 yeono kernel: [302621.236549] ata6.00: exception Emask 0x0 SAct 0x9c10 SErr 0x0 action 0x0
Nov  1 09:54:26 yeono kernel: [302621.236557] ata6.00: irq_stat 0x40000008
Nov  1 09:54:26 yeono kernel: [302621.236566] ata6.00: failed command: READ FPDMA QUEUED
Nov  1 09:54:26 yeono kernel: [302621.236578] ata6.00: cmd 60/a8:78:18:5a:12/00:00:5c:01:00/40 tag 15 ncq 86016 in
Nov  1 09:54:26 yeono kernel: [302621.236580]          res 41/40:a8:18:5a:12/00:00:5c:01:00/00 Emask 0x409 (media error) <F>
Nov  1 09:54:26 yeono kernel: [302621.236585] ata6.00: status: { DRDY ERR }
Nov  1 09:54:26 yeono kernel: [302621.236589] ata6.00: error: { UNC }
Nov  1 09:54:26 yeono kernel: [302621.238214] ata6.00: configured for UDMA/133

This is a fairly up-to-date Debian Wheezy, kernel 3.2.0-4-amd64 #1 SMP Debian 3.2.63-2 x86_64, ZoL 0.6.3. Package versions are current at debian-zfs=7~wheezy, libzfs2=0.6.3-1~wheezy, zfs-dkms=0.6.3-1~wheezy, zfs-initramfs=0.6.3-1~wheezy, zfsutils=0.6.3-1~wheezy, zfsonlinux=3~wheezy, linux-image-amd64=3.2+46, linux-image-3.2.0-4-amd64=3.2.63-2. The only package pinning that I know of is for ZoL, for which I have (as provided by the zfsonlinux package):

Package: *
Pin: release o=archive.zfsonlinux.org
Pin-Priority: 1001

Running hdparm -R on the drive reports that Write-Read-Verify is turned on (this is a Seagate drive, so it has that feature, and I use it as an extra safety net; the additional write latency is not a problem since my interactive use pattern is very read-heavy):

/dev/disk/by-id/ata-ST4000NM0033-9ZM170_XXXXXXXX:
 write-read-verify =  2
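
For completeness, the state above came from hdparm's -R switch; on hdparm 9.x this is the get/set flag for Write-Read-Verify, and -R1 should enable it on drives that support the feature (treat the enable form as an assumption and check your hdparm man page before changing feature settings):

hdparm -R  /dev/disk/by-id/ata-ST4000NM0033-9ZM170_XXXXXXXX   # query current Write-Read-Verify state
hdparm -R1 /dev/disk/by-id/ata-ST4000NM0033-9ZM170_XXXXXXXX   # enable it (assumed -R1 = enable on hdparm 9.x)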

Even given the clear indication that something is amiss, zpool status claims that there is no problem with the pool:

  pool: akita
 state: ONLINE
  scan: scrub repaired 0 in 8h16m with 0 errors on Sat Nov  1 10:46:03 2014
config:

        NAME                        STATE     READ WRITE CKSUM
        akita                       ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            wwn-0x5000c50065e8414a  ONLINE       0     0     0
            wwn-0x5000c500645b0fec  ONLINE       0     0     0

errors: No known data errors

This error has been showing up in the logs regularly for the last several days (since Oct 27), so I'm not terribly inclined to write it off as merely a fluke. I run the disks with quite short SCTERC timeouts: 1.5 seconds read (to recover quickly from read errors) and 10 seconds write. I have confirmed that these values are active on the drive in question.
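
For reference, those SCTERC values can be checked and set with smartctl; the times are given in tenths of a second, so 15/100 corresponds to the 1.5-second read and 10-second write timeouts mentioned above:

smartctl -l scterc /dev/disk/by-id/ata-ST4000NM0033-9ZM170_XXXXXXXX          # show current ERC settings
smartctl -l scterc,15,100 /dev/disk/by-id/ata-ST4000NM0033-9ZM170_XXXXXXXX   # set read/write ERC to 1.5 s / 10 s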

smartd keeps pestering me (which in itself is a good thing!) about the fact that the ATA error count is climbing:

The following warning/error was logged by the smartd daemon:

Device: /dev/disk/by-id/ata-ST4000NM0033-9ZM170_XXXXXXXX [SAT], ATA error count increased from 4 to 5

For details see host's SYSLOG.
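
The per-incident details behind that counter live in the drive's own ATA error log, which smartctl can dump (plain -l error here; -l xerror shows the extended log on drives that support it):

smartctl -l error /dev/disk/by-id/ata-ST4000NM0033-9ZM170_XXXXXXXX   # drive's ATA error log, one entry per reported error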

Running smartctl --attributes on the drive in question yields the following:

smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   076   063   044    Pre-fail  Always       -       48910012
  3 Spin_Up_Time            0x0003   091   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       97
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   092   060   030    Pre-fail  Always       -       1698336160
  9 Power_On_Hours          0x0032   089   089   000    Old_age   Always       -       9887
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       98
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   095   095   000    Old_age   Always       -       5
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       10
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   058   052   045    Old_age   Always       -       42 (Min/Max 20/45)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       61
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       492
194 Temperature_Celsius     0x0022   042   048   000    Old_age   Always       -       42 (0 11 0 0)
195 Hardware_ECC_Recovered  0x001a   052   008   000    Old_age   Always       -       48910012
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

Nothing glaringly out of the ordinary there. Note that this is an enterprise drive, so it carries a five-year warranty and is rated for 24x7 operation (meaning it's meant to be reliable for over 40,000 hours of operation, compared to the just under 10,000 hours under its belt so far). Notice the 5 in attribute 187 Reported_Uncorrect; that's where the problem is. Also note the fairly low Start_Stop_Count and Power_Cycle_Count values of under 100 each.

Not that I think it's relevant in this case, but yes, the system does have ECC RAM.

Non-default properties of the root file system on the pool are:

NAME   PROPERTY              VALUE                  SOURCE
akita  type                  filesystem             -
akita  creation              Thu Sep 12 18:03 2013  -
akita  used                  3,14T                  -
akita  available             434G                   -
akita  referenced            136K                   -
akita  compressratio         1.04x                  -
akita  mounted               no                     -
akita  mountpoint            none                   local
akita  version               5                      -
akita  utf8only              off                    -
akita  normalization         none                   -
akita  casesensitivity       sensitive              -
akita  usedbysnapshots       0                      -
akita  usedbydataset         136K                   -
akita  usedbychildren        3,14T                  -
akita  usedbyrefreservation  0                      -
akita  sync                  standard               local
akita  refcompressratio      1.00x                  -
akita  written               0                      -
akita  logicalused           2,32T                  -
akita  logicalreferenced     15K                    -

and correspondingly for the pool itself:

NAME   PROPERTY               VALUE                  SOURCE
akita  size                   3,62T                  -
akita  capacity               62%                    -
akita  health                 ONLINE                 -
akita  dedupratio             1.00x                  -
akita  free                   1,36T                  -
akita  allocated              2,27T                  -
akita  readonly               off                    -
akita  ashift                 12                     local
akita  expandsize             0                      -
akita  feature@async_destroy  enabled                local
akita  feature@empty_bpobj    active                 local
akita  feature@lz4_compress   active                 local

These lists were obtained by running {zfs,zpool} get all akita | grep -v default.
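
Spelled out, that brace expansion amounts to:

zfs get all akita | grep -v default
zpool get all akita | grep -v default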

Now for the questions:

  1. Why isn't ZFS reporting anything about the read problem? It's clearly recovering from it.

  2. Why isn't ZFS automatically rewriting the duff sector that the drive is clearly having trouble reading, in turn hopefully triggering a relocation by the drive, given that sufficient redundancy exists for automatic repair in the read request path?


I suspect the ATA driver is retrying the read operation a couple of times when it receives an error before passing the error back to the filesystem driver.

What this means is that by the time the ZFS filesystem driver gets the result of the read, the data is all there and correct, but it likely took a bit longer than normal to arrive. There is, of course, no error counter for higher-than-average latency, so nothing gets logged.

The fact that the SMART value for Reported_Uncorrect is not 0 makes me suspect that the cause of the failure is the disk itself, as opposed to, say, a flaky SATA cable or SATA controller.

If that is the case, the disk will most likely deteriorate further and eventually start failing reads even after however many retries the block device driver attempts. As such, my advice would be to replace the disk.
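
If you do go down that route, the swap itself is a one-liner; this is just a sketch, where OLD_DISK stands for whichever of the mirror's wwn-* devices maps to the failing drive and NEW_DISK is a placeholder for the replacement's /dev/disk/by-id path (let the resilver finish before removing the old drive):

zpool replace akita OLD_DISK /dev/disk/by-id/NEW_DISK
zpool status akita   # watch the resilver progress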

Triggering a long SMART test would likely fail on the affected blocks. If you wanted to make the error go away, rewriting those blocks (with dd, for example) should cause the disk to swap out those sectors, but in my experience, once a drive starts to go it's better to just replace it and be done with it.
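
A rough sketch of both options, purely illustrative: BAD_LBA is a placeholder for the failing LBA reported by the self-test log or the kernel error, the drive is assumed to present 512-byte logical sectors, and the dd line is destructive to whatever ZFS data lives on that copy (the mirror should repair it on the next scrub):

smartctl -t long     /dev/disk/by-id/ata-ST4000NM0033-9ZM170_XXXXXXXX   # start an extended self-test
smartctl -l selftest /dev/disk/by-id/ata-ST4000NM0033-9ZM170_XXXXXXXX   # check results (lists the first failing LBA, if any)

# Round BAD_LBA down to a 4 KiB boundary and rewrite 8 logical sectors so the
# whole physical sector is covered even on a 4K/512e drive:
dd if=/dev/zero of=/dev/disk/by-id/ata-ST4000NM0033-9ZM170_XXXXXXXX \
   bs=512 seek=$(( BAD_LBA / 8 * 8 )) count=8 oflag=direct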