What does a permanent ZFS error indicate?

Several permanent errors were reported on my zpool today.

  pool: seagate3tb
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        seagate3tb  ONLINE       0     0    28
          sda       ONLINE       0     0    56

errors: Permanent errors have been detected in the following files:

        /mnt/seagate3tb/Install.iso
        /mnt/seagate3tb/some-other-file1.txt
        /mnt/seagate3tb/some-other-file2.txt

Edit: I'm not sure if those CKSUM values are accurate. I was redacting data and may have mangled them by mistake; they may have been 0. Unfortunately, I can't find a conclusive answer in my notes, and the errors are resolved now, so I can't verify. Everything else accurately reflects what zpool was reporting.

/mnt/seagate3tb/Install.iso is one example file reported as having a permanent error.

Here's where I get confused. If I compare my "permanently errored" Install.iso against a backup of that exact same file on another filesystem, they look identical.

$ shasum "/mnt/seagate3tb/Install.iso"
1ade72fe65902b2a978e5504aaebf9a3a08bc328  /mnt/seagate3tb/Install.iso
$ shasum "/mnt/backup/Install.iso"
1ade72fe65902b2a978e5504aaebf9a3a08bc328  /mnt/backup/Install.iso
$ cmp /mnt/seagate3tb/Install.iso /mnt/backup/Install.iso
$ diff /mnt/seagate3tb/Install.iso /mnt/backup/Install.iso

The files appear to be identical: cmp and diff print nothing when their inputs match byte for byte. What's more, the file works perfectly fine. If I use it in an application, it behaves exactly as I'd expect.

As the docs state:

Data corruption errors are always fatal.

But based on my rudimentary file verifications, I'm not sure I understand the definition of fatal.

status: One or more devices has experienced an error resulting in data corruption. Applications may be affected.

action: Restore the file in question if possible. Otherwise restore the entire pool from backup.

Maybe I'm missing something, but the file seems perfectly fine as far as I can tell: it doesn't need any restoration, nor does it show any corruption, despite the recommendation from ZFS.

I've seen other articles with the same error, but I have yet to find an answer to my question.

What is the permanent error with the file? Is there some lower level issue with the file that's just not readily apparent to me? If so, why would that not be detected by a shasum as a difference in the file?

From a layperson's perspective, I see nothing to indicate any error with this file.


Solution 1:

The wording of zpool status is a bit misleading. A permanent error (in this context) indicates that an I/O error has occurred and has been logged to the SPA (Storage Pool Allocator) error log for that pool. This does not necessarily mean there is irrecoverable data corruption.

What you should do is run a zpool scrub on the pool. When the scrub completes, the SPA error log will be rotated and will no longer show errors from before the scrub. If the scrub detects no errors then zpool status will no longer show any "permanent" errors.
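For example (using the pool name from the question; depending on your ZFS version, entries from before the scrub may linger until a second scrub or a zpool clear):

    # start a scrub and check on its progress / results
    zpool scrub seagate3tb
    zpool status -v seagate3tb

    # if stale entries remain after a clean scrub, reset the error log and counters
    zpool clear seagate3tb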

Regarding the documentation, it is saying that only "fatal errors" are logged in this way. A fatal error is an I/O error that could not be automatically corrected by ZFS and therefore was exposed to an application as a failed I/O. By contrast, if the I/O was immediately retried successfully or if the logical I/O was satisfied from a redundant device, it would not be considered a fatal error and therefore would not be logged as a data corruption error.
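One quick way to check whether such an error is still being exposed to applications is to force a full re-read of the affected file (a minimal sketch using standard tools; the path is the one from the question):

    # re-read the whole file; if a bad block is still unreadable and unrepairable,
    # dd aborts with an input/output error instead of completing
    dd if=/mnt/seagate3tb/Install.iso of=/dev/null bs=1M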

A fatal error does not necessarily mean permanent data loss; it just means that, at the time, the error could not be fixed before it propagated up to the application. For example, a loose cable or a bad controller could cause temporary fatal errors which ZFS would describe as "permanent." Whether it truly is a problem depends on the nature of the I/O and on whether the application is capable of recovering from I/O errors.
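If you suspect a transient cause like cabling or a controller, the kernel log is worth checking; a rough sketch (device names and message text vary by hardware and driver):

    # look for transport-level errors around the time ZFS logged the problem
    dmesg | grep -iE 'sda|ata[0-9]|i/o error'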

EDIT: Fully agree with @bahamat that you should invest in redundancy as soon as possible.

Solution 2:

A permanent error means that there has been a checksum error in the file and there were not sufficient replicas to repair it. It means that at least one read returned corrupted data due to an I/O error. If whatever issued that read had then written the data back to the same file, you would now have irrecoverable data corruption.

Looking at your pool configuration, it appears you have no redundancy. This is very dangerous: you get none of the self-healing benefits of ZFS, although it can still tell you when data corruption has occurred. Ordinarily ZFS would automatically and silently correct a corrupted read, but in your case it can't. It also looks like you may have already run zpool clear, if, as your edit suggests, the CKSUM counts were actually 0.
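If you can add a second disk, one straightforward fix is to attach it as a mirror of the existing device. A minimal sketch, assuming a spare disk at /dev/sdb (that device name is a placeholder):

    # convert the single-disk vdev into a two-way mirror; ZFS resilvers
    # the existing data onto the new disk automatically
    zpool attach seagate3tb sda /dev/sdb

    # watch the resilver; once it completes, checksum errors become self-healable
    zpool status seagate3tb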

Unfortunately, with no replicas, there's really no way to know for certain what happened, and nothing for ZFS to repair the data from.