ZFS checksum errors on Solaris 11 under KVM
Synopsis: libvirt 5.6.0, QEMU 4.1.1, Linux kernel 5.5.10-200, Fedora Server 31.
Solaris 11.4 fresh install (with Solaris 10 branded zones), raw disk images on XFS (unfortunately, there is no possibility of switching to ZFS on Linux and passing a ZVOL through to the VM). When I copy a large gzipped file onto a ZFS dataset in the Solaris VM, the zpool reports checksum errors, and when I gunzip the file, the result is corrupted.
At first the Solaris VM was hosted on qcow2 virtual disks. I thought that CoW on CoW was probably a bad idea, so I switched to raw. Nothing really changed.
Any ideas? I'm out of them. The Solaris 11.4 datasets themselves aren't corrupted. I also run FreeBSD with ZFS successfully on similar setups under KVM (using ZVOLs there, but still on Linux) with no checksum errors.
Pristine pool:
  pool: oracle
 state: ONLINE
  scan: scrub repaired 0 in 28s with 0 errors on Mon Mar 22 09:58:30 2021
config:

        NAME      STATE     READ WRITE CKSUM
        oracle    ONLINE       0     0     0
          c3d0    ONLINE       0     0     0

errors: No known data errors
Copying the file:
[root@s10-zone ~]# cd /opt/oracle/exchange/
[root@s10-zone exchange]# scp [email protected]:/Backup/oracle/expdp/lcomsys.dmp.gz .
Password:
lcomsys.dmp.gz 100% |*********************************************************************| 27341 MB 2:23:09
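One way to tell whether the file was damaged in transit or on disk is to compare a digest taken on the source host with one taken inside the zone, and to test the gzip stream without writing the decompressed output anywhere. A minimal Python sketch of that check follows. The helper names `sha256_of` and `gzip_is_intact` are made up for illustration, and it assumes a Python 3 interpreter is available on both machines.

```python
import gzip
import hashlib
import zlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def gzip_is_intact(path, chunk_size=1 << 20):
    """Decompress the whole stream and discard it; False if it is damaged."""
    try:
        with gzip.open(path, "rb") as fh:
            while fh.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # Bad header or CRC, truncated stream, or corrupt deflate data.
        return False
    return True
```

If the digests match on both ends right after the copy and a later scrub still reports errors, that would point at the storage path rather than at scp. On a pool that already reports permanent errors, the read itself may fail with an I/O error instead of returning bad data.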
Ran a scrub after the copying was finished:
  pool: oracle
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://support.oracle.com/msg/ZFS-8000-8A
  scan: scrub repaired 6.50K in 5m16s with 3 errors on Tue Mar 23 09:36:34 2021
config:

        NAME      STATE     READ WRITE CKSUM
        oracle    ONLINE       0     0     3
          c3d0    ONLINE       0     0    10

errors: Permanent errors have been detected in the following files:

        /system/zones/s10-zone/root/opt/oracle/exchange/lcomsys.dmp.gz
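When repeating the copy and scrub several times, it can help to pull the CKSUM counters out of the `zpool status` text automatically. The following is only a sketch: the function name `cksum_errors` is hypothetical, and the pattern assumes plain integer counters, so a counter printed in abbreviated form would not be picked up.

```python
import re

# Matches a config row such as "c3d0  ONLINE  0  0  10".
ROW = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def cksum_errors(status_text):
    """Map each pool/vdev name in `zpool status` output to its CKSUM counter."""
    counts = {}
    for line in status_text.splitlines():
        match = ROW.match(line)
        if match:
            name, _state, _read, _write, cksum = match.groups()
            counts[name] = int(cksum)
    return counts


# Rows taken from the scrub report above.
sample = """\
config:
NAME STATE READ WRITE CKSUM
oracle ONLINE 0 0 3
c3d0 ONLINE 0 0 10
"""

print(cksum_errors(sample))
```

The header row is skipped because its READ/WRITE/CKSUM columns are not numeric.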
This is how the Solaris virtual disks are attached:
<disk type='file' device='disk'>
  <driver name='qemu' type='raw' cache='none'/>
  <source file='/var/vms/disks/solaris11.img'/>
  <backingStore/>
  <target dev='sda' bus='sata'/>
  <address type='drive' controller='0' bus='0' target='0' unit='0'/>
</disk>
<disk type='file' device='cdrom'>
  <driver name='qemu' type='raw'/>
  <source file='/var/vms/iso/sol-11_4-text-x86.iso'/>
  <backingStore/>
  <target dev='hda' bus='ide'/>
  <readonly/>
  <address type='drive' controller='0' bus='0' target='0' unit='0'/>
</disk>
<disk type='file' device='disk'>
  <driver name='qemu' type='raw' cache='none'/>
  <source file='/var/vms/disks/solaris10-data.img'/>
  <backingStore/>
  <target dev='hdb' bus='ide'/>
  <address type='drive' controller='0' bus='0' target='0' unit='1'/>
</disk>
<disk type='file' device='disk'>
  <driver name='qemu' type='raw' cache='none'/>
  <source file='/var/vms/disks/solaris11-data.img'/>
  <backingStore/>
  <target dev='hdc' bus='ide'/>
  <address type='drive' controller='0' bus='1' target='0' unit='0'/>
</disk>
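To see at a glance which image sits on which emulated bus, the domain XML (for example the output of `virsh dumpxml` for the VM) could be summarized with a few lines of Python. This is a sketch under assumptions: `disk_buses` is a hypothetical helper, and the sample is a trimmed copy of two of the definitions above, wrapped in a root element so that it parses.

```python
import xml.etree.ElementTree as ET


def disk_buses(domain_xml):
    """Return (source file, target dev, bus) for every file-backed <disk>."""
    root = ET.fromstring(domain_xml)
    rows = []
    for disk in root.iter("disk"):
        source = disk.find("source")
        target = disk.find("target")
        if source is None or target is None:
            continue  # e.g. an empty CD-ROM drive with no media
        rows.append((source.get("file"), target.get("dev"), target.get("bus")))
    return rows


# Trimmed sample: the system disk on SATA and one data disk on IDE.
sample = """
<devices>
  <disk type='file' device='disk'>
    <source file='/var/vms/disks/solaris11.img'/>
    <target dev='sda' bus='sata'/>
  </disk>
  <disk type='file' device='disk'>
    <source file='/var/vms/disks/solaris10-data.img'/>
    <target dev='hdb' bus='ide'/>
  </disk>
</devices>
"""

for path, dev, bus in disk_buses(sample):
    print(f"{dev:4} {bus:5} {path}")
```

Run against the full definition, a listing like this makes it obvious that the system disk is on SATA while both data disks are on IDE.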
Weird, but since the rpool (on the SATA-attached disk) never became corrupted, I changed the data disk definitions for the VM from IDE to SATA:
<disk type='file' device='disk'>
  <driver name='qemu' type='raw' cache='none'/>
  <source file='/var/vms/disks/solaris10-data.img'/>
  <backingStore/>
  <target dev='sdb' bus='sata'/>
  <address type='drive' controller='1' bus='0' target='0' unit='0'/>
</disk>
<disk type='file' device='disk'>
  <driver name='qemu' type='raw' cache='none'/>
  <source file='/var/vms/disks/solaris11-data.img'/>
  <backingStore/>
  <target dev='sdc' bus='sata'/>
  <address type='drive' controller='2' bus='0' target='0' unit='0'/>
</disk>
And the ZFS checksum errors magically stopped.