Why is ZFS so much slower than ext4 and btrfs?
Problem
I recently installed a new disk and created a zpool on it:
/# zpool create morez /dev/sdb
After using it for a while, I noticed it was quite slow:
/morez# fio --name rw --rw rw --size 10G
read: IOPS=19.6k, BW=76.6MiB/s (80.3MB/s)(5120MiB/66834msec)
write: IOPS=19.6k, BW=76.6MiB/s (80.3MB/s)(5120MiB/66834msec)
This test is fairly similar to my actual use case. I am reading a moderate number (~10k) of images (~2 MiB each) from disk. They were written all at once when the disk was mostly empty, so I do not expect them to be fragmented.
For comparison, I tested out ext4:
/# gdisk /dev/sdb
...
/# mkfs.ext4 -F /dev/sdb1 && mount /dev/sdb1 /mnt && cd /mnt
/mnt# fio --name rw --rw rw --size 10G
read: IOPS=48.3k, BW=189MiB/s (198MB/s)(5120MiB/27135msec)
write: IOPS=48.3k, BW=189MiB/s (198MB/s)(5120MiB/27135msec)
And btrfs:
/# mkfs.btrfs -f /dev/sdb1 && mount /dev/sdb1 /mnt && cd /mnt
/mnt# fio --name rw --rw rw --size 10G
read: IOPS=51.3k, BW=201MiB/s (210MB/s)(5120MiB/25528msec)
write: IOPS=51.3k, BW=201MiB/s (210MB/s)(5120MiB/25528msec)
What might be causing the performance issues with ZFS and how can I make it faster?
Failed attempt at a solution
I also tried explicitly setting the sector size for the zpool, as my disk (Seagate ST1000DM003) uses 4096-byte physical sectors:
/# zpool create -o ashift=12 morez /dev/sdb
This did not improve the performance:
/morez# fio --name rw --rw rw --size 10G
read: IOPS=21.3k, BW=83.2MiB/s (87.2MB/s)(5120MiB/61573msec)
write: IOPS=21.3k, BW=83.2MiB/s (87.2MB/s)(5120MiB/61573msec)
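For reference, the drive's reported sector sizes and the ashift actually in effect can be inspected directly (PHY-SEC and LOG-SEC are standard lsblk columns):
/# lsblk -o NAME,PHY-SEC,LOG-SEC /dev/sdb
/# zpool get ashift morez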
Observation
Strangely, using a zvol had great performance:
/# zfs create -V 20G morez/vol
/# fio --name rw --filename /dev/zvol/morez/vol --rw rw --size 10G
read: IOPS=52.7k, BW=206MiB/s (216MB/s)(5120MiB/24852msec)
write: IOPS=52.7k, BW=206MiB/s (216MB/s)(5120MiB/24852msec)
Why does this only impact ZFS filesystems and not zvols?
Extended testing for btrfs
In the comments, it was suggested that the difference might be due to caching. After further testing, I do not believe this is the case: I increased the size of the btrfs test well beyond the amount of memory my computer has, and its performance was still significantly greater than that of ZFS:
/# mkfs.btrfs -f /dev/sdb1 && mount /dev/sdb1 /mnt && cd /mnt
/mnt# fio --name rw --rw rw --size 500G --runtime 3600 --time_based --ramp_time 900
read: IOPS=41.9k, BW=164MiB/s (172MB/s)(576GiB/3600003msec)
write: IOPS=41.9k, BW=164MiB/s (172MB/s)(576GiB/3600003msec)
System info
Software
- Arch Linux, kernel version 4.11.6
- ZFS on Linux 0.6.5.10
- fio 2.21
Hardware
- Drive being tested: Seagate ST1000DM003, connected to 6Gb/s SATA port
- Motherboard: Gigabyte X99-SLI
- Memory: 8 GiB
ZFS info
Here is what the ZFS properties looked like before running fio. These are just the result of creating a zpool with the default settings.
# zpool get all morez
NAME PROPERTY VALUE SOURCE
morez size 928G -
morez capacity 0% -
morez altroot - default
morez health ONLINE -
morez guid [removed] default
morez version - default
morez bootfs - default
morez delegation on default
morez autoreplace off default
morez cachefile - default
morez failmode wait default
morez listsnapshots off default
morez autoexpand off default
morez dedupditto 0 default
morez dedupratio 1.00x -
morez free 928G -
morez allocated 276K -
morez readonly off -
morez ashift 0 default
morez comment - default
morez expandsize - -
morez freeing 0 default
morez fragmentation 0% -
morez leaked 0 default
morez feature@async_destroy enabled local
morez feature@empty_bpobj enabled local
morez feature@lz4_compress active local
morez feature@spacemap_histogram active local
morez feature@enabled_txg active local
morez feature@hole_birth active local
morez feature@extensible_dataset enabled local
morez feature@embedded_data active local
morez feature@bookmarks enabled local
morez feature@filesystem_limits enabled local
morez feature@large_blocks enabled local
# zfs get all morez
NAME PROPERTY VALUE SOURCE
morez type filesystem -
morez creation Thu Jun 29 19:34 2017 -
morez used 240K -
morez available 899G -
morez referenced 96K -
morez compressratio 1.00x -
morez mounted yes -
morez quota none default
morez reservation none default
morez recordsize 128K default
morez mountpoint /morez default
morez sharenfs off default
morez checksum on default
morez compression off default
morez atime on default
morez devices on default
morez exec on default
morez setuid on default
morez readonly off default
morez zoned off default
morez snapdir hidden default
morez aclinherit restricted default
morez canmount on default
morez xattr on default
morez copies 1 default
morez version 5 -
morez utf8only off -
morez normalization none -
morez casesensitivity sensitive -
morez vscan off default
morez nbmand off default
morez sharesmb off default
morez refquota none default
morez refreservation none default
morez primarycache all default
morez secondarycache all default
morez usedbysnapshots 0 -
morez usedbydataset 96K -
morez usedbychildren 144K -
morez usedbyrefreservation 0 -
morez logbias latency default
morez dedup off default
morez mlslabel none default
morez sync standard default
morez refcompressratio 1.00x -
morez written 96K -
morez logicalused 72.5K -
morez logicalreferenced 40K -
morez filesystem_limit none default
morez snapshot_limit none default
morez filesystem_count none default
morez snapshot_count none default
morez snapdev hidden default
morez acltype off default
morez context none default
morez fscontext none default
morez defcontext none default
morez rootcontext none default
morez relatime off default
morez redundant_metadata all default
morez overlay off default
Answer
While old, I feel this question deserves an answer.
By default, fio issues 4 KB I/Os; ZFS datasets, instead, use a 128 KB recordsize by default. This mismatch means that each 4K write causes a read/modify/write of the entire 128K record.
ZVOLs, on the other hand, use an 8K volblocksize by default. This means that a 4K write causes a much smaller read/modify/write cycle on an 8K block and, with some luck, two 4K writes can be coalesced into a single 8K write (which requires no read/modify/write at all).
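For illustration, the default can be confirmed on the zvol from the question, and a different value can only be chosen at creation time (morez/vol2 below is a hypothetical new zvol, and 16K is just an example value):
/# zfs get volblocksize morez/vol
/# zfs create -V 20G -o volblocksize=16K morez/vol2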
A ZFS dataset's recordsize can be changed with zfs set recordsize=8K <dataset>, and in this case it should give more-or-less equivalent performance to ZVOLs. However, for relatively big transfers (the OP talks about 2 MB files which, being images, should be read in their entirety each time they are accessed) it is better to have a large recordsize/volblocksize, sometimes even larger than the default setting (128K).
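As a concrete sketch (the dataset name matches the OP's pool; note that recordsize only applies to files written after the change, so existing files must be rewritten to benefit):
/# zfs set recordsize=8K morez
/morez# fio --name rw --rw rw --size 10G
For the whole-file image reads described in the question, a larger value such as recordsize=1M may serve better (this relies on the large_blocks pool feature, which the OP's zpool get output shows as enabled).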
Note: as the fio job lacks direct=1 (http://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-direct), some amount of the I/O being performed (both reads and writes) may be cached by the operating system, distorting your results (and making the numbers artificially high). This is further complicated by the following:
- ZFS on Linux either doesn't support O_DIRECT (so the open fails) or, if it does, it does so by quietly falling back to buffered I/O (see point 3 of https://github.com/zfsonlinux/zfs/commit/a584ef26053065f486d46a7335bea222cb03eeea ).
- In some cases, btrfs and ext4 will make O_DIRECT fall back to buffered I/O.
Be aware that O_DIRECT falling back to buffered I/O is allowed because, on Linux, O_DIRECT is more of a hint (see the references section of https://stackoverflow.com/a/46377629/2732969 ).
If you are in a situation where you can't correctly bypass caches, it is crucial to do enough I/O over a large enough area to minimize the impact of caching (unless, of course, you actually want to test caching).
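For example, a cache-minimizing run might look like the following (a sketch: --direct=1 only helps on filesystems that honor O_DIRECT, and the 100G size is an arbitrary figure chosen to far exceed the OP's 8 GiB of RAM):
/mnt# fio --name rw --rw rw --size 10G --direct=1
/morez# fio --name rw --rw rw --size 100G --runtime 3600 --time_based
On the ZFS on Linux release in question, only the second approach is reliable; it is essentially what the OP's extended 500G btrfs test already does.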