Why is ZFS so much slower than ext4 and btrfs?
Problem
I recently installed a new disk and created a zpool on it:
/# zpool create morez /dev/sdb
After using it for a while, I noticed it was quite slow:
/morez# fio --name rw --rw rw --size 10G
read: IOPS=19.6k, BW=76.6MiB/s (80.3MB/s)(5120MiB/66834msec)
write: IOPS=19.6k, BW=76.6MiB/s (80.3MB/s)(5120MiB/66834msec)
This test is fairly similar to my actual use case. I am reading a moderate number (~10k) of images (~2 MiB each) from disk. They were written all at once when the disk was mostly empty, so I do not expect them to be fragmented.
For comparison, I tested out ext4:
/# gdisk /dev/sdb
...
/# mkfs.ext4 -F /dev/sdb1 && mount /dev/sdb1 /mnt && cd /mnt
/mnt# fio --name rw --rw rw --size 10G
read: IOPS=48.3k, BW=189MiB/s (198MB/s)(5120MiB/27135msec)
write: IOPS=48.3k, BW=189MiB/s (198MB/s)(5120MiB/27135msec)
And btrfs:
/# mkfs.btrfs -f /dev/sdb1 && mount /dev/sdb1 /mnt && cd /mnt
/mnt# fio --name rw --rw rw --size 10G
read: IOPS=51.3k, BW=201MiB/s (210MB/s)(5120MiB/25528msec)
write: IOPS=51.3k, BW=201MiB/s (210MB/s)(5120MiB/25528msec)
What might be causing the performance issues with ZFS and how can I make it faster?
Failed attempt at a solution
I also tried explicitly setting the sector size for the zpool, as my disk (Seagate ST1000DM003) uses 4096-byte physical sectors:
/# zpool create -o ashift=12 morez /dev/sdb
This did not improve the performance:
/morez# fio --name rw --rw rw --size 10G
read: IOPS=21.3k, BW=83.2MiB/s (87.2MB/s)(5120MiB/61573msec)
write: IOPS=21.3k, BW=83.2MiB/s (87.2MB/s)(5120MiB/61573msec)
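For reference, the drive's reported sector sizes and the ashift actually in effect can be inspected directly (PHY-SEC and LOG-SEC are standard lsblk columns):
/# lsblk -o NAME,PHY-SEC,LOG-SEC /dev/sdb
/# zpool get ashift morez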
Observation
Strangely, using a zvol had great performance:
/# zfs create -V 20G morez/vol
/# fio --name rw --filename /dev/zvol/morez/vol --rw rw --size 10G
read: IOPS=52.7k, BW=206MiB/s (216MB/s)(5120MiB/24852msec)
write: IOPS=52.7k, BW=206MiB/s (216MB/s)(5120MiB/24852msec)
Why does this only impact ZFS filesystems and not zvols?
Extended testing for btrfs
In the comments, it was suggested that the difference might be due to caching. After further testing, I do not believe this is the case: I increased the size of the btrfs test well beyond the amount of memory my computer has, and its performance was still significantly greater than that of ZFS:
/# mkfs.btrfs -f /dev/sdb1 && mount /dev/sdb1 /mnt && cd /mnt
/mnt# fio --name rw --rw rw --size 500G --runtime 3600 --time_based --ramp_time 900
read: IOPS=41.9k, BW=164MiB/s (172MB/s)(576GiB/3600003msec)
write: IOPS=41.9k, BW=164MiB/s (172MB/s)(576GiB/3600003msec)
System info
Software
- Arch Linux, kernel version 4.11.6
- ZFS on Linux 0.6.5.10
- fio 2.21
Hardware
- Drive being tested: Seagate ST1000DM003, connected to 6Gb/s SATA port
- Motherboard: Gigabyte X99-SLI
- Memory: 8 GiB
ZFS info
Here is what the ZFS properties looked like before running fio. These are just the result of creating a zpool with the default settings.
# zpool get all morez
NAME PROPERTY VALUE SOURCE
morez size 928G -
morez capacity 0% -
morez altroot - default
morez health ONLINE -
morez guid [removed] default
morez version - default
morez bootfs - default
morez delegation on default
morez autoreplace off default
morez cachefile - default
morez failmode wait default
morez listsnapshots off default
morez autoexpand off default
morez dedupditto 0 default
morez dedupratio 1.00x -
morez free 928G -
morez allocated 276K -
morez readonly off -
morez ashift 0 default
morez comment - default
morez expandsize - -
morez freeing 0 default
morez fragmentation 0% -
morez leaked 0 default
morez feature@async_destroy enabled local
morez feature@empty_bpobj enabled local
morez feature@lz4_compress active local
morez feature@spacemap_histogram active local
morez feature@enabled_txg active local
morez feature@hole_birth active local
morez feature@extensible_dataset enabled local
morez feature@embedded_data active local
morez feature@bookmarks enabled local
morez feature@filesystem_limits enabled local
morez feature@large_blocks enabled local
# zfs get all morez
NAME PROPERTY VALUE SOURCE
morez type filesystem -
morez creation Thu Jun 29 19:34 2017 -
morez used 240K -
morez available 899G -
morez referenced 96K -
morez compressratio 1.00x -
morez mounted yes -
morez quota none default
morez reservation none default
morez recordsize 128K default
morez mountpoint /morez default
morez sharenfs off default
morez checksum on default
morez compression off default
morez atime on default
morez devices on default
morez exec on default
morez setuid on default
morez readonly off default
morez zoned off default
morez snapdir hidden default
morez aclinherit restricted default
morez canmount on default
morez xattr on default
morez copies 1 default
morez version 5 -
morez utf8only off -
morez normalization none -
morez casesensitivity sensitive -
morez vscan off default
morez nbmand off default
morez sharesmb off default
morez refquota none default
morez refreservation none default
morez primarycache all default
morez secondarycache all default
morez usedbysnapshots 0 -
morez usedbydataset 96K -
morez usedbychildren 144K -
morez usedbyrefreservation 0 -
morez logbias latency default
morez dedup off default
morez mlslabel none default
morez sync standard default
morez refcompressratio 1.00x -
morez written 96K -
morez logicalused 72.5K -
morez logicalreferenced 40K -
morez filesystem_limit none default
morez snapshot_limit none default
morez filesystem_count none default
morez snapshot_count none default
morez snapdev hidden default
morez acltype off default
morez context none default
morez fscontext none default
morez defcontext none default
morez rootcontext none default
morez relatime off default
morez redundant_metadata all default
morez overlay off default
Answer
While old, I feel this question deserves an answer.
By default, fio issues 4 KB I/Os; ZFS datasets, instead, use a 128 KB recordsize by default. This mismatch means that each 4K write causes a read/modify/write of the entire 128K record.
ZVOLs, on the other hand, use an 8K volblocksize by default. This means that a 4K write causes a much smaller read/modify/write cycle on an 8K block and, with some luck, two 4K writes can be coalesced into a single 8K write (which requires no read/modify/write at all).
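For illustration, the default can be confirmed on the zvol from the question, and a different value can only be chosen at creation time (morez/vol2 below is a hypothetical new zvol, and 16K is just an example value):
/# zfs get volblocksize morez/vol
/# zfs create -V 20G -o volblocksize=16K morez/vol2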
A ZFS dataset's recordsize can be changed with zfs set recordsize=8K <dataset>, and in this case it should give more-or-less equivalent performance to ZVOLs. However, for relatively big transfers (the OP talks about 2 MB files which, being images, should be read in their entirety each time they are accessed) it is better to have a large recordsize/volblocksize, sometimes even larger than the default setting (128K).
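As a concrete sketch (the dataset name matches the OP's pool; note that recordsize only applies to files written after the change, so existing files must be rewritten to benefit):
/# zfs set recordsize=8K morez
/morez# fio --name rw --rw rw --size 10G
For the whole-file image reads described in the question, a larger value such as recordsize=1M may serve better (this relies on the large_blocks pool feature, which the OP's zpool get output shows as enabled).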
Note: as the fio job lacks direct=1 (http://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-direct), some amount of the I/O being performed (both reads and writes) may be cached by the operating system, distorting your results (and making the numbers artificially high). This is further complicated by the following:
- ZFS on Linux either doesn't support O_DIRECT (so the open fails) or, if it does, it does so by quietly falling back to buffered I/O (see point 3 of https://github.com/zfsonlinux/zfs/commit/a584ef26053065f486d46a7335bea222cb03eeea ).
- In some cases, btrfs and ext4 will make O_DIRECT fall back to buffered I/O.
Be aware that O_DIRECT falling back to buffered I/O is allowed because, on Linux, O_DIRECT is more of a hint (see the references section of https://stackoverflow.com/a/46377629/2732969 ).
If you are in a situation where you can't correctly bypass caches, it is crucial to do enough I/O over a large enough area to minimize the impact of caching (unless, of course, you actually want to test caching).
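For example, a cache-minimizing run might look like the following (a sketch: --direct=1 only helps on filesystems that honor O_DIRECT, and the 100G size is an arbitrary figure chosen to far exceed the OP's 8 GiB of RAM):
/mnt# fio --name rw --rw rw --size 10G --direct=1
/morez# fio --name rw --rw rw --size 100G --runtime 3600 --time_based
On the ZFS on Linux release in question, only the second approach is reliable; it is essentially what the OP's extended 500G btrfs test already does.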