Why is ZFS on Linux unable to fully utilize 8x SSDs on AWS i2.8xlarge instance?

I'm completely new to ZFS, so to start with I thought I'd do some simple benchmarks on it to get a feel for how it behaves. I wanted to push the limits of its performance so I provisioned an Amazon EC2 i2.8xlarge instance (almost $7/hr, time really is money!). This instance has 8 800GB SSDs.

I did an fio test on the SSDs themselves, and got the following output (trimmed):

$ sudo fio --name randwrite --ioengine=libaio --iodepth=2 --rw=randwrite --bs=4k --size=400G --numjobs=8 --runtime=300 --group_reporting --direct=1 --filename=/dev/xvdb
[trimmed]
  write: io=67178MB, bw=229299KB/s, iops=57324, runt=300004msec
[trimmed]

57K IOPS for 4K random writes. Respectable.

I then created a ZFS pool spanning all 8 SSDs. At first I had a single raidz1 vdev containing all 8 drives, but I read about the reasons this is bad for performance, so I ended up with four mirror vdevs, like so:

$ sudo zpool create testpool mirror xvdb xvdc mirror xvdd xvde mirror xvdf xvdg mirror xvdh xvdi
$ sudo zpool list -v
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
testpool  2.91T   284K  2.91T         -     0%     0%  1.00x  ONLINE  -
  mirror   744G   112K   744G         -     0%     0%
    xvdb      -      -      -         -      -      -
    xvdc      -      -      -         -      -      -
  mirror   744G    60K   744G         -     0%     0%
    xvdd      -      -      -         -      -      -
    xvde      -      -      -         -      -      -
  mirror   744G      0   744G         -     0%     0%
    xvdf      -      -      -         -      -      -
    xvdg      -      -      -         -      -      -
  mirror   744G   112K   744G         -     0%     0%
    xvdh      -      -      -         -      -      -
    xvdi      -      -      -         -      -      -

I set the recordsize to 4K and ran my test:

$ sudo zfs set recordsize=4k testpool
$ sudo fio --name randwrite --ioengine=libaio --iodepth=2 --rw=randwrite --bs=4k --size=400G --numjobs=8 --runtime=300 --group_reporting --filename=/testpool/testfile --fallocate=none
[trimmed]
  write: io=61500MB, bw=209919KB/s, iops=52479, runt=300001msec
    slat (usec): min=13, max=155081, avg=145.24, stdev=901.21
    clat (usec): min=3, max=155089, avg=154.37, stdev=930.54
     lat (usec): min=35, max=155149, avg=300.91, stdev=1333.81
[trimmed]

I get only 52K IOPS on this ZFS pool. That's actually slightly worse than a single SSD on its own.

I don't understand what I'm doing wrong here. Have I configured ZFS incorrectly, or is this a poor test of ZFS performance?

Note I'm using the official 64-bit CentOS 7 HVM image, though I've upgraded to the 4.4.5 elrepo kernel:

$ uname -a
Linux ip-172-31-43-196.ec2.internal 4.4.5-1.el7.elrepo.x86_64 #1 SMP Thu Mar 10 11:45:51 EST 2016 x86_64 x86_64 x86_64 GNU/Linux

I installed ZFS from the zfs repo listed here. I have version 0.6.5.5 of the zfs package.

UPDATE: Per @ewwhite's suggestion I tried ashift=12 and ashift=13:

$ sudo zpool create testpool mirror xvdb xvdc mirror xvdd xvde mirror xvdf xvdg mirror xvdh xvdi -o ashift=12 -f

and

$ sudo zpool create testpool mirror xvdb xvdc mirror xvdd xvde mirror xvdf xvdg mirror xvdh xvdi -o ashift=13 -f

Neither of these made any difference. From what I understand, the latest ZFS bits are smart enough to identify 4K SSDs and use reasonable defaults.

I did notice, however, that CPU usage is spiking. @Tim suggested this but I dismissed it at the time; I think I just wasn't watching the CPU long enough to notice. This instance has something like 30 CPU cores, and CPU usage spikes as high as 80%. The hungry process? z_wr_iss, lots of instances of it.
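
For anyone wanting to reproduce this observation, a thread-level view from top is enough to spot it (the grep pattern just matches the z_wr_iss/z_wr_int thread names mentioned above; adjust as needed):

$ top -H -b -n 1 | grep -E 'z_wr_iss|z_wr_int' | sort -k9 -nr | head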

I confirmed compression is off, so it's not the compression engine.
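
(For completeness, the check was just the standard property lookup on my pool:)

$ sudo zfs get compression testpool
NAME      PROPERTY     VALUE     SOURCE
testpool  compression  off       default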

I'm not using raidz, so it shouldn't be the parity computation.

I ran perf top, and it shows most of the kernel time being spent in _raw_spin_unlock_irqrestore in z_wr_int_4 and osq_lock in z_wr_iss.
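
(A plain perf top is enough to see this; adding -g for call graphs gives the same picture with a bit more context about where the lock time is attributed:)

$ sudo perf top -g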

I now believe there is a CPU component to this performance bottleneck, though I'm no closer to figuring out what it might be.

UPDATE 2: Per @ewwhite's and others' suggestion that it's the virtualized nature of this environment that creates performance uncertainty, I used fio to benchmark random 4K writes spread across four of the SSDs in the environment. Each SSD by itself gives ~55K IOPS, so I expected somewhere around 240K IOPS across four of them. That's more or less what I got:

$ sudo fio --name randwrite --ioengine=libaio --iodepth=8 --rw=randwrite --bs=4k --size=398G --numjobs=8 --runtime=300 --group_reporting --filename=/dev/xvdb:/dev/xvdc:/dev/xvdd:/dev/xvde
randwrite: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=8
...
randwrite: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=8
fio-2.1.5
Starting 8 processes
[trimmed]
  write: io=288550MB, bw=984860KB/s, iops=246215, runt=300017msec
    slat (usec): min=1, max=24609, avg=30.27, stdev=566.55
    clat (usec): min=3, max=2443.8K, avg=227.05, stdev=1834.40
     lat (usec): min=27, max=2443.8K, avg=257.62, stdev=1917.54
[trimmed]

This clearly shows that the environment, virtualized though it may be, can sustain much higher IOPS than I'm seeing from ZFS. Something about the way ZFS is implemented is keeping it from hitting top speed. I just can't figure out what that is.


Solution 1:

This setup may not be tuned well. When using SSDs, there are parameters that need to be set in the /etc/modprobe.d/zfs.conf file, and the ashift value matters at pool-creation time.

Try ashift=12 or 13 and test again.
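
To double-check what the pool actually ended up with, zdb will print the cached pool config; expect one ashift line per top-level vdev (the exact output layout varies a bit by version):

$ sudo zdb -C testpool | grep ashift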


Edit:

This is still a virtualized solution, so we don't know too much about the underlying hardware or how everything is interconnected. I don't know that you'll get better performance out of this solution.


Edit:

I guess I don't see the point of trying to optimize a cloud instance in this manner; if top performance were the aim, you'd be running on dedicated hardware, right?

But remember that ZFS has a lot of tunable settings, and the defaults aren't anywhere close to what your use case needs.

Try the following in your /etc/modprobe.d/zfs.conf and reboot. It's what I use in my all-SSD data pools for application servers. Your ashift should be 12 or 13. Benchmark with compression=off, but use compression=lz4 in production. Set atime=off. I'd leave recordsize as default (128K).

options zfs zfs_vdev_scrub_min_active=48
options zfs zfs_vdev_scrub_max_active=128
options zfs zfs_vdev_sync_write_min_active=64
options zfs zfs_vdev_sync_write_max_active=128
options zfs zfs_vdev_sync_read_min_active=64
options zfs zfs_vdev_sync_read_max_active=128
options zfs zfs_vdev_async_read_min_active=64
options zfs zfs_vdev_async_read_max_active=128
options zfs zfs_top_maxinflight=320
options zfs zfs_txg_timeout=30
options zfs zfs_dirty_data_max_percent=40
options zfs zfs_vdev_scheduler=deadline
options zfs zfs_vdev_async_write_min_active=8
options zfs zfs_vdev_async_write_max_active=64
options zfs zfs_prefetch_disable=1
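
And as a sketch of the pool/dataset settings described above (pool layout copied from the question; treat it as an example, not a drop-in):

$ sudo zpool create -o ashift=12 testpool \
    mirror xvdb xvdc mirror xvdd xvde mirror xvdf xvdg mirror xvdh xvdi
$ sudo zfs set atime=off testpool
$ sudo zfs set compression=off testpool   # use compression=lz4 in production
$ sudo zfs inherit recordsize testpool    # stays at the 128K default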

Solution 2:

It seems likely that you're waiting on a Linux kernel mutex lock that in turn may be waiting on a Xen ring buffer. I can't be certain of this without access to a similar machine, but I'm not interested in paying Amazon $7/hour for that privilege.

Longer write-up is here: https://www.reddit.com/r/zfs/comments/4b4r1y/why_is_zfs_on_linux_unable_to_fully_utilize_8x/d1e91wo ; I'd rather it be in one place than two.