Why do my SSD read latency benchmarks get markedly worse when I put an XFS filesystem on top?

I am benchmarking a small server box based on the SuperMicro E300-8D. I've installed the latest CentOS 7.5 with the latest updates, 64GB of DDR4-2100 RAM, and a Samsung 970 EVO 1TB NVMe SSD. The OS is installed on a USB stick in the internal USB port, so the SSD is entirely unused except during my benchmarking.

The goal of my testing is to find an optimal concurrency level for this SSD, inspired by the benchmarking approach used by ScyllaDB. To that end I'm using diskplorer, which internally uses fio to explore the relationship between concurrency and both IOPS and latency. It produces handy graphs like the ones below. In all cases I'm using a 4K random read workload.
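For reference, each data point diskplorer plots should correspond roughly to a plain fio job at a single fixed concurrency level. A hand-rolled equivalent (the job name and exact parameters here are my own illustration, not diskplorer's internals, with 20 outstanding reads picked arbitrarily) would look something like:

$ sudo fio --name=qd20 --filename=/dev/nvme0n1 --readonly \
      --rw=randread --bs=4k --direct=1 --ioengine=libaio \
      --iodepth=20 --size=256G --runtime=30 --time_based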

The problem is I'm getting results that make no sense. Here's the first result:

Raw /dev/nvme0n1

$ sudo ./diskplorer.py --device=/dev/nvme0n1 --filesize=256G 

Winning!

This is fantastic! Samsung's own spec sheet claims 500K read IOPS, and with 20 concurrent reads I'm getting almost 600K. The axis on the right is read latency in nanoseconds, the red line is mean latency, and the error bars are the 5th and 95th percentile latencies. So it looks like the ideal concurrency level for this SSD is about 20 concurrent reads, yielding awesome latency of < 100us.

That's just the raw SSD. I'll put XFS on it, which is optimized for async I/O, and I'm sure it won't add any significant overhead...

With new XFS filesystem on /dev/nvme0n1

$ sudo mkfs.xfs /dev/nvme0n1
$ sudo mount /dev/nvme0n1 /mnt
$ sudo ./diskplorer.py --mountpoint=/mnt --filesize=256G 

Whiskey.  Tango.  Foxtrot.

What!? That's awful! It seems XFS has introduced some absurd amount of latency and dramatically reduced IOPS. What could be wrong?

Just in case, I reboot the system to clear out the caches, not that caching should be a factor on a brand-new filesystem:

XFS on /dev/nvme0n1 after reboot

$ sudo shutdown -r now
(reboot happens)
$ sudo mount /dev/nvme0n1 /mnt
$ sudo ./diskplorer.py --mountpoint=/mnt --filesize=256G 

So much for turning it off and then on again...

No change. It's not cache related.
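(Incidentally, I believe a full reboot isn't strictly necessary just to rule out the page cache; dropping it manually should be equivalent for this purpose:)

$ sync
$ echo 3 | sudo tee /proc/sys/vm/drop_caches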

At this point there is a valid XFS filesystem on /dev/nvme0n1, mounted at /mnt. I'm going to repeat my first test against the raw block device, unmounted, while leaving the contents of the XFS filesystem in place.

Raw /dev/nvme0n1 again

$ sudo umount /mnt
$ sudo ./diskplorer.py --device=/dev/nvme0n1 --filesize=256G 

XFS ruined me SSD!!!111oneone

Oh no, XFS ruined my SSD performance! /sarcasm

Clearly it's not the case that XFS has diabolically ruined my SSD performance, or that XFS is poorly suited for this workload. But what could it be? Even with the filesystem unmounted, so that XFS isn't involved at all, performance is still much reduced.

On a hunch, I tried DISCARDing the entire contents of the SSD, which should reset the allocation of cells within the disk to its original state...

Raw /dev/nvme0n1 after blkdiscard

$ sudo blkdiscard /dev/nvme0n1
$ sudo ./diskplorer.py --device=/dev/nvme0n1 --filesize=256G

Miraculously, the performance of my SSD is restored. Has the whole world gone mad?

Based on a suggestion from @shodanshok, what if I do a dd onto the SSD after I have "fixed" it by doing a blkdiscard?

Raw /dev/nvme0n1 after blkdiscard then zeroed with dd

$ sudo blkdiscard /dev/nvme0n1
$ sudo dd if=/dev/zero of=/dev/nvme0n1 bs=1M status=progress oflag=direct
$ sudo ./diskplorer.py --device=/dev/nvme0n1 --filesize=256G 

This is an interesting result, and it confirms my belief that XFS is not to blame here. Just by filling the SSD with zeroes, read latency and throughput have both deteriorated significantly. So it must be that the SSD itself has some optimized read path for unallocated sectors.

Hypothesis

Clearly XFS isn't killing my SSD, and even if it were, blkdiscard wouldn't be magically restoring it. I emphasize again that these are all read benchmarks, so issues with write journaling, write amplification, wear leveling, etc. are not applicable.

My theory is that this SSD, and perhaps SSDs in general, have an optimization in the read path which detects a read of an unallocated region of the disk and executes a highly optimized code path that sends all zeros back over the PCIe bus.
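If that theory is right, I'd expect the drive to advertise the behavior somewhere. The NVMe Identify Namespace data includes a DLFEAT field describing what deallocated blocks read as, so assuming nvme-cli is installed, something like the following should reveal it (I haven't checked what the 970 EVO actually reports here):

$ sudo nvme id-ns /dev/nvme0n1 | grep -i dlfeat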

My question is, does anyone know if that is correct? If so, are benchmarks of new SSDs without filesystems generally suspect, and is this documented anywhere? If this is not correct, does anyone have any other explanation for these bizarre results?


Solution 1:

Most modern SSDs use a page-based mapping table. At first (or after a complete TRIM/UNMAP) the mapping table is empty, i.e. any LBA returns 0, even if the underlying flash page/block is not completely erased and its actual contents differ from plain zeros.

This means that, after a complete blkdiscard, you are not reading from the flash chips themselves; rather, the controller immediately returns 0 for all your reads. This easily explains your findings.

Some older SSDs use different, less efficient but simpler approaches which always read from the NAND chips themselves. On such drives the value of a trimmed page/block is sometimes undefined, because the controller does not simply mark it as "empty" but actually reads from the NAND each time.
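For SATA SSDs, the analogous behavior is advertised via the DRAT/RZAT bits, which (if memory serves) hdparm lists among its TRIM feature lines; on a hypothetical /dev/sda you could check with:

$ sudo hdparm -I /dev/sda | grep -i trim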

Yes, SSDs are more complex beasts than "plain" HDDs: after all, they basically are small, self-contained, thinly provisioned RAID volumes with their own filesystem/volume-management layer, the FTL (flash translation layer).
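A quick (and destructive, so only on a drive holding no data you care about) way to see this behavior directly is to discard the whole device and read part of it back; on a drive that synthesizes zeroes for unmapped LBAs you should get nothing but 00 bytes:

$ sudo blkdiscard /dev/nvme0n1
$ sudo dd if=/dev/nvme0n1 bs=1M count=16 iflag=direct 2>/dev/null | hexdump -C | head

(Drives without deterministic read-after-trim behavior may return something other than zeroes here.)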