Fastest Linux Filesystem on Shingled Disks
There is considerable interest in shingled drives. These put data tracks so close together that you can't write to one track without clobbering the next. This may increase capacity by 20% or so, but results in write amplification problems. There is work underway on filesystems optimised for shingled drives; see, for example: https://lwn.net/Articles/591782/
Some shingled disks, such as the Seagate 8TB Archive, have a cache area for random writes, allowing decent performance on generic filesystems. The disk can even be quite fast on some common workloads, up to around 200 MB/s writes. However, it is to be expected that if the random-write cache overflows, performance may suffer. Presumably, some filesystems are better at avoiding random writes in general, or at avoiding the patterns of random writes likely to overflow the write cache found in such drives.
Is any mainstream filesystem in the Linux kernel better than ext4 at avoiding the performance penalty of shingled disks?
Intuitively, copy-on-write and log-structured filesystems might give better performance on shingled disks by reducing random writes. The benchmarks somewhat support this; however, these differences in performance are not specific to shingled disks. They also occur on an unshingled disk used as a control. Thus switching to a shingled disk might not have much relevance to your choice of filesystem.
The nilfs2 filesystem gave quite good performance on the SMR disk. However, this was because I allocated the whole 8TB partition, and the benchmark only wrote ~0.5TB, so the nilfs cleaner did not have to run. When I limited the partition to 200GB the nilfs benchmarks did not even complete successfully. Nilfs2 may be a good choice performance-wise if you really use the "archive" disk as an archive disk where you keep all the data and snapshots written to the disk forever, as then the nilfs cleaner does not have to run.
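For reference, setting up nilfs2 on the whole disk is straightforward; a minimal sketch (device name hypothetical):
sudo mkfs.nilfs2 /dev/sdX1                 # whole 8TB partition, so free segments are plentiful
sudo mount -t nilfs2 /dev/sdX1 /mnt/smr    # mount.nilfs2 starts the userspace cleaner (nilfs_cleanerd)
# The cleaner's behaviour can be tuned in /etc/nilfs_cleanerd.conf if it does end up having to run.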
I understand that the 8TB Seagate ST8000AS0002-1NA17Z drive I used for the test has a ~20GB cache area. I changed the default filebench fileserver settings so that the benchmark set would be ~125GB, larger than the unshingled cache area:
set $meanfilesize=1310720
set $nfiles=100000
run 36000
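For context, a sketch of how such a run might be invoked, assuming the stock fileserver workload shipped with filebench (the workload path varies by distribution; the output file name is only chosen to match the *0.out pattern used by the grep below):
cp /usr/share/filebench/workloads/fileserver.f fileserver-smr.f
# edit fileserver-smr.f: point $dir at the filesystem under test and apply
# the overrides above ($meanfilesize, $nfiles, run 36000)
filebench -f fileserver-smr.f | tee SMR.ext4.0.out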
Now for the actual data. The number of ops measures the "overall" fileserver performance, while ms/op measures the latency of the random append and can be used as a rough guide to the performance of random writes.
$ grep rand *0.out | sed s/.0.out:/\ / |sed 's/ - /-/g' | column -t
SMR8TB.nilfs   appendfilerand1  292176ops  8ops/s   0.1mb/s  1575.7ms/op  95884us/op-cpu  [0ms-7169ms]
SMR.btrfs      appendfilerand1  214418ops  6ops/s   0.0mb/s  1780.7ms/op  47361us/op-cpu  [0ms-20242ms]
SMR.ext4       appendfilerand1  172668ops  5ops/s   0.0mb/s  1328.6ms/op  25836us/op-cpu  [0ms-31373ms]
SMR.xfs        appendfilerand1  149254ops  4ops/s   0.0mb/s  669.9ms/op   19367us/op-cpu  [0ms-19994ms]
Toshiba.btrfs  appendfilerand1  634755ops  18ops/s  0.1mb/s  652.5ms/op   62758us/op-cpu  [0ms-5219ms]
Toshiba.ext4   appendfilerand1  466044ops  13ops/s  0.1mb/s  270.6ms/op   23689us/op-cpu  [0ms-4239ms]
Toshiba.xfs    appendfilerand1  368670ops  10ops/s  0.1mb/s  195.6ms/op   19084us/op-cpu  [0ms-2994ms]
Since the Seagate spins at 5980RPM, one might naively expect the Toshiba to be 20% faster. These benchmarks show it as being roughly 3 times (200%) faster, so these benchmarks are hitting the shingled performance penalty. We see that the shingled (SMR) disk still can't match the performance of ext4 on an unshingled (PMR) disk. The best performance was with nilfs2 on an 8TB partition (so the cleaner didn't need to run), but even then it was significantly slower than the Toshiba with ext4.
To make the benchmarks above clearer, it might help to normalise them relative to the performance of ext4 on each disk:
                ops   randappend
SMR.btrfs:      1.24  0.74
SMR.ext4:       1     1
SMR.xfs:        0.86  1.98
Toshiba.btrfs:  1.36  0.41
Toshiba.ext4:   1     1
Toshiba.xfs:    0.79  1.38
We see that on the SMR disk btrfs retains most of the advantage in overall ops that it has over ext4, and its penalty on random appends, as a ratio, is not as dramatic as on the unshingled disk. This might lead one to move to btrfs on the SMR disk. On the other hand, if you need low-latency random appends, this benchmark suggests you want xfs, especially on SMR. We see that while SMR/PMR might influence your choice of filesystem, considering the workload you are optimising for seems more important.
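For what it's worth, the normalised table can be reproduced from the raw filebench lines with something like the following (a sketch; the awk field positions assume the column layout of the grep output above, and the latency ratio is inverted so that higher means lower latency):
grep rand *0.out | sed 's/.0.out:/ /' | awk '
  { ops[$1] = $3 + 0; lat[$1] = $6 + 0 }   # $3 = NNNops, $6 = NNNms/op
  END {
    for (fs in ops) {
      split(fs, a, "."); base = a[1] ".ext4"
      printf "%-15s %.2f  %.2f\n", fs ":", ops[fs] / ops[base], lat[base] / lat[fs]
    }
  }' | sort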
I also ran an attic-based benchmark. The durations of the attic runs (on the full-disk 8TB SMR partitions) were:
ext4: 1 days 1 hours 19 minutes 54.69 seconds
btrfs: 1 days 40 minutes 8.93 seconds
nilfs: 22 hours 12 minutes 26.89 seconds
In each case the attic repositories had the following stats:
               Original size  Compressed size  Deduplicated size
This archive:  1.00 TB        639.69 GB        515.84 GB
All archives:  901.92 GB      639.69 GB        515.84 GB
Adding a second copy of the same 1 TB disk to attic took 4.5 hours on each of these three filesystems. A raw dump of the benchmarks and smartctl information is at:
http://pastebin.com/tYK2Uj76
https://github.com/gmatht/joshell/tree/master/benchmarks/SMR
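For reference, the shape of each attic run was roughly as follows (a sketch; repository and source paths are hypothetical):
attic init /mnt/test/attic.repo
time attic create --stats /mnt/test/attic.repo::first  /path/to/1TB/source
# the "second copy" is simply another archive of the same data, which
# deduplicates against the first:
time attic create --stats /mnt/test/attic.repo::second /path/to/1TB/source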
If you rsync from an SMR drive, make sure the filesystem is mounted read-only or with the noatime option. Otherwise the SMR drive will need to write a timestamp for each file rsync reads, resulting in a significant performance degradation (from around 80 MB/s down to 3-5 MB/s here) and head wear / clicking noise.
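For example (device and paths hypothetical), mount the SMR source without atime updates before reading it with rsync:
sudo mount -o ro,noatime /dev/sdX1 /mnt/smr_src
rsync -a /mnt/smr_src/ /path/to/backup/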
If you already have an rsync job running with poor performance, there is no need to stop it; you can remount the source filesystem with:
sudo mount -o remount,ro /path/to/source/fs
The effect will not be seen immediately; be patient and wait 10 to 20 minutes until the drive has finished writing out all the data still in its buffers. This advice is tried and tested.
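One way to see when the drive has caught up is to watch its write activity settle down, e.g. with iostat from the sysstat package (device name hypothetical):
iostat -dx /dev/sdX 10    # write throughput drops back towards zero once the cache has been flushed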
This might also apply when rsyncing to an SMR drive, i.e. if the filesystem tries to update the timestamp after the file has been fully written to disk. This disturbs the sequential workload, and huge bands of data are continuously rewritten, contributing to drive wear. The following may help:
sudo mount -t fs_type -o rw,noatime device /path/to/dest/fs
This has to be done before rsync is run; other factors may render this option insignificant, e.g. unbuffered FAT/MFT updating, parallelized writes if the filesystem is optimized primarily for SSDs, etc.
If you want to back up full filesystems anyway, try using dd bs=32M and then resizing the filesystem on the SMR target (no need to have it mounted and run rsync to transport each and every file in this case).
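A minimal sketch of that approach, assuming an ext4 source smaller than the SMR target partition (device names hypothetical):
sudo dd if=/dev/sdX1 of=/dev/sdY1 bs=32M conv=fsync status=progress
sudo e2fsck -f /dev/sdY1    # required before resizing the copied ext4 filesystem
sudo resize2fs /dev/sdY1    # grow it to fill the SMR partition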
The actual hardware in use was a Seagate drive-managed SMR 8TB consumer drive. Your mileage may vary with other hardware.