High-speed network writes with large-capacity storage

Synchronous write mode ensures that writes end up in a persistent location immediately. With asynchronous writes, the data is cached in RAM and the write call returns right away; the filesystem then schedules the actual writes to their final location (the hard disks).

In ZFS's case, the point of the ZIL / SLOG is to act as fast interim persistent storage that makes synchronous mode practical, i.e. it assures the writing client that the writes are durable. Without it, the filesystem would have to write the blocks directly to the hard disks, which makes synchronous mode slow.

In your case, if you want to guarantee full-speed writing of 40 GB of data, you would have to increase your RAM to cover the size of the file.

However, since the filesystem starts writing to the hard disks immediately, you don't need 40 GB of memory to get full-speed writes. For example, by the time the client has written 20 GB of data, 10 GB could be in the RAM cache and the other 10 GB already on the hard drives.

So you need to do some benchmarking to see how much RAM you need in order to get full-speed writes.
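As a rough sketch of such a benchmark (fio is assumed to be available; the dataset path and sizes are only examples), you can write a file the size of your transfer against the ZFS dataset and watch how the throughput behaves once the RAM cache fills up:

    # sequential buffered write of 40 GB against the pool (path/size are examples)
    fio --name=seqwrite --directory=/tank/test --rw=write \
        --bs=1M --size=40G --ioengine=libaio --numjobs=1 --end_fsync=1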


I understand this comes with the risk of data loss if the drive fails or power cuts out. This is acceptable, as I retain the files on the source machine long enough to retransfer them in case of near-term data loss.

If you can tolerate the loss of up to 5 seconds of writes, you can simply configure ZFS to ignore sync requests with zfs set sync=disabled tank, as shown below.
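For example (assuming the pool really is named tank, as above):

    zfs set sync=disabled tank     # treat all writes as async writes
    zfs get sync tank              # verify the current setting
    zfs set sync=standard tank     # revert to the default behaviour later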

Forcing all writes to go through a SLOG, even a very fast one, is never faster than bypassing sync requests. The SLOG is not a classical writeback cache which absorbs writes for de-staging to the slower tier. Rather, it is a means to provide low-latency persistence by temporarily storing sync writes (and only those) on intermediate fast storage. After some seconds, the very same writes are transferred from main memory to the main pool. A SLOG is never read until a crash (and recovery) happens.
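If a pool already has a log vdev, you can watch this behaviour with zpool iostat, which breaks out per-vdev statistics: in normal operation the log device shows write activity but essentially no reads (the pool name is an example):

    # per-vdev I/O statistics, refreshed every 5 seconds
    zpool iostat -v tank 5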

That said, with a single HDD-based mirror vdev you will never be able to saturate a 10 Gb/s link. For consistently writing at ~1 GB/s, you need at least 10 HDDs in raidz2 or 12+ HDDs in mirror+striping - or, even better, an all-SSD pool. And this is before even considering things such as recordsize, compression, etc.
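For illustration only (device names are placeholders, not a recommendation), the two HDD layouts mentioned above would be created roughly like this:

    # ~10 HDDs in a single raidz2 vdev
    zpool create tank raidz2 sda sdb sdc sdd sde sdf sdg sdh sdi sdj

    # 12 HDDs as six striped mirrors (mirror+striping)
    zpool create tank mirror sda sdb mirror sdc sdd mirror sde sdf \
        mirror sdg sdh mirror sdi sdj mirror sdk sdl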

EDIT, to clarify the SLOG's job:

To minimize latency for sync writes, ZFS uses the so-called ZFS Intent Log (ZIL). In short: each time sync writes arrive, ZFS immediately writes them to a temporary pool area called the ZIL. This enables the writes to return immediately, letting the calling application continue. After some seconds, at transaction commit, the same data covered by the ZIL records is written to the main pool. This does not mean that the ZIL is read at each commit; rather, the to-be-written data comes from the DRAM-based ARC cache. In other words, the ZIL is a sort of "write-ahead journal" which assures fast persistence for to-be-written sync data.
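The "some seconds" above is the transaction group commit interval. On OpenZFS on Linux (an assumption about your platform), it is exposed as a module parameter, together with the limit on how much dirty async data may sit in RAM; both can be inspected like this:

    # seconds between transaction group commits (defaults to 5)
    cat /sys/module/zfs/parameters/zfs_txg_timeout
    # maximum amount of dirty (not yet committed) data kept in RAM, in bytes
    cat /sys/module/zfs/parameters/zfs_dirty_data_max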

This actually means that sync writes are duplicated: they are written both to the ZIL and to the main pool. Enter the SLOG (separate log device): a device dedicated to sync writes only, i.e. it frees the main pool from ZIL traffic. A fast SSD SLOG is important because HDDs are very slow for sync writes. The SLOG is not your classical writeback cache because:

  • it only absorbs sync writes, completely ignoring normal (async) writes;
  • it only replicates data that is already cached in ARC.

The two points combined mean that a big SLOG is basically wasted capacity, because it only needs to hold about 3x the maximum size of a ZFS transaction group. In other words, a 2-4 GB SLOG is sufficient for most cases, with a bigger SLOG only useful in specific setups.
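As a sketch, such a small SLOG is attached to the pool as a log vdev; the device names below are placeholders, and a mirrored log is only needed if you care about in-flight sync writes surviving the loss of one SSD:

    # add a small, fast SSD/NVMe partition as a SLOG
    zpool add tank log /dev/nvme0n1p1

    # or, alternatively, add it as a mirrored log vdev
    zpool add tank log mirror /dev/nvme0n1p1 /dev/nvme1n1p1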

Such a SLOG is key to providing lower latency for random sync writes but, while it can absorb very small bursts of sequential sync writes, this is not its main function. In other words, you can see the ZIL/SLOG as a persistent slice of the ARC. The corollary is that you cannot expect to write dozens of GBs and hide the slow main pool behind the SLOG, because that would mean you already had dozens of GBs of dirty data inside your RAM-based ARC.

Setting sync=disabled instructs ZFS to treat all writes, even sync ones, as normal async writes. This bypasses the data ZIL/SLOG entirely and, if you can accept a 5 s data-loss window, it is the fastest setting you can ever achieve - even compared to a very fast SLOG such as Optane or a RAM drive. The nice thing about sync=disabled is that it does not disable sync writes for ZFS's own metadata, so it does not put the filesystem itself at risk. This does not mean you can use it lightly: as stated multiple times, you should be sure you understand its implications (you can lose the last seconds of unsynced data in case of a crash/power loss).

On the other hand, a classical SSD-based writeback cache such as lvmcache or bcache can (more or less) efficiently use hundreds of GBs of SSD cache to mask the main pool's latency/throughput - specifically because they are full-fledged writeback caches which do not need to keep their data in main memory (on the contrary, main memory is flushed through these SSD caches).
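For comparison, attaching a large writeback cache with lvmcache looks roughly like this; the volume group, LV and device names are placeholders, and writeback mode has its own data-loss implications you should read up on first:

    # create a cache pool on the SSD and attach it to the slow LV in writeback mode
    lvcreate --type cache-pool -L 400G -n fastcache vg0 /dev/nvme0n1
    lvconvert --type cache --cachepool vg0/fastcache --cachemode writeback vg0/slowdata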

The reasoning behind ZFS is that the (big) main system memory is your real read/write cache, with the SLOG being a means to get lower latency for random sync writes.