Disabling ext4 write barriers when using an external journal
I'm currently experimenting with different ways of improving write speeds to a fairly large software RAID (mdadm) array of rotating disks on Debian, using fast NVMe devices.
I found that using a pair of such devices (RAID1, mirrored) to store the filesystem's journal yields interesting performance benefits. The mount options I am using to achieve this are noatime,journal_async_commit,data=journal.
In my tests, I've also discovered that adding the barrier=0 option offers significant benefits in terms of write performance. However, I'm not certain that this option is safe to use in my particular filesystem configuration. This is what the kernel documentation says about ext4 write barriers:
Write barriers enforce proper on-disk ordering of journal commits, making volatile disk write caches safe to use, at some performance penalty. If your disks are battery-backed in one way or another, disabling barriers may safely improve performance.
The specific NVMe device I'm using is an Intel DC P3700, which has built-in power-loss protection: in the event of an unexpected shutdown, any data still present in temporary buffers is safely committed to NAND storage thanks to reserve energy storage.
So my question is, can I safely disable ext4 write barriers if the journal is stored on a device with battery-backed cache, while the rest of the filesystem itself sits on disks which don't have this feature?
I'm writing a new answer because after further analysis, I don't think the previous answer is correct.
If we look at the write_dirty_buffer function, it issues a write request with the REQ_SYNC flag, but it doesn't cause a cache flush, or barrier, to be issued. That is accomplished by the blkdev_issue_flush call, which is appropriately gated by a check of the JBD2_BARRIER flag, which itself is only set when the filesystem is mounted with barriers enabled.
So if we look back at checkpoint.c, barriers only matter when a transaction is dropped from the journal. The comments in the code are informative here, telling us that this write barrier is unlikely to be necessary, but is there anyway as a safeguard. I think the assumption is that by the time a transaction is dropped from the journal, the data itself is unlikely to still be lingering in the drive's cache, not yet committed to permanent storage. But since it's only an assumption, the write barrier is issued anyway.
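For reference, the flush in question lives in jbd2_cleanup_journal_tail() in fs/jbd2/checkpoint.c, which looks roughly like this (paraphrased from a recent kernel; exact helper names and signatures vary between versions, so treat it as an illustrative sketch rather than verbatim code):

int jbd2_cleanup_journal_tail(journal_t *journal)
{
        tid_t           first_tid;
        unsigned long   blocknr;

        if (is_journal_aborted(journal))
                return -EIO;

        /* Find the oldest transaction still needed, i.e. where the tail may move to. */
        if (!jbd2_journal_get_log_tail(journal, &first_tid, &blocknr))
                return 1;

        /*
         * Blocks recently written out by jbd2_log_do_checkpoint() must reach
         * stable storage before the corresponding transactions are dropped
         * from the journal.  The in-tree comment notes this is unlikely to be
         * necessary with an appropriately sized journal, but keeps it as a
         * safeguard -- and only when barriers are enabled.
         */
        if (journal->j_flags & JBD2_BARRIER)
                blkdev_issue_flush(journal->j_fs_dev, GFP_NOIO, NULL);

        /* Only now is the journal tail advanced, dropping old transactions. */
        return __jbd2_update_log_tail(journal, first_tid, blocknr);
}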
So why aren't barriers used when writing data to the main filesystem? I think the key here is that as long as the journal is coherent, metadata that's missing from the filesystem (e.g. because it was lost in a power-loss event) is normally recovered during the journal replay, thus avoiding filesystem corruption. Furthermore, the use of data=journal should also guarantee consistency of the actual filesystem data because, as I understand it, the recovery process will also write out data blocks that were committed to the journal as part of its replay mechanism.
So while ext4 does not actually flush disk caches at the end of a checkpoint, some steps should be taken to maximize recoverability in case of a power loss:

1. The filesystem should be mounted with data=journal, and not data=writeback (data=ordered is unavailable when using an external journal). This one should be obvious: we want a copy of all incoming data blocks inside the journal, since those are the ones likely to be lost in a power-loss event. This isn't expensive performance-wise, since NVMe devices are very fast.

2. The maximum journal size of 102400 blocks (400 MB when using 4K filesystem blocks) should be used, so as to maximize the amount of data that's recoverable in a journal replay. This shouldn't be an issue, since NVMe devices are always at least several gigabytes in size.
Problems may still arise in case an unexpected shutdown happens during a write-intensive operation. If transactions get dropped from the journal device faster than the data drives are able to flush their caches on their own, unrecoverable data loss or filesystem corruption could occur.
So the bottom line, in my view, is that it's not 100% safe to disable write barriers, although some precautions (#1 and #2 above) can be implemented to make this setup a little safer.
Another way to put your question is this: when doing a checkpoint, i.e. when writing the data in the journal to the actual filesystem, does ext4 flush out the cache (of the rotating disks, in your case) before marking the transaction as completed and updating the journal accordingly?
If we look at the source code of jbd2 (which is responsible for handling the journalling) in checkpoint.c, we see that jbd2_log_do_checkpoint() calls, at the end:
__flush_batch(journal, &batch_count);
which calls:
write_dirty_buffer(journal->j_chkpt_bhs[i], REQ_SYNC);
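For context, the whole __flush_batch() helper looks roughly like this (simplified sketch; details such as the block-layer plugging and the exact write flag differ between kernel versions):

/* Simplified sketch of __flush_batch() from fs/jbd2/checkpoint.c. */
static void __flush_batch(journal_t *journal, int *batch_count)
{
        int i;
        struct blk_plug plug;

        /* Submit the gathered checkpoint buffers as synchronous writes. */
        blk_start_plug(&plug);
        for (i = 0; i < *batch_count; i++)
                write_dirty_buffer(journal->j_chkpt_bhs[i], REQ_SYNC);
        blk_finish_plug(&plug);

        /* Release our references to the buffer heads. */
        for (i = 0; i < *batch_count; i++)
                __brelse(journal->j_chkpt_bhs[i]);

        *batch_count = 0;
}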
So it seems like it should be safe.
Related: in the past, a patch to use WRITE_SYNC in the journal checkpoint was also proposed. The reason was that writing the buffers had too low a priority, which caused the journal to fill up while waiting for the writes to complete.