Generating a lot of dirty pages is blocking synchronous writes

Solution 1:

There are a couple of things I'd be interested to know the results of:

  1. Initially creating the large file with fallocate, then writing into it.

  2. Setting dirty_background_bytes much, much lower (say 1 GiB) and using CFQ as the scheduler. Note that in this test it might be more representative to run the small write in the middle of the big run.

For option 1, you might find you avoid the data=ordered semantics entirely: the block allocation is already done (and done quickly) because the space was pre-allocated via fallocate, so the metadata is set up before the write. It would be useful to test whether this really is the case, but I'm fairly confident it will improve performance.
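A minimal sketch of this (file name, size, and mount point are illustrative):

  # pre-allocate the blocks so the later write needs no block allocation
  fallocate -l 4G /mnt/test/big
  # write into the pre-allocated file; conv=notrunc keeps dd from
  # truncating the file and throwing the allocation away
  dd if=/dev/zero of=/mnt/test/big bs=1M count=4096 conv=notrunc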

For option 2, you can make more use of ionice. Deadline is demonstrably faster than CFQ, but CFQ tries to organize IO per process, so each process gets a fairer share of the IO.
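A sketch of option 2 (the device name and the dd arguments are assumptions, and all of it needs root):

  # drop the background writeback threshold to 1 GiB
  sysctl -w vm.dirty_background_bytes=1073741824
  # switch the disk under test to CFQ so ionice priorities are honoured
  echo cfq > /sys/block/sdX/queue/scheduler
  # run the big writer at the lowest best-effort priority
  ionice -c2 -n7 dd if=/dev/zero of=/mnt/test/big bs=1M count=4096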

I read somewhere (can't find a source now) that dirty_background_ratio will block writes from the individual committing process (effectively slowing the big writer down) to prevent one process starving all the others; that per-process blocking is usually attributed to dirty_ratio, though. Given how little information I can find on that behaviour now, I have less confidence this will work.

Oh, I should point out that fallocate relies on extents, so you'll need to use ext4.

Solution 2:

I'm replying to my own questions, but if anyone could suggest something better I would be extremely grateful :)

Having 4 GB of dirty memory at the end of the test, I conclude that the IO scheduler was not invoked in the above test. Is that right?

This is completely wrong. The amount of dirty memory is not a good indicator. This can easily be proved by running iostat and checking that a lot of writing is happening while the dd with oflag=sync is running.
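For example (assuming sysstat's iostat; run it alongside the test):

  # extended per-device statistics every second; watch the write columns
  iostat -x 1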

Is there a way to dig deeper into what is blocked? Any interesting counters to watch?

perf record -e 'jbd:*' -e 'block:*' -ag

For newer kernels, replace jbd with jbd2.
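A sketch for a newer kernel (the 30 s window is arbitrary; run the dd test in parallel):

  # trace journal and block-layer events system-wide with call graphs
  perf record -e 'jbd2:*' -e 'block:*' -ag -- sleep 30
  # browse what was blocking and where
  perf report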

Any idea on the source of the contention?

In fact, on ext3 with data=ordered, the journalling thread is responsible for flushing the data to disk. The flushes happen in the order of the writes, and their frequency can be controlled with the commit option when mounting the file system.
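For example (device and mount point are assumptions):

  # flush the journal, and with it the ordered data, every 60 seconds
  mount -o data=ordered,commit=60 /dev/sdb1 /mnt/test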

An interesting experiment: mount the file system with commit=60 and disable the writeback thread. When running the first dd, it completes in 2s, and iostat shows that no IO was generated!

When running the second dd with oflag=sync, all the IO generated by the first dd is flushed to disk.
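A hedged reconstruction of that experiment (file names and sizes are illustrative, not the original values):

  # one way to disable the periodic writeback thread
  sysctl -w vm.dirty_writeback_centisecs=0
  # the first dd returns in seconds: the data only sits in the page cache
  dd if=/dev/zero of=/mnt/test/big bs=1M count=4096
  # this sync write cannot complete until the journal has flushed the
  # gigabytes of ordered data queued ahead of it
  dd if=/dev/zero of=/mnt/test/small bs=4k count=1 oflag=sync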

We are thinking of either reducing the dirty_ratio values or performing the first dd in synchronous mode.
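Both can be sketched as follows (the values are illustrative):

  # option A: lower the dirty-page thresholds system-wide
  sysctl -w vm.dirty_background_ratio=1 vm.dirty_ratio=2
  # option B: write the big file synchronously so dirty pages never pile up
  dd if=/dev/zero of=/mnt/test/big bs=1M count=4096 oflag=sync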

For the record, both solutions give good results. Another good idea is to put those big files on a separate file system (possibly mounted with data=writeback).
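For example (device and mount point are assumptions; note that data=writeback weakens crash consistency on that file system):

  # a dedicated file system for the big files, with relaxed ordering
  mount -o data=writeback /dev/sdc1 /mnt/bigfiles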

This is not specific to SLES11 or older kernels; the same behaviour occurs on every kernel I've tried.