Limit Linux background flush (dirty pages)
Background flushing on Linux starts when either too much written data is pending (adjustable via /proc/sys/vm/dirty_background_ratio) or pending writes exceed an age limit (/proc/sys/vm/dirty_expire_centisecs). Until a second, hard limit is hit (/proc/sys/vm/dirty_ratio), more written data may still be cached. Once that limit is reached, further writes block.
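For reference, a minimal sketch for inspecting the three knobs mentioned above. The function name show_dirty_settings and its optional directory argument are made up for this example; the argument exists only so the function can be pointed at a test directory instead of /proc/sys/vm.

```shell
#!/bin/sh
# Print the writeback knobs named in the question.
# $1 (optional, hypothetical helper argument): directory holding the knobs.
show_dirty_settings() {
    dir="${1:-/proc/sys/vm}"
    for knob in dirty_background_ratio dirty_ratio dirty_expire_centisecs; do
        if [ -r "$dir/$knob" ]; then
            printf '%s = %s\n' "$knob" "$(cat "$dir/$knob")"
        else
            # Knob missing or unreadable (e.g. not a Linux system).
            printf '%s = n/a\n' "$knob"
        fi
    done
}

show_dirty_settings
```

The amount of data currently waiting for writeback can be watched alongside this in the Dirty and Writeback lines of /proc/meminfo.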
In theory, this should create a background process writing out dirty pages without disturbing other processes. In practice, it does disturb any process doing uncached reading or synchronous writing. Badly. This is because the background flush actually writes at 100% device speed, and any other device requests issued at this time are delayed (because all queues and write caches along the way are full).
Is there a way to limit the amount of requests per second the flushing process performs, or otherwise effectively prioritize other device I/O?
Solution 1:
After lots of benchmarking with sysbench, I came to this conclusion:
To survive (performance-wise) a situation where
- an evil copy process floods dirty pages
- and hardware write-cache is present (possibly also without that)
- and synchronous reads or writes per second (IOPS) are critical
just dump all elevators, queues and dirty page caches. The correct place for dirty pages is in the RAM of that hardware write-cache.
Set dirty_ratio (or the newer dirty_bytes) as low as possible, but keep an eye on sequential throughput. In my particular case, 15 MB was the optimum (echo 15000000 > dirty_bytes).
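A minimal sketch of how that value could be applied and made persistent. The helper name dirty_bytes_conf and the file name 90-dirty-bytes.conf are assumptions for illustration, not part of the original setup. Note that, according to the kernel's vm sysctl documentation, dirty_bytes and dirty_ratio are alternatives: writing one makes the other read back as 0.

```shell
#!/bin/sh
# Emit a sysctl configuration line for a given dirty_bytes value.
# Rejects anything that is not a positive integer.
dirty_bytes_conf() {
    case "$1" in
        ''|*[!0-9]*) echo "usage: dirty_bytes_conf BYTES" >&2; return 1 ;;
    esac
    [ "$1" -gt 0 ] || { echo "BYTES must be greater than 0" >&2; return 1; }
    printf 'vm.dirty_bytes = %s\n' "$1"
}

# Preview the line for the 15 MB value used in this answer.
dirty_bytes_conf 15000000

# To apply as root (the file name is a hypothetical choice):
#   sysctl -w vm.dirty_bytes=15000000                               # until reboot
#   dirty_bytes_conf 15000000 > /etc/sysctl.d/90-dirty-bytes.conf   # persistent
```

The right number depends on the hardware, so it is worth re-running the benchmark after each change rather than copying 15 MB as is.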
This is more a hack than a solution, because gigabytes of RAM are now used for read caching only instead of dirty cache. For dirty cache to work well in this situation, the Linux kernel background flusher would need to measure the average rate at which the underlying device accepts requests and adjust background flushing accordingly. Not easy.
Specifications and benchmarks for comparison:
Tested while dd'ing zeros to disk, sysbench showed a huge improvement: fsync writes at 16 kB rose from 33 to 700 IOPS with 10 threads (idle limit: 1500 IOPS) and from 8 to 400 IOPS with a single thread.
Without load, IOPS were unaffected (~1500) and throughput slightly reduced (from 251 MB/s to 216 MB/s).
dd call:
dd if=/dev/zero of=dumpfile bs=1024 count=20485672
For sysbench, test_file.0 was prepared as a non-sparse file with:
dd if=/dev/zero of=test_file.0 bs=1024 count=10485672
sysbench call for 10 threads:
sysbench --test=fileio --file-num=1 --num-threads=10 --file-total-size=10G --file-fsync-all=on --file-test-mode=rndwr --max-time=30 --file-block-size=16384 --max-requests=0 run
sysbench call for one thread:
sysbench --test=fileio --file-num=1 --num-threads=1 --file-total-size=10G --file-fsync-all=on --file-test-mode=rndwr --max-time=30 --file-block-size=16384 --max-requests=0 run
Smaller block sizes showed even more drastic numbers.
--file-block-size=4096 with 1 GB dirty_bytes:
sysbench 0.4.12: multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 1
Extra file open flags: 0
1 files, 10Gb each
10Gb total file size
Block size 4Kb
Number of random requests for random IO: 0
Read/Write ratio for combined random IO test: 1.50
Calling fsync() after each write operation.
Using synchronous I/O mode
Doing random write test
Threads started!
Time limit exceeded, exiting...
Done.
Operations performed: 0 Read, 30 Write, 30 Other = 60 Total
Read 0b Written 120Kb Total transferred 120Kb (3.939Kb/sec)
0.98 Requests/sec executed
Test execution summary:
total time: 30.4642s
total number of events: 30
total time taken by event execution: 30.4639
per-request statistics:
min: 94.36ms
avg: 1015.46ms
max: 1591.95ms
approx. 95 percentile: 1591.30ms
Threads fairness:
events (avg/stddev): 30.0000/0.00
execution time (avg/stddev): 30.4639/0.00
--file-block-size=4096 with 15 MB dirty_bytes:
sysbench 0.4.12: multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 1
Extra file open flags: 0
1 files, 10Gb each
10Gb total file size
Block size 4Kb
Number of random requests for random IO: 0
Read/Write ratio for combined random IO test: 1.50
Calling fsync() after each write operation.
Using synchronous I/O mode
Doing random write test
Threads started!
Time limit exceeded, exiting...
Done.
Operations performed: 0 Read, 13524 Write, 13524 Other = 27048 Total
Read 0b Written 52.828Mb Total transferred 52.828Mb (1.7608Mb/sec)
450.75 Requests/sec executed
Test execution summary:
total time: 30.0032s
total number of events: 13524
total time taken by event execution: 29.9921
per-request statistics:
min: 0.10ms
avg: 2.22ms
max: 145.75ms
approx. 95 percentile: 12.35ms
Threads fairness:
events (avg/stddev): 13524.0000/0.00
execution time (avg/stddev): 29.9921/0.00
--file-block-size=4096 with 15 MB dirty_bytes on idle system:
sysbench 0.4.12: multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 1
Extra file open flags: 0
1 files, 10Gb each
10Gb total file size
Block size 4Kb
Number of random requests for random IO: 0
Read/Write ratio for combined random IO test: 1.50
Calling fsync() after each write operation.
Using synchronous I/O mode
Doing random write test
Threads started!
Time limit exceeded, exiting...
Done.
Operations performed: 0 Read, 43801 Write, 43801 Other = 87602 Total
Read 0b Written 171.1Mb Total transferred 171.1Mb (5.7032Mb/sec)
1460.02 Requests/sec executed
Test execution summary:
total time: 30.0004s
total number of events: 43801
total time taken by event execution: 29.9662
per-request statistics:
min: 0.10ms
avg: 0.68ms
max: 275.50ms
approx. 95 percentile: 3.28ms
Threads fairness:
events (avg/stddev): 43801.0000/0.00
execution time (avg/stddev): 29.9662/0.00
Test-System:
- Adaptec 5405Z (that's 512 MB write-cache with protection)
- Intel Xeon L5520
- 6 GiB RAM @ 1066 MHz
- Motherboard Supermicro X8DTN (5520 chipset)
- 12 Seagate Barracuda 1 TB disks
- 10 in Linux software RAID 10
- Kernel 2.6.32
- Filesystem xfs
- Debian unstable
In summary, I am now sure this configuration will perform well in idle, high load and even full load situations for database traffic that otherwise would have been starved by sequential traffic. Sequential throughput is higher than two gigabit links can deliver anyway, so no problem reducing it a bit.
Solution 2:
Even though tuning kernel parameters stopped the problem, it's actually possible your performance issues were the result of a bug on the Adaptec 5405Z controller that was fixed in a Feb 1, 2012 firmware update. The release notes say "Fixed an issue where the firmware could hang during high I/O stress." Perhaps spreading out the I/O as you did was enough to prevent this bug from being triggered, but that's just a guess.
Here are the release notes: http://download.adaptec.com/pdfs/readme/relnotes_arc_fw-b18937_asm-18837.pdf
Even if this wasn't the case for your particular situation, I figured this could benefit users who come across this post in the future. We saw some messages like the following in our dmesg output which eventually led us to the firmware update:
aacraid: Host adapter abort request (0,0,0,0)
[above was repeated many times]
AAC: Host adapter BLINK LED 0x62
AAC0: adapter kernel panic'd 62.
sd 0:0:0:0: timing out command, waited 360s
sd 0:0:0:0: Unhandled error code
sd 0:0:0:0: SCSI error: return code = 0x06000000
Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
sd 0:0:0:0: timing out command, waited 360s
sd 0:0:0:0: Unhandled error code
sd 0:0:0:0: SCSI error: return code = 0x06000028
Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
sd 0:0:0:0: timing out command, waited 360s
sd 0:0:0:0: Unhandled error code
sd 0:0:0:0: SCSI error: return code = 0x06000028
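To check whether a machine shows the same symptoms, the kernel log can be searched for the adapter messages quoted above. This is only a sketch: the function name count_aac_symptoms is made up here, and the patterns are taken literally from the log excerpt, so other driver versions may word the messages differently.

```shell
#!/bin/sh
# Count kernel log lines (read from stdin) that match the aacraid
# symptoms quoted in this answer. Prints the number of matching lines.
# Note: grep -c exits with status 1 when the count is 0.
count_aac_symptoms() {
    grep -c -E 'aacraid: Host adapter abort request|AAC: Host adapter BLINK LED|adapter kernel panic'
}

# Usage (reading the kernel log may require root):
#   dmesg | count_aac_symptoms
```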
Here are the model numbers of the Adaptec RAID controllers which are listed in the release notes for the firmware that has the high I/O hang fix: 2045, 2405, 2405Q, 2805, 5085, 5405, 5405Z, 5445, 5445Z, 5805, 5805Q, 5805Z, 5805ZQ, 51245, 51645, 52445.
Solution 3:
A kernel which includes "WBT":
Improvements in the block layer, LWN.net
With writeback throttling, [the block layer] attempts to get maximum performance without excessive I/O latency using a strategy borrowed from the CoDel network scheduler. CoDel tracks the observed minimum latency of network packets and, if that exceeds a threshold value, it starts dropping packets. Dropping writes is frowned upon in the I/O subsystem, but a similar strategy is followed in that the kernel monitors the minimum latency of both reads and writes and, if that exceeds a threshold value, it starts to turn down the amount of background writeback that's being done. This behavior was added in 4.10; Axboe said that pretty good results have been seen.
WBT does not require switching to the new blk-mq block layer. That said, it does not work with the CFQ or BFQ I/O schedulers. You can use WBT with the deadline / mq-deadline / noop / none schedulers. I believe it also works with the new "kyber" I/O scheduler.
As well as scaling the queue size to control latency, the WBT code limits the number of background writeback requests as a proportion of the calculated queue limit.
The runtime configuration is in /sys/class/block/*/queue/wbt_lat_usec.
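A small sketch for listing that setting on every device that exposes it. The function name show_wbt and its optional directory argument are assumptions for this example (the argument is there only for testing). As far as I can tell from the kernel's queue-sysfs documentation, writing 0 to the file disables WBT for that device and writing -1 restores the default.

```shell
#!/bin/sh
# List wbt_lat_usec for each block device that has the attribute.
# $1 (optional, hypothetical helper argument): sysfs block class directory.
show_wbt() {
    root="${1:-/sys/class/block}"
    for f in "$root"/*/queue/wbt_lat_usec; do
        [ -r "$f" ] || continue      # glob did not match, or file unreadable
        dev=${f#"$root"/}            # strip the leading directory
        dev=${dev%%/*}               # keep only the device name
        printf '%s: wbt_lat_usec=%s\n' "$dev" "$(cat "$f")"
    done
}

show_wbt

# To change the target latency as root (sdX is a placeholder device name):
#   echo 0  > /sys/class/block/sdX/queue/wbt_lat_usec   # disable WBT
#   echo -1 > /sys/class/block/sdX/queue/wbt_lat_usec   # back to the default
```

If the loop prints nothing, the running kernel probably has no WBT support or the device does not register for it.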
The build configuration options to look for are
/boot/config-4.20.8-200.fc29.x86_64:CONFIG_BLK_WBT=y
/boot/config-4.20.8-200.fc29.x86_64:# CONFIG_BLK_WBT_SQ is not set
/boot/config-4.20.8-200.fc29.x86_64:CONFIG_BLK_WBT_MQ=y
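The lines above can be reproduced for the running kernel with something like the following sketch. The function name wbt_build_options is made up, and the default path assumes the distribution installs its kernel config under /boot, which is not universal (some kernels expose it as /proc/config.gz instead).

```shell
#!/bin/sh
# Show the CONFIG_BLK_WBT* options from a kernel config file,
# including options that are listed as "is not set".
# $1 (optional): path to the config file.
wbt_build_options() {
    cfg="${1:-/boot/config-$(uname -r)}"
    if [ ! -r "$cfg" ]; then
        echo "no readable kernel config at $cfg" >&2
        return 1
    fi
    grep -E '^(# )?CONFIG_BLK_WBT' "$cfg"
}

wbt_build_options || true
```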
Your problem statement is confirmed 100% by the author of WBT - well done :-).
[PATCHSET] block: buffered writeback throttling
Since the dawn of time, our background buffered writeback has sucked. When we do background buffered writeback, it should have little impact on foreground activity. That's the definition of background activity... But for as long as I can remember, heavy buffered writers have not behaved like that. For instance, if I do something like this:
$ dd if=/dev/zero of=foo bs=1M count=10k
on my laptop, and then try and start chrome, it basically won't start before the buffered writeback is done. Or, for server oriented workloads, where installation of a big RPM (or similar) adversely impacts database reads or sync writes. When that happens, I get people yelling at me.
Results from some recent testing can be found here:
https://www.facebook.com/axboe/posts/10154074651342933
See previous postings for a bigger description of the patchset.