Linux file system cache: Move data from Dirty to Writeback

My software RAID can sustain writes of 800 MB/s. I see that rate when cat /proc/meminfo | grep Writeback: reports more than 2 GB. Most of the time, however, Writeback is around 0.5 GB, which gives only about 200 MB/s.

There is plenty of data to be written. cat /proc/meminfo |grep Dirty: says the dirty cache is 90 GB.
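To watch both counters side by side, the two grep commands can be combined. This is only a minimal sketch: the function name show_dirty_writeback and its optional file argument are made up for illustration, and the values are converted from the kB that /proc/meminfo reports to MB.

```shell
#!/bin/sh
# Print the Dirty and Writeback lines of a meminfo-style file, in MB.
# The optional first argument is a hypothetical convenience so the function
# can be pointed at a saved copy; it defaults to /proc/meminfo.
show_dirty_writeback() {
    awk '$1 == "Dirty:" || $1 == "Writeback:" {
             # /proc/meminfo reports kB; truncate to whole MB
             printf "%s %d MB\n", $1, $2 / 1024
         }' "${1:-/proc/meminfo}"
}

# Only run against the live file when it exists (i.e. on Linux).
[ -r /proc/meminfo ] && show_dirty_writeback
```

Wrapped in watch or a sleep loop, this shows how Writeback moves while Dirty stays high.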

As I understand it, Dirty is what needs to be written, whereas Writeback is what is actively being written to disk. So there may be blocks in Dirty that are located on the disk right next to blocks in Writeback, and these will not be written in the same pass.

This would explain why I get much worse performance when Writeback is small: the time spent seeking is much longer than the time spent writing a few extra MB.

So my question is: Can I somehow tell the kernel to move data from Dirty to Writeback more aggressively, and thus increase Writeback?

-- Edit --

This is during low performance:

$ cat /proc/meminfo
MemTotal:       264656352 kB
MemFree:          897080 kB
Buffers:              72 kB
Cached:         233751012 kB
SwapCached:            0 kB
Active:          3825364 kB
Inactive:       230327200 kB
Active(anon):     358120 kB
Inactive(anon):    47536 kB
Active(file):    3467244 kB
Inactive(file): 230279664 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:      204799996 kB
SwapFree:       204799996 kB
Dirty:          109921912 kB
Writeback:        391452 kB
AnonPages:        404748 kB
Mapped:            12428 kB
Shmem:               956 kB
Slab:           21974168 kB
SReclaimable:   21206844 kB
SUnreclaim:       767324 kB
KernelStack:        5248 kB
PageTables:         7152 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    337128172 kB
Committed_AS:     555272 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      544436 kB
VmallocChunk:   34124336300 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:      149988 kB
DirectMap2M:    17649664 kB
DirectMap1G:    250609664 kB


$ cat /proc/sys/vm/dirty_background_ratio
1

Lowering dirty_writeback_centisecs only chops Dirty up into even smaller pieces.
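For reference, the other dirty_* settings live in the same directory and can be listed in one go. A small sketch follows; the function name show_dirty_tunables and its optional directory argument are my own invention, there only so it can be pointed at a copy of the files.

```shell
#!/bin/sh
# List every dirty_* tunable in a directory together with its current value.
# The directory argument is a hypothetical convenience; it defaults to /proc/sys/vm.
show_dirty_tunables() {
    dir="${1:-/proc/sys/vm}"
    for f in "$dir"/dirty_*; do
        # Skip anything unreadable (including an unmatched glob).
        if [ -r "$f" ]; then
            printf '%s = %s\n' "${f##*/}" "$(cat "$f")"
        fi
    done
    return 0
}

# Only run against the live directory when it exists (i.e. on Linux).
[ -d /proc/sys/vm ] && show_dirty_tunables
```

Running sysctl -a | grep dirty should give much the same information.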


You didn't give the entire /proc/meminfo output, so I don't know whether you have already done any tuning.

Two tunables that you can try right away are these.

/proc/sys/vm/dirty_background_ratio

dirty_background_ratio

Contains, as a percentage of total system memory, the number of pages at which
the pdflush background writeback daemon will start writing out dirty data.

The default is 10. Increase it to 30 or 40 and test.
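To get a feel for what a given ratio means in absolute terms, it can be converted to megabytes. This is a rough sketch with assumptions: ratio_to_mb is a hypothetical helper, and it uses MemTotal as the base, as in the description quoted above. Newer kernels apply the ratio to available (free plus reclaimable) memory instead, so treat the result as an upper estimate.

```shell
#!/bin/sh
# Estimate, in MB, the amount of dirty memory a given percentage corresponds to.
#   $1  ratio in percent (e.g. 10)
#   $2  meminfo-style file; hypothetical convenience, defaults to /proc/meminfo
# Assumption: MemTotal is used as the base, which overestimates on kernels
# that apply the ratio to available memory.
ratio_to_mb() {
    awk -v r="$1" '$1 == "MemTotal:" {
             # kB * percent / 100, then kB -> MB, truncated
             printf "%d\n", $2 * r / 100 / 1024
         }' "${2:-/proc/meminfo}"
}

# Only run against the live file when it exists (i.e. on Linux).
[ -r /proc/meminfo ] && ratio_to_mb 10
```

On the 264656352 kB machine from the question, a ratio of 1 works out to roughly 2.5 GB and a ratio of 10 to roughly 25 GB.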

/proc/sys/vm/dirty_writeback_centisecs

dirty_writeback_centisecs

The pdflush writeback daemons will periodically wake up and write `old' data
out to disk.  This tunable expresses the interval between those wakeups, in
100'ths of a second.

Setting this to zero disables periodic writeback altogether.

The default is 500. Set it to 300 and test.

Please remember these are not absolute values. You have to go through trial and error to find out what suits your environment most.

I arrived at these values from the description you provided, assuming it is accurate.

If you have the kernel-doc package installed, open vm.txt in its sysctl directory to read about these tunables.


The real problem is that the Linux kernel's dirty page flush algorithm does not scale to large memory sizes. Whenever Dirty in /proc/meminfo exceeds around 1 GB, the writeback speed slows down progressively. Eventually the /proc/sys/vm/dirty_ratio or /proc/sys/vm/dirty_bytes limit is exceeded and the kernel starts throttling all writes to keep Dirty from growing any further.

To maintain a high write speed (in the OP's case up to 800 MB/s; it can easily be 2 GB/s for a hardware RAID controller with cache) you need to, counterintuitively, lower /proc/sys/vm/dirty_bytes and /proc/sys/vm/dirty_background_bytes to 256M and 64M respectively.
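Both files take plain byte counts. Here is a minimal sketch that only prints the commands rather than running them, assuming "256M" and "64M" mean MiB; mib_to_bytes and print_dirty_limits are hypothetical helper names.

```shell
#!/bin/sh
# Convert a size in MiB to bytes.
mib_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

# Print (not execute) the sysctl commands for the given limits.
#   $1  dirty_bytes limit in MiB
#   $2  dirty_background_bytes limit in MiB
print_dirty_limits() {
    printf 'sysctl -w vm.dirty_background_bytes=%s\n' "$(mib_to_bytes "$2")"
    printf 'sysctl -w vm.dirty_bytes=%s\n' "$(mib_to_bytes "$1")"
}

print_dirty_limits 256 64
# prints:
#   sysctl -w vm.dirty_background_bytes=67108864
#   sysctl -w vm.dirty_bytes=268435456
```

The printed lines have to be run as root to take effect. Writing the *_bytes files makes the matching *_ratio files read back as 0, since only one of each pair is in force at a time.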

Make sure you do a sync first; otherwise the system will freeze on writes for several hours, until Dirty in /proc/meminfo drops below the new value in /proc/sys/vm/dirty_bytes. The sync will also take several hours, but at least the system will not be frozen during that time.
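One way to be sure it is safe to lower the limit is to check that Dirty has actually dropped below the new value before writing it. A sketch under assumptions: dirty_below and wait_for_dirty_below are hypothetical helpers, the 10 second poll interval is arbitrary, and nothing here changes any setting by itself.

```shell
#!/bin/sh
# Succeed (exit status 0) when Dirty is below the given limit.
#   $1  limit in bytes
#   $2  meminfo-style file; hypothetical convenience, defaults to /proc/meminfo
dirty_below() {
    awk -v limit="$1" '
        $1 == "Dirty:" { found = 1; ok = ($2 * 1024 < limit) }  # kB -> bytes
        END { exit !(found && ok) }
    ' "${2:-/proc/meminfo}"
}

# Poll until Dirty has dropped below the limit (same arguments as above).
wait_for_dirty_below() {
    until dirty_below "$1" "${2:-/proc/meminfo}"; do
        sleep 10
    done
}

# Possible usage as root (deliberately left commented out):
#   sync
#   wait_for_dirty_below 268435456
#   sysctl -w vm.dirty_bytes=268435456
```

If new writes keep arriving during the wait, Dirty may never get below the limit, so the writers would have to be paused first.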