Tuning Linux disk caching behaviour for maximum throughput
I'm running into a maximum throughput issue here and need some advice on which way to tune my knobs. We're running a 10Gbit fileserver for backup distribution. It's a two-disk S-ATA2 setup on an LSI MegaRAID controller. The server also has 24 GB of memory.
We have a need to mirror our last uploaded backup with maximum throughput.
The RAID0 for our "hot" backups gives us around 260 MB/s write and 275 MB/s read. A test tmpfs of 20 GB gives us around 1 GB/s. This kind of throughput is what we need.
Now how can I tune the virtual memory subsystem of Linux to cache the last uploaded files for as long as possible in memory without writing them out to disk (or even better: writing to disk AND keeping them in memory)?
I set up the following sysctls, but they don't give us the throughput we expect:
# VM pressure fixes
vm.swappiness = 20
vm.dirty_ratio = 70
vm.dirty_background_ratio = 30
vm.dirty_writeback_centisecs = 60000
This should in theory give us 16 GB for caching I/O and wait some minutes until it starts writing to disk. Still, when I benchmark the server, I see no effect on writes; the throughput doesn't increase.
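For reference, the values are applied with sysctl -p, and the dirty page counters in /proc/meminfo show whether the page cache is actually absorbing the writes during a benchmark (a generic sketch, nothing box-specific):

# Apply /etc/sysctl.conf and confirm the values the kernel is using
sysctl -p
sysctl vm.dirty_ratio vm.dirty_background_ratio

# While the benchmark runs: if "Dirty" never grows into the gigabytes,
# the writes are not being held back in the page cache
watch -n1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'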
Help or advice needed.
By the look of the variables you've set, it seems like you are mostly concerned with write performance and do not care about possible data loss due to power outages.
You will only ever get lazy writes and the use of a writeback cache with asynchronous write operations. Synchronous write operations require committing to disk and will never be lazy-written. Your filesystem might be causing frequent page flushes and synchronous writes (typically due to journalling, especially with ext3 in data=journal mode). Additionally, even "background" page flushes will interfere with uncached reads and synchronous writes, thus slowing them down.
In general, you should take some metrics to see what is happening - do you see your copy process put in "D" state waiting for I/O work to be done by pdflush? Do you see heavy synchronous write activity on your disks?
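A quick way to check both (iostat comes from the sysstat package; the commands are generic, not specific to your setup):

# Processes blocked in uninterruptible sleep ("D"), i.e. waiting on I/O
ps -eo state,pid,comm | awk '$1 == "D"'

# Per-device write throughput and utilisation, refreshed every second
iostat -x 1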
If all else fails, you might choose to set up an explicit tmpfs filesystem to which you copy your backups, and just synchronize the data to your disks after the fact - even automatically, using inotify.
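A minimal sketch of that approach, assuming /mnt/staging and /srv/backup are placeholders for your own paths and that inotify-tools is installed:

# Stage uploads in RAM first (keep the size well below the 24 GB of RAM)
mkdir -p /mnt/staging
mount -t tmpfs -o size=16g tmpfs /mnt/staging

# ... receive the backup into /mnt/staging at memory speed ...

# Drain finished files to the RAID set automatically
inotifywait -m -e close_write /mnt/staging |
while read -r dir event file; do
    rsync -a "/mnt/staging/$file" /srv/backup/
done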
For read caching things are significantly simpler - there is the fcoretools fadvise utility which has the --willneed parameter to advise the kernel to load the file's contents into the buffer cache.
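For example (the path is a placeholder, and the exact invocation may differ depending on how fadvise is packaged on your distribution):

# Hint the kernel to pre-load the finished backup into the page cache
fadvise --willneed /srv/backup/latest-backup.tar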
Edit:
vm.dirty_ratio = 70
This should in theory give us 16 GB for caching I/O and wait some minutes until it starts writing to disk.
This would not have greatly influenced your testing scenario, but there is a misconception in your understanding. The dirty_ratio parameter is not a percentage of your system's total memory but rather of your system's free memory.
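If you want the threshold to be predictable regardless of how that percentage is calculated, the kernel also accepts absolute values (the numbers below simply mirror the 16 GB intent from your question; when these are non-zero, the corresponding *_ratio settings are ignored):

# ~16 GB of dirty data before writers are forced to flush
vm.dirty_bytes = 17179869184
# ~8 GB of dirty data before background writeback kicks in
vm.dirty_background_bytes = 8589934592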
There is an article about Tuning for Write-Heavy loads with more in-depth information.
Or just get more disks... The drive array configuration you have does not support the throughput you require. This is a case where the solution should be re-engineered to meet your real needs. I understand that this is only backup, but it makes sense to avoid a kludgy fix.
Using the memory cache may lead to data loss: if something goes wrong, the data that is in memory and not yet saved to disk will be lost.
That said, there is tuning to be done at the filesystem level.
For example, if you were using ext4, you could try the mount option:
barrier=0
That: "disables the use of write barriers in the jbd code. Write barriers enforce proper on-disk ordering of journal commits, making volatile disk write caches safe to use, at some performance penalty. If your disks are battery-backed in one way or another, disabling barriers may safely improve performance. The mount options "barrier" and "nobarrier" can also be used to enable or disable barriers, for consistency with other ext4 mount options."
More at: http://www.mjmwired.net/kernel/Documentation/filesystems/ext4.txt
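A quick sketch of how that would be applied (mount point and device are placeholders, and this is only safe with a battery-backed controller cache):

# Remount an existing ext4 backup volume without write barriers
mount -o remount,barrier=0 /srv/backup

# Or make it persistent in /etc/fstab:
# /dev/sdX1  /srv/backup  ext4  defaults,barrier=0  0  2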