Why does mdadm write unusably slowly when mounted synchronously?
Quoting the questioner:
But there is still Linux caching:
root@ubuntu:/mnt/raid6# dd if=/dev/zero of=delme bs=1M count=10
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.00566339 s, 1.9 GB/s
To disable Linux caching, we can mount the filesystem synchronously:
mount -o remount,sync /mnt/raid6
That's not quite right... sync doesn't simply disable caching the way you'd want for a benchmark. It makes every write result in a "sync", meaning the cache is flushed all the way down to the disk.
Here is an example from another server to illustrate:
$ dd if=/dev/zero of=testfile bs=1M count=500
500+0 records in
500+0 records out
524288000 bytes (524 MB) copied, 0.183744 s, 2.9 GB/s
$ dd if=/dev/zero of=testfile bs=1M count=500 conv=fdatasync
500+0 records in
500+0 records out
524288000 bytes (524 MB) copied, 5.22062 s, 100 MB/s
conv=fdatasync simply means: flush after the write, and report a time that includes that flush. Alternatively, you can do:
$ time ( dd if=/dev/zero of=testfile bs=1M count=500 ; sync )
500+0 records in
500+0 records out
524288000 bytes (524 MB) copied, 0.202687 s, 2.6 GB/s
real 0m2.950s
user 0m0.007s
sys 0m0.339s
Then calculate MB/s from the 2.95 s real time rather than the 0.2 s above. But that is uglier and more work, since the stats printed by dd do not include the sync.
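For example, with the run above that works out to roughly 524288000 bytes / 2.95 s, or about 178 MB/s, not the 2.6 GB/s that dd printed.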
If you used "sync" you would flush every write... maybe that means every block, which would run very slow. "sync" should only be used on very strict systems, eg. databases where the loss of one single transaction due to a disk failure is unacceptable (eg. if I transfer a billion bucks from my bank account to yours, and the system crashes, and suddenly you have the money but so do I).
Here is another write-up with additional options, one I read long ago: http://romanrm.ru/en/dd-benchmark
One more note: the benchmark you are running this way is perfectly valid in my opinion, though not in everyone's. But it is not a real-life test: it is a single-threaded sequential write. If your real-life use case looks like that, e.g. sending some big files over the network, then it may be a good benchmark. If your use case is different, e.g. an FTP server with 500 people uploading small files at the same time, then it is not very representative.
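If your workload really does look like many concurrent writers, a tool such as fio can approximate that better than dd. This is only a rough sketch; the job parameters are invented purely for illustration, and you should adjust the directory, block size, and number of jobs to match your case:

fio --name=many-small-writers --directory=/mnt/raid6 \
    --rw=randwrite --bs=64k --size=128M --numjobs=8 \
    --fsync=1 --group_reporting

Here --numjobs=8 simulates eight writers in parallel and --fsync=1 forces a flush after every write, which is closer in spirit to the strict scenario above than a single cached sequential stream.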
Also, you should use a randomly generated file held in RAM for best results. It should be random data because some filesystems are too smart when you feed them zeros. And it should come from a RAM filesystem, e.g. tmpfs, which on Linux is typically mounted at /dev/shm, rather than reading /dev/urandom directly, because /dev/random is really slow and /dev/urandom, while faster (e.g. 75 MB/s), is still slower than a hard disk.
dd if=/dev/urandom of=/dev/shm/randfile bs=1M count=500
dd if=/dev/shm/randfile of=testfile bs=1M count=500 conv=fdatasync
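Note that the random file has to fit in RAM (500 MB here). You can confirm that /dev/shm is indeed a RAM-backed tmpfs and has enough free space with:

$ df -h /dev/shm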
Performance is dramatically worse because synchronous writing forces the parity work to hit the disks on every single write.
In general, computing and writing parity is a relatively slow process, especially with RAID 6: in your case, md not only has to fragment the data into four chunks, it then computes two chunks of parity for each stripe. To improve performance, RAID implementations (including md) cache recently used stripes in memory so they can compare the data to be written with the existing data and quickly recompute parity on write. If new data is written to a cached stripe, md can compare, fragment, and recompute parity without ever touching the disk, then flush it later. You've created a situation where md always misses that cache, so it has to read the stripe from disk, compare the data, fragment the new data, recompute parity, then flush the new stripe directly to disk. What would require zero disk reads and writes on a cache hit becomes six reads and six writes for every stripe written.
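As a side note, md exposes that stripe cache through sysfs for RAID 5/6 arrays. Assuming your array is /dev/md0 (adjust the name to yours), you can inspect or enlarge it as root; the value 8192 below is just an example, and a larger cache costs more RAM (roughly one page per device per entry) in exchange for more of the cache hits described above:

$ cat /sys/block/md0/md/stripe_cache_size
$ echo 8192 > /sys/block/md0/md/stripe_cache_size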
Granted, the difference in performance you've observed is enormous (1.9 GB/s versus 449 KB/s), but I think it's all accounted for by how much work md is doing to maintain the integrity of the data.
This performance hit may be compounded by how you have the disks arranged... if you have them all on one controller, that much extra reading and writing will bring performance to a standstill.