How to increase speed of RAID 5 with mdadm + luks + lvm
Solution 1:
The poor recorded performance stems from several factors:

- mechanical disks are simply very bad at random read/write IO. To discover how bad they can be, simply append `--sync=1` to your `fio` command (short story: they are incredibly bad, at least when compared to proper BBU RAID controllers or powerloss-protected SSDs); see the example after this list;
- RAID5 has an inherent write penalty due to stripe read/modify/write. Moreover, it is strongly discouraged on multi-TB mechanical disks for reliability reasons. Since you have 4 disks, please seriously consider using RAID10 instead;
- LUKS, providing software-based full-disk encryption, inevitably takes its (significant) toll on both reads and writes;
- with BTRFS, LVM is totally unnecessary. While a fat (non-thin) LVM volume will not impair performance in any meaningful way by itself, you are nonetheless inserting another IO layer and exposing yourself to (more) alignment issues;
- finally, BTRFS itself is not particularly fast. In particular, your slow sequential reads can be traced to BTRFS's severe fragmentation (due to it being CoW and enforcing 4K granularity; as a comparison, to obtain good performance from ZFS on mechanical disks one generally selects 64K-128K records).
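As a reference point, a minimal `fio` invocation of that kind might look like the sketch below. The device path and job parameters are only placeholders for your actual setup, and running it against a raw device will destroy whatever is on it:

```
# 16K random writes with O_SYNC semantics (--sync=1) against a raw block
# device; expect very low IOPS on mechanical disks.
# WARNING: this writes directly to the device and destroys its contents.
fio --name=syncwrite --filename=/dev/md0 \
    --rw=randwrite --bs=16k --ioengine=libaio --iodepth=16 \
    --direct=1 --sync=1 --runtime=60 --time_based --group_reporting
```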
To have a baseline performance comparison, I strongly suggest rebuilding your IO stack, measuring random & sequential read/write speed at each step. In other words (a rough command sketch follows the list):

- create a RAID10 array and run `dd` and `fio` on the raw array (without a filesystem);
- if full-disk encryption is really needed, use LUKS to create an encrypted device and re-run `dd` + `fio` on the raw encrypted device (again, with no filesystem). Compare to the previous results to get an idea of what encryption costs performance-wise;
- try both XFS and BTRFS (running the usual `dd` + `fio` quick bench) to understand how the two filesystems behave. If BTRFS is too slow, try replacing it with lvmthin and XFS (but remember that in this case you will lose user-data checksums, for which you need yet another layer, dm-integrity, itself commanding a significant performance hit).
If all this seems a mess, well, it really is. By doing all the above you are only scratching the surface of storage performance: one would also have to consider real application behavior (rather than purely sequential `dd` or purely random `fio` results), cache hit ratio, IO pattern alignment, etc. But hey, a little data is much better than none, so let's start with something basic.
Solution 2:
The short version: I think it's likely that your problem is that your benchmark is using random writes that are much smaller than your RAID chunk size.
Is the performance problem something you noticed while using the system? Or, is it just that the benchmark results look bad? That 16K random write benchmark is approaching the worst case for that RAID 5 with a big 512K chunk.
RAID 5 has a parity chunk that has to be updated alongside the data. If you had a sequential workload that the kernel could chop up into 512K writes, you'd simply be computing the new parity, then writing the data and parity chunks out. One write in translates to two writes out.
But with 16K writes that are much smaller than the chunk size, you've got to read the old data and the old parity first, then compute the new parity, and then write out the new data and parity. That's read-read-write-write: one write in translates to four I/Os. With random writes, there's no way for even the best RAID controller on the planet to predict which chunks to cache.
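You can check which chunk size the array was actually built with; for an md array (the device name below is just an example) it is reported by:

```
# "Chunk Size" appears in the detail output; /proc/mdstat shows it too
mdadm --detail /dev/md0
cat /proc/mdstat
```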
If you're using the array to store large files, then you're in luck: you're just using the wrong benchmark to assess its performance. If you change `randwrite` to simply `write` in your benchmark so that the writes are sequential, it should get a lot better!
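Not knowing your exact fio job, the change would look roughly like this; everything except `--rw` is a placeholder for whatever your current benchmark uses:

```
# Sequential 16K writes instead of random ones: the kernel can merge them
# into larger writes, so the read-modify-write penalty largely disappears
fio --name=seqwrite --filename=/mnt/testfile --size=4G \
    --rw=write --bs=16k --ioengine=libaio --iodepth=16 \
    --direct=1 --runtime=60 --time_based --group_reporting
```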
But if your workload is truly made of more random, small writes, then you're going to have to change something about the array. You'd be better served by a 4-disk RAID 10. But still, that's spinning media; it's not going to rock your world. I'd imagine the performance of RAID 10 would be 2x to 3x what you've got now, something like 275 to 400 IOPS, maybe 4 to 6 MiB/s on that benchmark?
As for using an SSD to cache, perhaps with something like bcache: a single cache device would undermine your redundancy, so consider using a RAID 1 of two SSDs for caching. You definitely don't need NVMe here, given the speed of these drives; SATA would be fine.
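A rough sketch of that idea with bcache (commands from mdadm and bcache-tools; device names are placeholders, and formatting the backing array for bcache is destructive):

```
# Mirror the two SSDs so the cache itself is redundant
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdf /dev/sdg

# Register the HDD array as the backing device and the SSD mirror as the
# cache set; passing both to make-bcache attaches them in one step
make-bcache -B /dev/md0 -C /dev/md1

# The cached device then appears as /dev/bcache0; build LUKS/filesystem on it
```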
(BTW, don't sweat partitions vs. raw devices. It doesn't make a difference. Personally, I don't use partitions.)