md raid5 extremely slow on discard
Hi all md raid experts
I'm running CentOS 8 (with all the latest updates, kernel 4.18.0-240.1.1.el8_3.x86_64) with 3x Samsung SSD 860 QVO 2TB drives in raid5 (to be used as a base for some KVM VMs), and anything that involves discard is not just slow, it's way beyond usable. I created a 1.5T LV and then ran "mkfs.ext4" on it. After 4 hours the discard stage reported "Discarding device blocks: 10489856/409148416"; at first I thought "4 hours for 25%, this sucks", but then I realized it's only 2.5%, so we're talking about a week!
I broke up the raid and ran blkdiscard on the three individual drives; it took about 18 seconds each.
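For reference, that per-drive test was just timing blkdiscard against each former member (drive names assumed from the mdadm command further down; note that blkdiscard wipes all data on the device):
# WARNING: discards all data on each drive
time blkdiscard /dev/sdb
time blkdiscard /dev/sdd
time blkdiscard /dev/sde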
The hardware is an HP ProLiant DL380p Gen8 with a Smart Array P420i controller (no special drivers, all stock CentOS drivers) that I configured for HBA mode, so it should be a pure passthrough (discard isn't supported at all when using the hardware raid).
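One way to confirm that discard really is passed through the controller in HBA mode is to check the DISC-GRAN and DISC-MAX columns that lsblk reports (non-zero values mean the kernel sees discard support on the device); just a quick sanity check:
lsblk --discard /dev/sdb /dev/sdd /dev/sde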
After discarding the devices, I created the raid again with:
mdadm --create /dev/md20 --level=5 --raid-devices=3 --chunk=64 /dev/sd{b,d,e}1
I left it overnight to sync up. Then I created a VG and tested LV creation; it took 7 minutes to discard 100M:
root@terrok:~ # lvcreate -nDELME -L100M vgssd2 && date && time mkfs.ext4 /dev/vgssd2/DELME && date && time lvremove -f /dev/vgssd2/DELME;date
Logical volume "DELME" created.
Mon Dec 21 12:47:42 EST 2020
mke2fs 1.45.6 (20-Mar-2020)
Discarding device blocks: done
Creating filesystem with 102400 1k blocks and 25688 inodes
Filesystem UUID: f7cf3bb6-2764-4eea-9381-c774312f463b
Superblock backups stored on blocks:
8193, 24577, 40961, 57345, 73729
Allocating group tables: done
Writing inode tables: done
Creating journal (4096 blocks): done
Writing superblocks and filesystem accounting information: done
real 7m20.095s
user 0m0.004s
sys 0m0.195s
Mon Dec 21 12:55:02 EST 2020
Logical volume "DELME" successfully removed
real 7m17.881s
user 0m0.018s
sys 0m0.120s
Mon Dec 21 13:02:20 EST 2020
root@terrok:~ #
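To take LVM out of the picture, a discard can also be timed directly against the md device over a small range (destructive for that range, so only on a scratch array); a minimal sketch, assuming /dev/md20 from above:
# discards (and destroys) the first 1 GiB of the array
time blkdiscard --offset 0 --length 1G /dev/md20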
As a comparison, on the same system I also have two WDC WDS100T2B0A-00SM50 (1T SSD) drives in raid1, and there discard works much faster: 4 seconds for 10G.
I then took two of the Samsung SSDs and made a raid1 out of them: full speed on discard. I repeated this for the other two combinations of drives with no problems. To me this points to some issue with raid5. For now I have two of the SSDs in raid1 with one hot spare, which at least works, but of course that is 2T less space than I counted on.
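The pairwise raid1 check was along these lines (a rough sketch, not the exact commands; /dev/md21 and the partition names are just placeholders):
mdadm --create /dev/md21 --level=1 --raid-devices=2 /dev/sd{b,d}1
# wait for the initial sync to finish (watch /proc/mdstat), then:
time blkdiscard --length 10G /dev/md21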
Any suggestions on what I can do to make this usable with raid5?
Solution 1:
As demonstrated by your testing, RAID 5 involves considerably more work than a simple RAID 1 array, because RAID 1 is literally just mirroring writes between two disks, that's it.
RAID 5, on the other hand, has to spread the data across three disks and calculate parity on top. That's a lot of work, at least in comparison to a "simple" RAID 1 array.
Also, as a sidebar, QVO drives aren't ideal for workloads like hosting VMs, where drive performance is usually at a premium. Neither are parity arrays like RAID 5. With that said, and given the situation with RAID 5 itself, you might want to reconsider your deployment strategy.
Solution 2:
I just tackled this issue too. I dug into the raid5 driver and found that raid5 breaks an incoming discard request into 4k discard requests on the underlying devices. Moreover, it's been broken for quite a while, so it practically ignores devices_handle_discard_safely. As a result, all the 4k discards are issued synchronously to the underlying devices, making the whole operation even slower. Side note: I'm bringing it up on LKML soon, so you can stay tuned there. I'm not aware of any workaround available with existing kernels.
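For completeness, the parameter mentioned above is exposed under /sys/module/raid456/parameters/ (it can also be set as a module or boot option); inspecting and toggling it looks roughly like this, though on kernels with the 4k-split behaviour described above it won't make discard fast:
# show the current value, then enable it
cat /sys/module/raid456/parameters/devices_handle_discard_safely
echo Y > /sys/module/raid456/parameters/devices_handle_discard_safely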