md raid5 extremely slow on discard

Hi all, md raid experts,

Running CentOS 8 (with all latest updates, kernel 4.18.0-240.1.1.el8_3.x86_64) where I have 3x Samsung SSD 860 QVO 2TB in raid5 (to be used as the base for some KVM VMs), and anything that involves discard isn't just slow, it's way beyond usable. I created a 1.5T LV and then ran "mkfs.ext4" on it. After 4h the discard stage told me "Discarding device blocks: 10489856/409148416", and at first I thought "4h for 25%, this sucks", but then I realized it's only 2.5%, so we're talking about a week!
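(For completeness: mke2fs can skip that discard pass entirely with a stock option; it avoids the wait, but obviously doesn't make the discards themselves any faster. The LV path below is just a placeholder.)

# skip the "Discarding device blocks" stage at filesystem creation time
mkfs.ext4 -E nodiscard /dev/vgssd2/somelv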

I broke up the raid and ran blkdiscard on the 3 individual drives; it took about 18 seconds each.
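A per-drive test along those lines looks roughly like this (partition names taken from the mdadm command further down; the exact invocation is my reconstruction, and it wipes the data on each device):

# discard each md member individually and time it
time blkdiscard /dev/sdb1
time blkdiscard /dev/sdd1
time blkdiscard /dev/sde1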

The hardware is an HP ProLiant DL380p Gen8 with a Smart Array P420i controller (no special drivers, everything using stock CentOS drivers) that I configured for HBA mode, so it should be a plain passthrough (discard isn't supported at all when using hw raid).
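A quick way to confirm the controller in HBA mode really passes discard through is to check the discard limits the kernel reports for each disk (generic lsblk usage, not from the original post):

# non-zero DISC-GRAN / DISC-MAX means the drive exposes TRIM through the controller
lsblk --discard /dev/sdb /dev/sdd /dev/sde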

After doing discard on the devices I created the raid again with

mdadm --create /dev/md20 --level=5 --raid-devices=3 --chunk=64 /dev/sd{b,d,e}1
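(The array then needs an initial sync, as mentioned next; progress for that is visible via the usual md interfaces.)

# follow the initial resync of the new array
cat /proc/mdstat
mdadm --detail /dev/md20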

I left it overnight to sync up. Then I created a VG and tested LV creation; it took 7 minutes to discard 100M:

root@terrok:~ # lvcreate -nDELME -L100M vgssd2  && date && time mkfs.ext4 /dev/vgssd2/DELME && date && time lvremove -f /dev/vgssd2/DELME;date
  Logical volume "DELME" created.
Mon Dec 21 12:47:42 EST 2020
mke2fs 1.45.6 (20-Mar-2020)
Discarding device blocks: done
Creating filesystem with 102400 1k blocks and 25688 inodes
Filesystem UUID: f7cf3bb6-2764-4eea-9381-c774312f463b
Superblock backups stored on blocks:
        8193, 24577, 40961, 57345, 73729

Allocating group tables: done
Writing inode tables: done
Creating journal (4096 blocks): done
Writing superblocks and filesystem accounting information: done


real    7m20.095s
user    0m0.004s
sys     0m0.195s
Mon Dec 21 12:55:02 EST 2020
  Logical volume "DELME" successfully removed

real    7m17.881s
user    0m0.018s
sys     0m0.120s
Mon Dec 21 13:02:20 EST 2020
root@terrok:~ #
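To rule out mkfs and LVM overhead, the discard can also be timed on its own against a throwaway LV (the LV name is hypothetical, and blkdiscard wipes its contents):

# time a bare discard of a fresh 100M LV sitting on the raid5 array
lvcreate -n DELME2 -L100M vgssd2
time blkdiscard /dev/vgssd2/DELME2
lvremove -f vgssd2/DELME2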

For comparison, on the same system I also have two WDC WDS100T2B0A-00SM50 drives (1TB SSDs) in raid1, and there discard works so much faster: 4 seconds for 10G.

I then took two of the Samsung SSDs and made a raid1 of them: full speed on discard. I repeated this for the other two combinations of drives with no problems. To me this points to some issue with raid5. For now I have two of the SSDs in raid1 with one hot spare, and this at least works, but it is of course 2T less space than I counted on.
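The pairwise raid1 test looked roughly like this (md device number and member pairing are my reconstruction), followed by the same lvcreate/mkfs timing as above:

mdadm --create /dev/md21 --level=1 --raid-devices=2 /dev/sd{b,d}1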

Any suggestions on what I can do to make this usable with raid5?


Solution 1:

As demonstrated by your testing, RAID5 really is a more intensive operation than a simple RAID1 array. RAID1 is literally just mirroring data between two disks, that's it.

RAID5, on the other hand, has to do parity calculations across all three disks for every stripe it touches. That's a lot of work, at least compared to a "simple" RAID1 array.
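As a toy illustration of that extra bookkeeping: with three devices, each stripe's parity chunk is the XOR of the other two chunks, and any update has to keep that relationship consistent (pure illustration, nothing specific to this array):

# parity for a 3-disk stripe is d1 XOR d2; a lost d1 is rebuilt as parity XOR d2
d1=0xA5; d2=0x3C
p=$(( d1 ^ d2 ))
printf 'parity=0x%02X rebuilt_d1=0x%02X\n' "$p" $(( p ^ d2 ))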

Also, as a side note, QVO drives aren't ideal for workloads like backing VMs, where sustained drive activity is the norm; neither are parity arrays like RAID5. With that said, you might want to reconsider your deployment strategy along with the RAID5 situation itself.

Solution 2:

I just tackled this issue too. I dug into the raid5 driver and found that raid5 breaks an incoming discard request into 4k discard requests against the underlying devices. Moreover, it has been broken for quite a while, to the point that it practically ignores devices_handle_discard_safely. As a result, all of the 4k discards are issued synchronously to the underlying devices, making the whole operation even slower. Side note: I'm bringing this up on LKML soon, so stay tuned there. I'm not aware of any workaround available with existing kernels.
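For reference, that parameter can at least be inspected (and, on kernels that expose it as writable, toggled) at runtime; per the analysis above it may not actually help, but it shows what the current setting is:

# raid456 module parameter, defaults to N
cat /sys/module/raid456/parameters/devices_handle_discard_safely
echo Y > /sys/module/raid456/parameters/devices_handle_discard_safely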