Samba and LUKS-encrypted disk together: huge performance loss despite plenty of CPU resources; LUKS and Samba alone work as expected

I've dug deeper into it.

It seems the CPU usage display is misleading.

Reading with Samba from the unencrypted drive gives 112 MB/s and requires about 38 % CPU usage on the whole system.

CPU usage fluctuates between 29 % and, for short moments, as much as 94 % while reading from the unencrypted drive.

Now taking the encrypted read performance of 110 MB/s and reducing it by 38 % gives 68.2 MB/s. That's quite close to the 64 MB/s observed.

So from a logical point of view: Samba itself needs a relatively large share of the CPU, and in combination with encryption the resulting speed now seems to make sense.
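The estimate above can be reproduced with a one-liner. This is only a rough sketch of the reasoning: it assumes that the CPU share Samba uses is simply unavailable for decryption, which is a simplification, and the variable names are my own.

```shell
# Rough estimate: encrypted read speed reduced by Samba's CPU share.
luks_read_mbs=110   # read speed from the encrypted device (MB/s)
samba_cpu_pct=38    # approximate CPU usage of Samba on the unencrypted drive (%)

estimate=$(LC_ALL=C awk -v speed="$luks_read_mbs" -v cpu="$samba_cpu_pct" \
    'BEGIN { printf "%.1f", speed * (1 - cpu / 100) }')

echo "Estimated Samba + LUKS throughput: ${estimate} MB/s"
```

With the numbers from this post it prints 68.2 MB/s.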

BTW: These tests were run on a Raspberry Pi 400 with a 4-core ARM CPU at the default clock of 1.8 GHz. cryptsetup benchmark reports, for aes-xts with a 512-bit key (so 256-bit AES encryption), 77 MB/s for encryption and 66.9 MB/s for decryption. However, cryptsetup runs these tests with only one CPU core utilized, so I guess power management clocks the CPU down; that would be why real encryption and decryption perform much better, as dd shows.

I've also done some other performance tests.

I've also increased the read-ahead size on both /dev/mapper/sdd_crypt and /dev/sdd from 256 to 65536 via sudo blockdev --setra 65536 /dev/sdd and sudo blockdev --setra 65536 /dev/mapper/sdd_crypt; however, this did not make any noticeable difference.
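For reference, blockdev --setra takes its value in 512-byte sectors, so it helps to convert the numbers to see what was actually changed. The helper below is just a sketch; its name is made up for illustration.

```shell
# Convert a read-ahead value given in 512-byte sectors to KiB.
ra_sectors_to_kib() {
    echo $(( $1 * 512 / 1024 ))
}

echo "old read-ahead: $(ra_sectors_to_kib 256) KiB"
echo "new read-ahead: $(ra_sectors_to_kib 65536) KiB"
```

So the change was from 128 KiB to 32 MiB of read-ahead. The value currently in effect can be checked with sudo blockdev --getra /dev/sdd.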

Digging still deeper into it, I found this very interesting article: https://blog.cloudflare.com/speeding-up-linux-disk-encryption/

Their research led to the no_read_workqueue and no_write_workqueue options, available beginning with kernel version 5.9. Luckily, the current Raspberry Pi OS is on 5.10.11-v7l+, so dm-crypt supports these options.

However, cryptsetup 2.1.0, the latest version on Raspberry Pi OS Buster, doesn't support these options. So I've compiled cryptsetup 2.3.4 to use no_read_workqueue and no_write_workqueue (see https://www.kernel.org/doc/html/latest/admin-guide/device-mapper/dm-crypt.html) and opened the device via

cryptsetup open --perf-no_read_workqueue --perf-no_write_workqueue /dev/sdd sdd_crypt

However, performance was massively reduced on this particular setup, which reads from a real device and not a RAM disk.

In conclusion: since the resulting speeds are plausible, it seems the CPU usage display is misleading.
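To make sure the options were really applied, one can look at the dm-crypt table line, where active optional parameters are listed at the end. The following is a minimal sketch: the function name is hypothetical and the sample line is invented for illustration (zeroed key, arbitrary sizes), not output from my system.

```shell
# Report whether a dm-crypt table line carries a given optional flag.
# $1: one line of `dmsetup table` output, $2: flag name
has_dmcrypt_flag() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Invented sample line, for illustration only
sample='0 1953520 crypt aes-xts-plain64 0000 0 8:48 4096 2 no_read_workqueue no_write_workqueue'

if has_dmcrypt_flag "$sample" no_read_workqueue; then
    echo "read workqueue is bypassed"
fi
```

On the real system the first argument would be something like "$(sudo dmsetup table sdd_crypt)".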