How do I troubleshoot a disk IO performance issue possibly related to dm-crypt/LUKS?
Issue
I recently installed Ubuntu 16.04 LTS (kernel 4.8.0-52) on a Lenovo T460p with an i7-6820HQ, 32GB of RAM, and a 512GB Micron 1100 SSD. I checked the full disk encryption box during the installation and used the default partitioning layout. In general, performance is great.
However, over time my builds started taking about twice as long. Further, during parts of the build that write large files, any (non-build) task that requires disk I/O ends up waiting a lot. This includes launching new programs, loading pages in Firefox, etc. In Firefox, for example, I can navigate the UI and switch tabs, and everything is fine. But if I follow a link, the whole UI locks up until things quiet down.
So in summary, after some period of time, builds suddenly take longer and at certain points during the build the computer is basically unusable.
What can I do to try and diagnose or resolve this issue?
Troubleshooting Info
I don't reboot often, so the system is usually up for several days before I run into this issue. Once I hit it, I flail for a bit trying to figure out the issue, then reboot so I can keep working.
The only thing that resolves the issue is rebooting the machine. I've tried exiting all applications, logging out and back in, and dropping the buffer cache (my flailing theory was that as the cache used up memory, disk syncs were happening more frequently), but only rebooting works.
As a long shot, I tried the solution to this answer but there was no change in behavior.
Running iotop shows the dmcrypt_write thread using 99% I/O whenever I'm experiencing the issue. When I'm not experiencing the issue, I also see dmcrypt_write pop to the top with a relatively high I/O percentage, but it doesn't stay there very long.
If I run
dd if=/dev/urandom of=$HOME/bigfile bs=10k count=200k; sync
when things are working normally, dmcrypt_write will jump to the top for a second or two, but it's nowhere near the same duration as during one of my builds.
A full build generates about 1.4 GB of data. It's a Java project with several modules. So, lots of little files are created plus some larger JAR files that aggregate all those little files.
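One way to put numbers on "dmcrypt_write stays busy" is to watch how much dirty data the kernel is holding while the build or the dd test runs. The following is a minimal sketch, not something from the original post: the function names dirty_writeback_kb and watch_writeback are made up for illustration, and it only reads the Dirty and Writeback counters from /proc/meminfo.

```shell
#!/bin/sh
# Print the Dirty and Writeback counters (in kB) from a meminfo-style file.
# The optional argument is only there so a saved copy can be inspected;
# by default the live /proc/meminfo is read.
dirty_writeback_kb() {
    awk '$1 == "Dirty:" || $1 == "Writeback:" { printf "%s %s\n", $1, $2 }' \
        "${1:-/proc/meminfo}"
}

# Hypothetical helper: sample once per second for N seconds (default 5),
# printing a timestamp followed by both counters on one line.
watch_writeback() {
    n="${1:-5}"
    i=0
    while [ "$i" -lt "$n" ]; do
        printf '%s ' "$(date +%T)"
        dirty_writeback_kb | tr '\n' ' '
        printf '\n'
        i=$((i + 1))
        sleep 1
    done
}
```

Running something like watch_writeback 60 in a second terminal during a build should show whether Dirty climbs into the gigabytes and then drains slowly afterwards, which would line up with the stalls described above.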
There is always plenty of memory available and the swap partition is not being used.
I have coworkers with similar computers (T460p), also running Ubuntu, who are not experiencing this issue. They all seem to have different SSD brands/models, though.
Update
The issue just surfaced again so I did some more testing based on the reply to this question.
- The file system is still not mounted with the discard option, so I instead ran fstrim, assuming that would be somewhat similar to having had the discard option enabled.
- I didn't do enough timing when the issue first happened, but after running fstrim, build speeds seemed to be back to normal... but after the build completes, the dmcrypt_write thread kicks in and makes the system unusable for a period of time. All in all, the total time to build and for the system to become usable seems to be about the same as before.
- I changed /proc/sys/vm/dirty_ratio to 2 and /proc/sys/vm/dirty_background_ratio to 1 and ran some builds. The builds took longer than normal, about the same as the last time I hit this issue, but the system didn't seem to lock up as much. Changing them back to 20 and 10 reverted to the behavior mentioned above.
- On a clean boot, I tried setting /proc/sys/vm/dirty_ratio to 2 and /proc/sys/vm/dirty_background_ratio to 1, and the build time was comparable to the time with them at 20 and 10.
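For context, here is a rough sketch of what those ratios mean in absolute terms. The helper name dirty_threshold_mib is hypothetical, and the calculation uses MemTotal as a stand-in: the kernel actually applies the percentages to dirtyable memory (roughly free plus reclaimable pages), so these numbers are an upper bound rather than the exact thresholds.

```shell
#!/bin/sh
# Approximate a dirty-page threshold in MiB.
# Usage: dirty_threshold_mib <memory_in_KiB> <ratio_percent>
dirty_threshold_mib() {
    echo $(( $1 * $2 / 100 / 1024 ))
}

# Report the thresholds for this machine, if the usual /proc files exist.
if [ -r /proc/meminfo ] && [ -r /proc/sys/vm/dirty_ratio ] \
   && [ -r /proc/sys/vm/dirty_background_ratio ]; then
    mem_kib=$(awk '$1 == "MemTotal:" { print $2 }' /proc/meminfo)
    bg=$(cat /proc/sys/vm/dirty_background_ratio)
    fg=$(cat /proc/sys/vm/dirty_ratio)
    echo "background writeback starts near $(dirty_threshold_mib "$mem_kib" "$bg") MiB dirty"
    echo "writers get throttled near $(dirty_threshold_mib "$mem_kib" "$fg") MiB dirty"
fi
```

On a 32 GB machine, 10 and 20 percent work out to roughly 3.2 GiB and 6.4 GiB, both larger than the 1.4 GB a full build writes. That could be consistent with the write-out landing in one burst after the build finishes, while at 1 and 2 percent (about 330 MiB and 650 MiB) the flushing is spread across the build instead.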
I have exactly the same problem as you, and a quick search turned up this post:
https://blog.cloudflare.com/speeding-up-linux-disk-encryption
The Cloudflare employee did quite a bit of digging through the Linux source code and concluded that the culprit is the current dm-crypt implementation. He solved the problem by basically rewriting the corresponding part of the kernel.
So AFAIK the only two ways to get rid of slowdowns are (1) to compile his version of the kernel, or (2) reboot once in a while (as you said). I chose the latter.
I don't know about LUKS specifically, but for general I/O issues on an SSD, make sure discard is on for your filesystem mount; you can check with grep discard /proc/mounts. You might also try (as root) "echo 1 > /proc/sys/vm/dirty_background_ratio; echo 2 > /proc/sys/vm/dirty_ratio". This gets the system to initiate I/O sooner, when there is less of a backlog of data to write out.
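A plain grep for discard also matches options such as nodiscard, so a slightly stricter check may be worth having. This is only a sketch: has_discard_opt is a made-up helper name, and the loop just reports what /proc/mounts says for filesystems backed by a /dev device.

```shell
#!/bin/sh
# Succeeds if the comma-separated mount option list in $1 contains
# "discard" (or a "discard=..." variant), but not e.g. "nodiscard".
has_discard_opt() {
    case ",$1," in
        *,discard,*|*,discard=*) return 0 ;;
        *) return 1 ;;
    esac
}

# /proc/mounts fields: device mountpoint fstype options dump pass
if [ -r /proc/mounts ]; then
    while read -r dev mnt fstype opts rest; do
        case "$dev" in
            /dev/*) ;;          # only look at device-backed filesystems
            *) continue ;;
        esac
        if has_discard_opt "$opts"; then
            echo "$mnt ($fstype): discard enabled"
        else
            echo "$mnt ($fstype): discard not enabled"
        fi
    done < /proc/mounts
fi
```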
I had a similar issue on Debian 10+11 where, if I did large writing operations on the LUKS-disk, my whole system would freeze up for some time, then respond again, then freeze up again...
I didn't have to reboot though; I just had to wait until the write operation was done.
As ScumCoder wrote, there's a fix available. As of kernel 5.9, the fix is integrated into the kernel.
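To see whether a machine is new enough, comparing the running kernel version against 5.9 is a quick first check. A small sketch follows, assuming GNU sort (for its -V version sort); version_ge is an illustrative helper name, and distribution kernels may carry backports, so treat the result as a hint rather than proof.

```shell
#!/bin/sh
# Succeeds if version $1 is greater than or equal to version $2.
# Uses GNU sort -V: the smaller of the two versions sorts first.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

kernel=$(uname -r)
# Strip the distribution suffix, e.g. "4.8.0-52-generic" becomes "4.8.0".
if version_ge "${kernel%%-*}" "5.9"; then
    echo "kernel $kernel: 5.9 or newer, the dm-crypt fix should be present"
else
    echo "kernel $kernel: older than 5.9, the fix is not in this mainline version"
fi
```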
The following command fixed it for me:
cryptsetup --allow-discards --perf-no_read_workqueue --perf-no_write_workqueue --persistent refresh nvme0n1p3_crypt
I found my mapping name, nvme0n1p3_crypt, by using the command dmsetup table.
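If the dmsetup table output is hard to read, a small filter can pull out just the dm-crypt mappings and their optional flags, which also makes it easy to confirm afterwards that the refresh took effect. This is a sketch with a made-up function name (crypt_flags); it assumes the usual crypt target line layout of name, start, length, "crypt", cipher, key, IV offset, device, offset, then an optional flag count and flags.

```shell
#!/bin/sh
# Read `dmsetup table` output on stdin and print each dm-crypt mapping
# name followed by its optional flags (e.g. allow_discards).
# Field 10 is the flag count, so the flags themselves start at field 11.
crypt_flags() {
    awk '$4 == "crypt" {
        name = $1
        sub(/:$/, "", name)
        flags = ""
        for (i = 11; i <= NF; i++) flags = flags " " $i
        print name ":" (flags == "" ? " (no optional flags)" : flags)
    }'
}
```

Usage would be something like dmsetup table | crypt_flags, run as root. After the cryptsetup refresh above, the line for the mapping should list allow_discards, no_read_workqueue and no_write_workqueue.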
I got inspiration from https://wiki.archlinux.org/title/Dm-crypt/Specialties#Disable_workqueue_for_increased_solid_state_drive_(SSD)_performance
After the fix my writing operations are a lot faster.