How can I override IRQ affinity for NVMe devices?
I am trying to move all interrupts over to cores 0-3 to keep the rest of my cores free for high speed, low latency virtualization.
I wrote a quick script to set IRQ affinity to 0-3:
#!/bin/bash
while IFS= read -r LINE; do
echo "0-3 -> \"$LINE\""
sudo bash -c "echo 0-3 > \"$LINE\""
done <<< "$(find /proc/irq/ -name smp_affinity_list)"
This appears to work for USB devices and network devices, but not NVME devices. They all produce this error:
bash: line 1: echo: write error: Input/output error
And they stubbornly continue to produce interrupts evenly across almost all my cores.
If I check the current affinities of those devices:
$ cat /proc/irq/81/smp_affinity_list
0-1,16-17
$ cat /proc/irq/82/smp_affinity_list
2-3,18-19
$ cat /proc/irq/83/smp_affinity_list
4-5,20-21
$ cat /proc/irq/84/smp_affinity_list
6-7,22-23
...
It appears "something" is taking full control of spreading IRQs across cores and not letting me change it.
It is completely critical that I move these to other cores, as I'm doing heavy IO in virtual machines on these cores and the NVME drives are producing a crap load of interrupts. This isn't Windows, I'm supposed to be able to decide what my machine does.
What is controlling IRQ affinity for these devices and how do I override it?
I am using a Ryzen 3950X CPU on a Gigabyte Auros X570 Master motherboard with 3 NVME drives connected to the M.2 ports on the motherboard.
(Update: I am now using a 5950X, still having the exact same issue)
Kernel: 5.12.2-arch1-1
Output of lspci -v related to NVMe:
01:00.0 Non-Volatile memory controller: Phison Electronics Corporation E12 NVMe Controller (rev 01) (prog-if 02 [NVM Express])
Subsystem: Phison Electronics Corporation E12 NVMe Controller
Flags: bus master, fast devsel, latency 0, IRQ 45, NUMA node 0, IOMMU group 14
Memory at fc100000 (64-bit, non-prefetchable) [size=16K]
Capabilities: [80] Express Endpoint, MSI 00
Capabilities: [d0] MSI-X: Enable+ Count=9 Masked-
Capabilities: [e0] MSI: Enable- Count=1/8 Maskable- 64bit+
Capabilities: [f8] Power Management version 3
Capabilities: [100] Latency Tolerance Reporting
Capabilities: [110] L1 PM Substates
Capabilities: [128] Alternative Routing-ID Interpretation (ARI)
Capabilities: [200] Advanced Error Reporting
Capabilities: [300] Secondary PCI Express
Kernel driver in use: nvme
04:00.0 Non-Volatile memory controller: Phison Electronics Corporation E12 NVMe Controller (rev 01) (prog-if 02 [NVM Express])
Subsystem: Phison Electronics Corporation E12 NVMe Controller
Flags: bus master, fast devsel, latency 0, IRQ 24, NUMA node 0, IOMMU group 25
Memory at fbd00000 (64-bit, non-prefetchable) [size=16K]
Capabilities: [80] Express Endpoint, MSI 00
Capabilities: [d0] MSI-X: Enable+ Count=9 Masked-
Capabilities: [e0] MSI: Enable- Count=1/8 Maskable- 64bit+
Capabilities: [f8] Power Management version 3
Capabilities: [100] Latency Tolerance Reporting
Capabilities: [110] L1 PM Substates
Capabilities: [128] Alternative Routing-ID Interpretation (ARI)
Capabilities: [200] Advanced Error Reporting
Capabilities: [300] Secondary PCI Express
Kernel driver in use: nvme
05:00.0 Non-Volatile memory controller: Phison Electronics Corporation E12 NVMe Controller (rev 01) (prog-if 02 [NVM Express])
Subsystem: Phison Electronics Corporation E12 NVMe Controller
Flags: bus master, fast devsel, latency 0, IRQ 40, NUMA node 0, IOMMU group 26
Memory at fbc00000 (64-bit, non-prefetchable) [size=16K]
Capabilities: [80] Express Endpoint, MSI 00
Capabilities: [d0] MSI-X: Enable+ Count=9 Masked-
Capabilities: [e0] MSI: Enable- Count=1/8 Maskable- 64bit+
Capabilities: [f8] Power Management version 3
Capabilities: [100] Latency Tolerance Reporting
Capabilities: [110] L1 PM Substates
Capabilities: [128] Alternative Routing-ID Interpretation (ARI)
Capabilities: [200] Advanced Error Reporting
Capabilities: [300] Secondary PCI Express
Kernel driver in use: nvme
$ dmesg | grep -i nvme
[ 2.042888] nvme nvme0: pci function 0000:01:00.0
[ 2.042912] nvme nvme1: pci function 0000:04:00.0
[ 2.042941] nvme nvme2: pci function 0000:05:00.0
[ 2.048103] nvme nvme0: missing or invalid SUBNQN field.
[ 2.048109] nvme nvme2: missing or invalid SUBNQN field.
[ 2.048109] nvme nvme1: missing or invalid SUBNQN field.
[ 2.048112] nvme nvme0: Shutdown timeout set to 10 seconds
[ 2.048120] nvme nvme1: Shutdown timeout set to 10 seconds
[ 2.048127] nvme nvme2: Shutdown timeout set to 10 seconds
[ 2.049578] nvme nvme0: 8/0/0 default/read/poll queues
[ 2.049668] nvme nvme1: 8/0/0 default/read/poll queues
[ 2.049716] nvme nvme2: 8/0/0 default/read/poll queues
[ 2.051211] nvme1n1: p1
[ 2.051260] nvme2n1: p1
[ 2.051577] nvme0n1: p1 p2
Solution 1:
What is controlling IRQ affinity for these devices?
Since v4.8, the Linux kernel automatically uses MSI/MSI-X interrupts in the NVMe driver, and with the IRQD_AFFINITY_MANAGED flag it manages the affinity of those interrupts itself, in the kernel.
See these commits:
- 90c9712fbb388077b5e53069cae43f1acbb0102a - NVMe: Always use MSI/MSI-X interrupts
- 9c2555835bb3d34dfac52a0be943dcc4bedd650f - genirq: Introduce IRQD_AFFINITY_MANAGED flag
Judging by your kernel version and the device capabilities shown in your lspci -v output, that is apparently what is happening here.
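Writes to the smp_affinity/smp_affinity_list of a kernel-managed interrupt are rejected with EIO, which is exactly the "write error: Input/output error" you are seeing. As a quick check (a small inspection sketch, not from the original answer), you can list every NVMe interrupt together with the affinity the kernel assigned to it:
# list each nvme IRQ with its current affinity
for irq in $(grep -i nvme /proc/interrupts | awk -F: '{print $1}'); do
    printf 'IRQ %s: %s\n' "$irq" "$(cat /proc/irq/$irq/smp_affinity_list)"
done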
and how do I override it?
Short of disabling that flag and recompiling the kernel, you can probably disable MSI/MSI-X on the PCI bridge above the devices (rather than on the devices themselves):
echo 0 > /sys/bus/pci/devices/$bridge/msi_bus
Note that disabling MSI/MSI-X comes with a performance impact. See this kernel documentation for more details.
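To locate the bridge for one of the NVMe devices you can walk up the sysfs hierarchy. A rough sketch (the PCI address 0000:01:00.0 is taken from the lspci output above; msi_bus only affects drivers bound after the change, so the nvme driver would need to be rebound, or the machine rebooted, for it to take effect):
# find the parent bridge of the NVMe device and turn off MSI/MSI-X below it
dev=0000:01:00.0
bridge=$(basename "$(dirname "$(readlink -f /sys/bus/pci/devices/$dev)")")
echo 0 | sudo tee /sys/bus/pci/devices/$bridge/msi_bus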
Instead of disabling MSI/MSI-X, a better approach is to keep MSI-X enabled but also enable polling mode in the NVMe driver; see Andrew H's answer.
Solution 2:
The simplest solution to this problem is probably just to switch from using IRQ/interrupt mode to polling mode for the NVMe driver.
Add this to /etc/modprobe.d/nvme.conf:
options nvme poll_queues=4
Then rebuild your initramfs (update-initramfs -u on Debian/Ubuntu, mkinitcpio -P on Arch), reboot, and you should see a vast reduction in interrupts from the NVMe devices. You can also play around with the poll queue count in sysfs and with other NVMe driver tweakables (modinfo nvme should give you a list of parameters you can adjust).
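One rough way to check that the poll queues were actually allocated after the reboot (assuming the parameter is exported under /sys/module) is to compare the module parameter with the queue split reported in dmesg:
cat /sys/module/nvme/parameters/poll_queues     # should now report 4
dmesg | grep 'default/read/poll queues'         # the last of the three numbers should be non-zero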
That said, this is all highly dependent on what kernel version you’re running…
Solution 3:
That is intentional.
NVMe devices are supposed to have multiple command queues with associated interrupts, so interrupts can be delivered to the CPU that requested the operation.
For an emulated virtual disk, this is the CPU running the I/O thread, which then decides if the VM CPU needs to be interrupted to deliver the emulated interrupt.
For a PCIe passthrough disk, this is the VM's CPU: it leaves the VM and enters the host interrupt handler, which notices that the interrupt is destined for the virtual CPU and queues it for delivery on the next VM entry after the handler returns, so the VM context is still interrupted only once.
This is pretty much as optimal as it gets. You can pessimize this by delivering the IRQ to another CPU that will then notice that the VM needs to be interrupted, and queue an inter-processor interrupt to direct it where it needs to go.
For I/O that does not belong to a VM, the interrupt should go to a CPU that is not associated with a VM.
For this to work properly, the CPU mapping for the VMs needs to be somewhat static.
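For example, with libvirt you could pin each vCPU to a fixed host core (a minimal sketch; "guest1" and the core numbers are placeholders, not from the original answer):
# pin vCPUs 0-3 of the guest to host cores 4-7, both live and in the persistent config
for vcpu in 0 1 2 3; do
    virsh vcpupin guest1 "$vcpu" "$((vcpu + 4))" --live --config
done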
There is also the CPU isolation framework you could take a look at, but that is probably too heavy-handed.
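If you do go that route, the relevant knob is the managed_irq flag of the isolcpus= boot parameter (available in recent kernels, roughly 5.6 and later), which asks the kernel to keep managed interrupts such as these off the isolated cores where possible. A kernel command line sketch, with 4-15 as a placeholder core range:
isolcpus=managed_irq,domain,4-15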