Server suddenly has high softirq CPU use after reboot

A virtual server with 48 GB of RAM, maintaining about 25k TCP connections (devices in the field logging in to set up an SSH tunnel), ran out of RAM, started swapping, and became slow. We upgraded the machine and rebooted. Even after the 25k connections had been restored and the initial DDoS-like storm of reconnecting devices had been dealt with, the server now shows an enormous amount of softirq use. How do I find the cause?

Here you can see the events:

[Munin CPU graph]

It's striking that there wasn't much softirq activity before. Now, eight kernel threads (ksoftirqd) are spending about 60% CPU on it.
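ksoftirqd is just the kernel's catch-all worker for deferred softirq work, so it helps to know which softirq class is actually burning the cycles. The per-class counters live in /proc/softirqs (one row per class such as NET_RX, TIMER, RCU; one column per CPU). A minimal sketch that prints the per-second delta of each class, summed over all CPUs:

```python
#!/usr/bin/env python3
"""Print per-second deltas of /proc/softirqs, summed over all CPUs."""
import time

def read_softirqs():
    counts = {}
    with open("/proc/softirqs") as f:
        next(f)                              # skip the "CPU0 CPU1 ..." header
        for line in f:
            name, *cols = line.split()
            counts[name.rstrip(":")] = sum(int(c) for c in cols)
    return counts

prev = read_softirqs()
while True:
    time.sleep(1)
    cur = read_softirqs()
    busiest = sorted(cur, key=lambda k: cur[k] - prev.get(k, 0), reverse=True)
    print("  ".join(f"{k}={cur[k] - prev.get(k, 0)}" for k in busiest
                    if cur[k] - prev.get(k, 0)))
    prev = cur
```

If NET_RX dominates, the time is going into packet processing; if it's TIMER, RCU or SCHED, the network traffic itself is probably not the culprit.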

Looking at the Munin graphs, I see that the interrupt rate of PCI-MSI 49153-edge virtio0-input.0 has increased a lot (note the logarithmic y scale):

[Munin interrupts graphs]

The amount of network traffic the machine has to deal with didn't really change.

I wrote a quick Python script that prints, once per second, the interrupt rate of PCI-MSI 49153-edge virtio0-input.0 from /proc/interrupts. It's mostly around 50-100 per second, but every once in a while there is a burst of 5,000-10,000.
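For reference, a sketch of what that script does (it matches the /proc/interrupts row by its trailing device name and sums the per-CPU counters, which stop at the first non-numeric field):

```python
#!/usr/bin/env python3
"""Print interrupts/second for one /proc/interrupts row, matched by device name."""
import sys
import time

DEVICE = "virtio0-input.0"          # trailing device name of the row to watch

def read_total(device):
    with open("/proc/interrupts") as f:
        for line in f:
            if line.rstrip().endswith(device):
                fields = line.split()
                total = 0
                for field in fields[1:]:    # fields[0] is the IRQ number + ":"
                    if not field.isdigit():
                        break               # reached the chip/device description
                    total += int(field)
                return total
    sys.exit(f"no row ending in {device!r} in /proc/interrupts")

prev = read_total(DEVICE)
while True:
    time.sleep(1)
    cur = read_total(DEVICE)
    print(cur - prev)
    prev = cur
```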

During the upgrade, the VM hoster's control panel announced that it needed to migrate the VM to another host, so I theorized that the new host has a different Ethernet controller, a differently emulated interrupt controller, or something along those lines. But they migrated the VM back, and there's no difference.

Another difference is that the kernel went from vmlinuz-4.15.0-45-generic to vmlinuz-4.15.0-72-generic. With all the Intel CPU mitigation patches lately, I can imagine something snuck in there.

The big question is: how do I get to the root cause, or at least get more information about where these interrupts come from? Rebooting the server into the old kernel is possible, but not desirable.
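One way to get a bit more information without rebooting is to apply the same per-second diffing to every row of /proc/interrupts and print the fastest-growing sources, so the bursts can be attributed to a specific device rather than just the one line I already suspect. A sketch along the same lines as the script above:

```python
#!/usr/bin/env python3
"""Every second, print the /proc/interrupts rows that grew the fastest."""
import time

def read_table():
    rows = {}
    with open("/proc/interrupts") as f:
        next(f)                                  # skip the CPU header line
        for line in f:
            fields = line.split()
            label = fields[0].rstrip(":")
            counts = []
            for field in fields[1:]:
                if not field.isdigit():
                    break                        # rest is the description text
                counts.append(int(field))
            desc = " ".join(fields[1 + len(counts):])
            rows[label] = (sum(counts), desc)
    return rows

prev = read_table()
while True:
    time.sleep(1)
    cur = read_table()
    deltas = sorted(
        ((total - prev.get(label, (0, ""))[0], label, desc)
         for label, (total, desc) in cur.items()),
        reverse=True,
    )
    print("---")
    for rate, label, desc in deltas[:5]:
        print(f"{rate:8d}/s  {label:>8}  {desc}")
    prev = cur
```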


It turned out somebody had installed atop, which ships a systemd service that collects process accounting information. Removing it fixed the problem.
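If you run into the same thing and want a rough gauge of how much work process accounting is doing, one proxy is the system-wide fork rate, since accounting writes one record per exiting process; the `processes` field in /proc/stat counts forks since boot. A small sketch:

```python
#!/usr/bin/env python3
"""Print forks/second from the 'processes' counter in /proc/stat."""
import time

def forks_since_boot():
    with open("/proc/stat") as f:
        for line in f:
            if line.startswith("processes "):
                return int(line.split()[1])
    raise RuntimeError("no 'processes' line in /proc/stat")

prev = forks_since_boot()
while True:
    time.sleep(1)
    cur = forks_since_boot()
    print(f"{cur - prev} forks/s")
    prev = cur
```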