Ubuntu on VPS becomes unresponsive: BUG: soft lockup - CPU#0 stuck for 22s
We have a VPS running Ubuntu on Xen. The problem is this: about once a day, at a random time, the server becomes completely unresponsive to the outside world for roughly 20-50 minutes. After this period it becomes responsive again as if nothing had happened: it doesn't lose uptime and it doesn't restart, it simply starts responding again as if it had been in suspended animation.
These outages occur under unexceptional memory and CPU conditions, for example 70% memory and 5% CPU. I have stopped all non-essential services, so usage is very even. The outages don't particularly coincide with periods of increased memory/CPU use (such as daily tasks); they sometimes occur at times of very low CPU use (<2%), though in the past they have also occurred during swapping.
These blackouts have been occurring under both Ubuntu 12.04 LTS and Ubuntu 14.04 LTS, with no change at all (I upgraded Ubuntu specifically to see if it helped with this problem).
It is possible to log into our web host's site and use their administration console to see error messages from during this time. Presumably these messages come from the Xen virtualization; the main message goes like this:
BUG: soft lockup - CPU#0 stuck for 22s! [ksoftirqd/0:3] (repeats many times)
SysRq : Emergency Sync (Sometimes this is the only message in the console)
Others seen previously under different load situations include:
BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0]
(repeated many times) or:
INFO: rcu_sched detected stall on CPU 0 (t=15000 jiffies)
(repeated many times with t getting bigger)
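For what it's worth, the "stuck for 22s" figure in these messages comes from the kernel's soft-lockup watchdog. If anyone wants to check the threshold on their own kernel, it is exposed via sysctl; a sketch (exact names and defaults depend on kernel version):

    sysctl kernel.watchdog_thresh    # soft lockup is reported at roughly 2x this value, in seconds
    sysctl kernel.softlockup_panic   # 0 = only warn on soft lockup, 1 = panic the machine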
From googling around I've tried various kernel parameters, such as nohz=off and acpi=off, to no avail. All that tech support has said is that other Ubuntu installations are not suffering from the same problem.
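For anyone wanting to try the same thing, the usual way to set these on Ubuntu is via the GRUB default command line (a sketch, assuming GRUB 2; the parameter list is just an example of what I tried, not a recommendation):

    # /etc/default/grub
    GRUB_CMDLINE_LINUX_DEFAULT="quiet nohz=off acpi=off"

    # then regenerate grub.cfg and reboot
    sudo update-grub
    sudo reboot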
Anyone got any ideas or experience with this problem?
Solution 1:
Well, I couldn't find any solution to this problem, whatever I tried. In the end I replaced Ubuntu with Debian 7.0 and the problem went away, along with some anomalous CPU usage which didn't show up in top but did show up in the VPS monitoring panel (this CPU usage manifested as a gradual increase over 2-3 days up to 10%, followed by a drop back to 0%, producing a 'sawtooth' pattern on the CPU usage graph).

I did not try re-installing Ubuntu (although I did try upgrading to 14.04), so I cannot say for sure that replacing Ubuntu with Debian was the solution. Nevertheless, Debian has been as rock-solid as one would expect from its reputation; sadly, I cannot say the same for Ubuntu living up to its reputation. I love Ubuntu and I absolutely love Unity, but it appears Ubuntu really isn't stable on as wide a range of hardware as Debian is.
I have answered my own question because 1) I did find a solution and 2) I couldn't find a solution anywhere else (except in the case of CentOS, where the fix was downgrading CentOS 6 to CentOS 5), so this may be useful, if perhaps not welcome, to others with this problem. I know I wouldn't be happy with the solution, "replace Ubuntu with Debian!", but in the end that's what I did to fix the issue. Incidentally, I settled on Debian because I found no reports of this problem for Debian, while I did find reports of it for both Ubuntu and CentOS.
Solution 2:
Hope this helps anyone looking at this problem in the future.
We've experienced this issue in a similar environment:
- Ubuntu 14.04 3.13.0 Kernel
- QEMU KVM environment
Our Splunk cluster master was issuing these warnings on average every five minutes. CPU load would go up to 35% routinely, and the warnings would list splunkd or python as the process most likely to have caused the lock.
After much hair-pulling and gnashing of teeth, in desperation we changed the disk bus setting in Virt-Manager from 'virtio' to 'SATA'.
The problem went away.
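For reference, if you drive the guest with libvirt directly rather than through Virt-Manager, the same change can be made in the domain XML via virsh edit; a rough sketch (the guest name and device names here are only examples, not our actual setup):

    virsh edit guestname

    <!-- in the <devices> section, before: disk attached on the virtio bus -->
    <target dev='vda' bus='virtio'/>

    <!-- after: same disk on the SATA bus (note the device name changes too) -->
    <target dev='sda' bus='sata'/>

The guest needs a full shutdown and start (not just a reboot from inside) for the bus change to take effect, and if /etc/fstab refers to /dev/vda* by name rather than by UUID it will need updating as well.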
At the moment we are still monitoring the system, but it hasn't issued any more warnings since the change (half an hour so far) and CPU load is stable at around 2%.
I know that it's a little early to break out the champagne and fireworks, but we are hopeful.