VM CPU usage at 100%
CPU usage on our metrics box is at 100% intermittently causing:
'Internal server error' when rendering Grafana dashboards
The only application running on our machine is Docker with 3 subcontainers
- cadvisor
graphite
grafana
Machine spec
OS Version Ubuntu 16.04 LTS
Release 16.04 (xenial)
Kernel Version 4.4.0-103-generic
Docker Version 17.09.0-ce
CPU 4 cores
Memory 4096 MB
Memory reservation is unlimited
Network adapter mgnt
Storage
Driver overlay2
Backing Filesystem extfs
Supports d_type true
Native Overlay Diff true
Memory swap limit is 2.00GB
Here is a snippet from cAdvisor:
The kworker and ksoftirqd processes change status constently from 'D' to 'R' to 'S'
Are the machine specs correct for this setup?
How can I get the CPU usage to 'normal' levels?
EDIT
After increasing memory from 4GB to 8GB it worked as expected for a few days but over time the CPU usage increased:
You have 4 ksoftirqds using 39% of your CPU. That is rather high, and can indicate a number of issues such as high I/O load, problems with power management, or kernel/device driver bugs.
Try updating to the latest kernel, making sure you have the appropriate variant selected (e.g. there are Ubuntu kernels specifically tuned for AWS & Azure), and have a look into some of the Linux I/O performance troubleshooting tools.
A great resource on Linux performance troubleshooting generally is Brendan Gregg's blog
It looks like the kernel is using tons of CPU kworker
threads which is often caused by a buggy kernel driver.
To debug, trigger a backtrace with echo l > /proc/sysrq-trigger
which will cause output to be generated in dmesg
. Run it a few times to see if it's consistent. Based on this thread it might be obvious which driver is causing the high load. One thought is if you're running this on ESXi the e1000 network interface driver is notoriously buggy.