systemd process with unusually high CPU usage on a Kubernetes cluster

I'm running a single master/node Kubernetes cluster in a CentOS 7 virtual machine, and I noticed that the systemd process (running as PID 1) is constantly consuming CPU:

[root@ip-10-0-0-66 ~]# ps aux | head -n2
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  7.7  0.0 129088  7720 ?        Ss   Jun21 639:41 /usr/lib/systemd/systemd --switched-root --system --deserialize 22

Furthermore, systemd is generating millions of log lines like this:

[root@ip-10-0-0-66 ~]# tail -n 10 /var/log/messages
Jun 27 12:49:14 ip-10-0-0-66 systemd: Created slice libcontainer_6148_systemd_test_default.slice.
Jun 27 12:49:14 ip-10-0-0-66 systemd: Removed slice libcontainer_6148_systemd_test_default.slice.
Jun 27 12:49:15 ip-10-0-0-66 systemd: Created slice libcontainer_6155_systemd_test_default.slice.
Jun 27 12:49:15 ip-10-0-0-66 systemd: Removed slice libcontainer_6155_systemd_test_default.slice.
Jun 27 12:49:15 ip-10-0-0-66 systemd: Created slice libcontainer_6155_systemd_test_default.slice.
Jun 27 12:49:15 ip-10-0-0-66 systemd: Removed slice libcontainer_6155_systemd_test_default.slice.
Jun 27 12:49:15 ip-10-0-0-66 systemd: Created slice libcontainer_6162_systemd_test_default.slice.
Jun 27 12:49:15 ip-10-0-0-66 systemd: Removed slice libcontainer_6162_systemd_test_default.slice.
Jun 27 12:49:15 ip-10-0-0-66 systemd: Created slice libcontainer_6162_systemd_test_default.slice.
Jun 27 12:49:15 ip-10-0-0-66 systemd: Removed slice libcontainer_6162_systemd_test_default.slice.

Almost 50 log lines are being written per second, flooding the /var/log/messages file:

[root@ip-10-0-0-66 ~]# sudo wc -l /var/log/messages
5992826 /var/log/messages
[root@ip-10-0-0-66 ~]# sudo cat /var/log/messages | grep 'systemd_test_default' | wc -l
5987033
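
As a rough sanity check on that rate, counting how many of these messages share a single one-second timestamp gives a lines-per-second figure (the timestamp below is just an example taken from the excerpt above):

$ sudo grep -c 'Jun 27 12:49:15.*systemd_test_default' /var/log/messages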

Eventually, the kubelet process also starts registering errors like these:

Jun 27 12:53:37 ip-10-0-0-66 systemd: Removed slice libcontainer_29206_systemd_test_default.slice.
Jun 27 12:53:37 ip-10-0-0-66 systemd: Created slice libcontainer_29206_systemd_test_default.slice.
Jun 27 12:53:37 ip-10-0-0-66 systemd: Removed slice libcontainer_29206_systemd_test_default.slice.
Jun 27 12:53:37 ip-10-0-0-66 kubelet: W0627 12:53:37.447052    5352 watcher.go:87] Error while processing event ("/sys/fs/cgroup/memory/libcontainer_29206_systemd_test_default.slice": 0x40000100 == IN_CREATE|IN_ISDIR): readdirent: no such file or directory
Jun 27 12:53:37 ip-10-0-0-66 kubelet: W0627 12:53:37.447117    5352 watcher.go:87] Error while processing event ("/sys/fs/cgroup/devices/libcontainer_29206_systemd_test_default.slice": 0x40000100 == IN_CREATE|IN_ISDIR): open /sys/fs/cgroup/devices/libcontainer_29206_systemd_test_default.slice: no such file or directory
Jun 27 12:53:37 ip-10-0-0-66 systemd: Created slice libcontainer_29225_systemd_test_default.slice.
Jun 27 12:53:37 ip-10-0-0-66 systemd: Removed slice libcontainer_29225_systemd_test_default.slice.
Jun 27 12:53:37 ip-10-0-0-66 systemd: Created slice libcontainer_29232_systemd_test_default.slice.
Jun 27 12:53:37 ip-10-0-0-66 systemd: Removed slice libcontainer_29232_systemd_test_default.slice.

How can I figure out what is causing this systemd CPU usage and these log messages?


Additional info

Versions:

  • OS: CentOS Linux release 7.6.1810 (Core)
  • Kernel: 3.10.0-957.21.2.el7.x86_64
  • Kubernetes: v1.15.0
  • Docker: 18.09.5 CE
  • systemd: 219

Cgroup driver config:

$ docker info | grep -i cgroup
Cgroup Driver: systemd

$ cat /var/lib/kubelet/kubeadm-flags.env
KUBELET_KUBEADM_ARGS="--cgroup-driver=systemd"

This is a known bug in the kubelet process, and it isn't limited to CentOS-based systems: it affects any Linux distribution (including Ubuntu) where systemd is used as the cgroup driver for the kubelet. It seems to have started showing up in the wild after version 1.14, but it may simply not have been as prevalent before then, since 1.14 is roughly when the official Kubernetes docs began recommending systemd as the cgroup driver, for the following reason:

When systemd is chosen as the init system for a Linux distribution, the init process generates and consumes a root control group (cgroup) and acts as a cgroup manager. Systemd has a tight integration with cgroups and will allocate cgroups per process. It’s possible to configure your container runtime and the kubelet to use cgroupfs. Using cgroupfs alongside systemd means that there will then be two different cgroup managers.

Control groups are used to constrain resources that are allocated to processes. A single cgroup manager will simplify the view of what resources are being allocated and will by default have a more consistent view of the available and in-use resources. When we have two managers we end up with two views of those resources. We have seen cases in the field where nodes that are configured to use cgroupfs for the kubelet and Docker, and systemd for the rest of the processes running on the node becomes unstable under resource pressure.

Changing the settings such that your container runtime and kubelet use systemd as the cgroup driver stabilized the system. Please note the native.cgroupdriver=systemd option in the Docker setup below.

source: https://kubernetes.io/docs/setup/cri/
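
The "Docker setup below" that the quote refers to lives on that docs page rather than in this post; the relevant piece is a daemon.json along these lines (a minimal sketch — merge it with any other options you already have in /etc/docker/daemon.json):

cat > /etc/docker/daemon.json <<'EOF'
{
  "exec-opts": ["native.cgroupdriver=systemd"]
}
EOF
systemctl daemon-reload
systemctl restart docker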

Prior to that, the other cgroup driver, cgroupfs, seems to have been the accepted/default approach. In fact, I only switched over because the recommendation started appearing while kubeadm init'ing a new 1.14.x cluster a few months back, which led me to find this GitHub issue about exactly this situation.

Basically, it seems to be an unreliable interaction between the kubelet's handling of systemd and cAdvisor's probing, so perhaps the code is not quite ready for prime time yet. The full gory technical explanation is available in this comment:

The 'missing' slice is created by runc's systemd code. This is why the error is only seen when systemd is configured as the cgroup manager.

The error comes from when cadvisor starts trying to collect and process events from the newly created 'slice/container'.

Presumably there is a race condition here, whereby cadvisor is unaware that the 'slice/container' it is trying to start an event watcher for has been deleted by runc.
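
You can actually watch those short-lived slices come and go on an affected node. A quick way (assuming inotify-tools is installed and the memory controller is mounted at the usual path) is something like:

# Monitor the memory cgroup hierarchy for the transient test slices
inotifywait -m -e create,delete /sys/fs/cgroup/memory/ | grep --line-buffered systemd_test_default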

The issue has been open since April without much sign of being tackled (it appears to be low priority).

The last commit to touch this code is this one; however, it looks like that mainly changed the file naming/directory structure and layout of the code, while the cAdvisor code itself was introduced long before.

Finally, while switching back to cgroupfs is an option (as @hanx commented), it may lead to WORSE problems (instability under load), and it is NOT recommended by the official docs. Some people still opt for that route, just to get rid of the (mostly benign) error messages.
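
For completeness, a rough sketch of what that switch looks like on a kubeadm-provisioned node like the one above (the paths and flags are taken from the config shown earlier — verify them on your own system, and remember the docs advise against this):

# 1. Tell Docker to use cgroupfs (merge with any other options already in daemon.json)
cat > /etc/docker/daemon.json <<'EOF'
{
  "exec-opts": ["native.cgroupdriver=cgroupfs"]
}
EOF

# 2. Point the kubelet at the same driver
sed -i 's/--cgroup-driver=systemd/--cgroup-driver=cgroupfs/' /var/lib/kubelet/kubeadm-flags.env

# 3. Restart both daemons and confirm the driver
systemctl restart docker kubelet
docker info | grep -i cgroup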