Systemd fails to run in a docker container when using cgroupv2 (--cgroupns=private)

Solution 1:

tl;dr

It seems to me that this use case is not explicitly supported yet. You can almost get it working but not quite.

The root cause

When systemd sees a unified cgroupfs mounted at /sys/fs/cgroup, it assumes it should be able to write to it. Normally that is possible, but not in this case.

The basics

First of all, you need to create a systemd slice for docker containers and tell docker to use it - my current /etc/docker/daemon.json:

{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "features": { "buildkit": true },
  "experimental": true,
  "cgroup-parent": "docker.slice"
}

Note: Not all of these options are necessary. The most important one is cgroup-parent. The cgroup driver should already default to "systemd".
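After changing daemon.json you need to restart the daemon for the options to take effect. A rough sketch (assuming docker itself is managed by systemd; the slice only becomes visible once at least one container has been started):

sudo systemctl restart docker

# once a container is running, the parent slice should show up with the container scopes below it
systemctl status docker.slice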

Each slice gets its own nested cgroup. There is one caveat though: each cgroup may only be a "leaf" or an "intermediary", not both. Once a process takes ownership of a cgroup, no other process can manage it. This means that the actual container process needs, and will get, its own private group attached below the configured one, in the form of a systemd scope.

Reference: see the systemd documentation on resource control, cgroup namespace handling, and delegation.
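A quick way to check that delegation is in effect for a running container - the unit name below is an assumption (with the systemd cgroup driver the scope is typically called docker-<full container id>.scope, and your_container is a placeholder; list yours with systemctl list-units '*.scope'):

# show whether the container's scope unit has cgroup delegation enabled
systemctl show -p Delegate docker-"$(docker inspect -f '{{.Id}}' your_container)".scope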

Note: At this point the docker daemon should use --cgroupns=private by default, but you can pass it explicitly anyway.

Now a newly started container will get its own group which should be available in a path that (depending on your setup) resembles:

/sys/fs/cgroup/your_docker_parent.slice/your_container.scope
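For illustration, something along these lines lets you see that group from the host (the image, the container name cgtest and the exact scope naming are assumptions; with the systemd cgroup driver the scope is usually named docker-<full id>.scope):

# start a throwaway container with a private cgroup namespace
docker run -d --name cgtest --cgroupns=private debian:bookworm sleep infinity

# locate its delegated cgroup on the host, below the configured parent slice
ls -d /sys/fs/cgroup/docker.slice/docker-"$(docker inspect -f '{{.Id}}' cgtest)".scope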

And here is the important part: you must not mount a volume into the container's /sys/fs/cgroup. The path to its private group mentioned above should get mounted there automatically.
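You can verify from inside the container that the namespace and the automatic mount are in place (continuing with the hypothetical cgtest container from above):

# with a private cgroup namespace, the container sees itself at the cgroup root
docker exec cgtest cat /proc/1/cgroup          # should print just: 0::/

# and /sys/fs/cgroup inside the container is its own delegated group, not the host root
docker exec cgtest cat /sys/fs/cgroup/cgroup.procs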

The goal

Now, in theory, the container should be able to manage this delegated, private group by itself almost fully. This would allow its own init process to create child groups.
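In concrete terms, "manage the group by itself" means that PID 1 inside the container could do things like the following (a sketch; cgtest is the hypothetical container from above and my.slice is just an example name):

# a writable delegated cgroup would let the container's init create its own sub-groups...
docker exec cgtest mkdir /sys/fs/cgroup/my.slice
# ...which is exactly what systemd does on startup (init.scope, system.slice, ...).
# Right now this fails with a read-only filesystem error, which is the problem below.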

The problem

The problem is that the /sys/fs/cgroup path in the container gets mounted read-only. I've checked apparmor rules and switched seccomp to unconfined to no avail.
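To see this for yourself, a sketch of the checks described above (cgtest is still the hypothetical container; based on the behaviour described here, the mount options include ro):

# the cgroup2 mount inside the container is read-only
docker exec cgtest grep cgroup2 /proc/mounts

# relaxing seccomp and apparmor does not change that
docker run --rm --cgroupns=private \
    --security-opt seccomp=unconfined --security-opt apparmor=unconfined \
    debian:bookworm grep cgroup2 /proc/mounts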

The hypothesis

I am not completely certain yet - my current hypothesis is that this is a security feature of docker/moby/containerd. Without private cgroups it makes perfect sense to mount this path read-only.

Potential solutions

What I've also discovered is that enabling user namespace remapping causes the private /sys/fs/cgroup to be mounted rw, as expected!
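A sketch of how to enable it, assuming you merge the key into /etc/docker/daemon.json by hand ("default" makes docker create and use the dockremap user and its subordinate uid/gid ranges from /etc/subuid and /etc/subgid):

# add to /etc/docker/daemon.json (merge with your existing keys):
#
#   "userns-remap": "default"
#
# then restart the daemon; note that this moves image/container storage into a
# per-mapping subdirectory of /var/lib/docker, so existing containers "disappear"
sudo systemctl restart docker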

This is far from perfect though - the cgroup mount (among others) has the wrong ownership: it is owned by the real system root (UID 0) while the container has been remapped to a completely different user. Once I had manually adjusted the owner, the container was able to start its systemd init successfully.
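For reference, the manual fix was essentially a chown of the container's delegated cgroup on the host. A sketch, where 100000 is only an example first subordinate uid/gid - look up the real range for dockremap in /etc/subuid and /etc/subgid, and adjust the scope path to your cgroup-parent:

CID=$(docker inspect -f '{{.Id}}' your_container)     # your_container is a placeholder
sudo chown -R 100000:100000 /sys/fs/cgroup/docker.slice/docker-"$CID".scope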

I suspect this is a deficiency of docker's userns remapping feature and might be fixed sooner or later. Keep in mind that I might be wrong about this - I did not confirm it.

Discussion

Userns remapping has a lot of drawbacks, and the best possible scenario for me would be to get the cgroupfs mounted rw without it. I still don't know whether this is done on purpose or whether it's some kind of limitation of the cgroup/userns implementation.

Notes

It's not enough that your kernel has cgroupv2 enabled. Depending on the Linux distribution, the bundled systemd might prefer v1 by default.

You can tell systemd to use cgroupv2 via a kernel command line parameter:
systemd.unified_cgroup_hierarchy=1

It might also be needed to explicitly disable hybrid cgroupv1 support to avoid problems, using: systemd.legacy_systemd_cgroup_controller=0

Or completely disable cgroupv1 in the kernel with: cgroup_no_v1=all
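After rebooting you can verify which hierarchy is actually in use, and that the parameters made it onto the command line:

stat -fc %T /sys/fs/cgroup    # "cgroup2fs" means the unified (v2) hierarchy is mounted
cat /proc/cmdline             # the systemd.* / cgroup_no_v1 parameters should appear here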

Solution 2:

For those wondering how to solve this with the kernel commandline:

# echo 'GRUB_CMDLINE_LINUX="$GRUB_CMDLINE_LINUX systemd.unified_cgroup_hierarchy=false"' > /etc/default/grub.d/cgroup.cfg
# update-grub

This creates a "hybrid" cgroup setup, which makes the host's cgroup v1 hierarchy available to the container's systemd again.
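After a reboot you can confirm that the v1 controller mounts are back (in a hybrid setup /sys/fs/cgroup is a tmpfs with per-controller v1 mounts, plus the v2 hierarchy under /sys/fs/cgroup/unified):

# expect both "cgroup" (v1) and "cgroup2" entries in the mount table
grep cgroup /proc/mounts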

https://github.com/systemd/systemd/issues/13477#issuecomment-528113009