Why does Linux sysfs modification work in plain Docker but not under Kubernetes?
The command being run inside the containers is:
echo never | tee /sys/kernel/mm/transparent_hugepage/enabled
Both containers run as privileged, but in the container started by Kubernetes the command fails with the error:
tee: /sys/kernel/mm/transparent_hugepage/enabled: Read-only file system
and under plain Docker, started with
docker run -it --privileged alpine /bin/sh
the command works fine.
I have used docker inspect on both the Kubernetes-managed and the plain Docker container to verify their privileged status, and I don't see anything else listed that should cause this problem. I've run diff between both outputs and then used docker run with modifications to try to reproduce the problem in plain Docker, but failed (it keeps working). Any idea why the Kubernetes container fails while the plain Docker container succeeds?
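For reference, this is roughly how the comparison above can be done; the container IDs are placeholders:

# check the privileged flag on each container
docker inspect --format '{{.HostConfig.Privileged}}' <k8s-container-id>
docker inspect --format '{{.HostConfig.Privileged}}' <plain-container-id>

# dump and diff the full configurations
docker inspect <k8s-container-id> > k8s.json
docker inspect <plain-container-id> > plain.json
diff k8s.json plain.json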
This is reproducible with the following pod definition:
apiVersion: v1
kind: Pod
metadata:
  name: sys-fs-edit
spec:
  containers:
  - image: alpine
    command:
    - /bin/sh
    args:
    - -c
    - echo never | tee /sys/kernel/mm/transparent_hugepage/enabled && sysctl -w net.core.somaxconn=8192 vm.overcommit_memory=1 && sleep 9999999d
    imagePullPolicy: Always
    name: sysctl-buddy
    securityContext:
      privileged: true
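Assuming the manifest is saved as sys-fs-edit.yaml, the failure shows up directly in the container logs:

kubectl apply -f sys-fs-edit.yaml
kubectl logs sys-fs-edit
# tee: /sys/kernel/mm/transparent_hugepage/enabled: Read-only file system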
Workaround
While I still don't know the cause of the discrepancy, the problem can be mitigated by mounting the host's /sys into the container read-write:
apiVersion: v1
kind: Pod
metadata:
  name: sys-fs-edit
spec:
  containers:
  - image: alpine
    command:
    - /bin/sh
    args:
    - -c
    - echo never | tee /sys/kernel/mm/transparent_hugepage/enabled && sysctl -w net.core.somaxconn=8192 vm.overcommit_memory=1 && sleep 9999999d
    imagePullPolicy: Always
    name: sysctl-buddy
    securityContext:
      privileged: true
    volumeMounts:
    - mountPath: /sys
      name: sys
      readOnly: false
  volumes:
  - hostPath:
      path: /sys
    name: sys
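Once the pod is running, the change can be verified from outside; a quick check, assuming the pod name above:

kubectl exec sys-fs-edit -- cat /sys/kernel/mm/transparent_hugepage/enabled
# the active setting is shown in brackets, e.g.: always madvise [never]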
On Kubernetes it works a bit differently. Setting privileged: true in a container's securityContext is not enough to be able to modify arbitrary sysctls from within that container.
Take a look at the section of the official Kubernetes docs that describes Using sysctls in a Kubernetes Cluster. As you can read there:
Sysctls are grouped into safe and unsafe sysctls. In addition to proper namespacing, a safe sysctl must be properly isolated between pods on the same node. This means that setting a safe sysctl for one pod
- must not have any influence on any other pod on the node
- must not allow to harm the node's health
- must not allow to gain CPU or memory resources outside of the resource limits of a pod.
By far, most of the namespaced sysctls are not necessarily considered safe. The following sysctls are supported in the safe set:
- kernel.shm_rmid_forced
- net.ipv4.ip_local_port_range
- net.ipv4.tcp_syncookies
- net.ipv4.ping_group_range (since Kubernetes 1.18)
So in short, there are safe and unsafe sysctls. Most of them are considered unsafe, even many of those that are namespaced. Unsafe sysctls need to be additionally enabled by the cluster admin on a node-by-node basis:
All safe sysctls are enabled by default.
All unsafe sysctls are disabled by default and must be allowed manually by the cluster admin on a per-node basis. Pods with disabled unsafe sysctls will be scheduled, but will fail to launch.
With the warning above in mind, the cluster admin can allow certain unsafe sysctls for very special situations such as high-performance or real-time application tuning. Unsafe sysctls are enabled on a node-by-node basis with a flag of the kubelet; for example:
kubelet --allowed-unsafe-sysctls \
  'kernel.msg*,net.core.somaxconn' ...
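Once a sysctl has been whitelisted on the node (or belongs to the safe set), the supported way to set it is through the pod-level securityContext rather than from a privileged shell. A minimal sketch for the net.core.somaxconn value used in the question (the pod name is reused from it):

apiVersion: v1
kind: Pod
metadata:
  name: sysctl-buddy
spec:
  securityContext:
    sysctls:
    # unsafe sysctl: requires --allowed-unsafe-sysctls on the node's kubelet
    - name: net.core.somaxconn
      value: "8192"
  containers:
  - name: sysctl-buddy
    image: alpine
    command: ["sleep", "9999999"]

If the sysctl has not been whitelisted on the node, the kubelet refuses to launch the pod and kubectl get pod reports a SysctlForbidden status. Note that this mechanism only covers namespaced sysctls under /proc/sys; files under /sys, such as the transparent_hugepage setting from the question, are not sysctls and still require node-level access.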
So you cannot simply set arbitrary sysctls, even from a privileged container running on your Kubernetes cluster.