How do I mount a private /proc inside a namespace inside a docker container?
Solution 1:
This command works:
sudo docker run --cap-add=sys_admin --security-opt label:disable -it fedora:rawhide /bin/sh -c 'for dir in $(awk '"'"'/\/proc\// { print $5; }'"'"' /proc/1/mountinfo ); do umount "$dir"; done; /usr/bin/unshare -Ufmp -r /bin/sh -c '"'"'mount --make-private / ; mount -t proc proc /proc ; ls /proc'"'"
I didn't split it over multiple lines because the quoting is really important. Basically, it unmounts everything that has been mounted on top of paths inside /proc, then runs unshare and mounts a fresh /proc in the child user namespace.
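For readability, here is a sketch of the script that the outer /bin/sh -c receives once the '"'"' escapes are resolved. It is wrapped in a function (the name run_with_private_proc is made up for illustration) so that nothing runs until you call it. It assumes you are root inside a container started with the same docker run options as above.

```shell
# Hypothetical wrapper name; the body is the one-liner's payload with the
# shell quoting resolved. Intended to run as root INSIDE the container only.
run_with_private_proc() {
    # Unmount everything mounted on top of paths inside /proc.
    for dir in $(awk '/\/proc\// { print $5; }' /proc/1/mountinfo); do
        umount "$dir"
    done

    # New user (-U, root-mapped with -r), mount (-m) and PID (-p) namespaces,
    # forking (-f) so the child shell becomes PID 1 of the new PID namespace.
    /usr/bin/unshare -Ufmp -r /bin/sh -c \
        'mount --make-private / ; mount -t proc proc /proc ; ls /proc'
}
```

Calling run_with_private_proc from a shell inside the container should be equivalent to the tail of the docker command above. Do not call it on a host system, since it unmounts things under /proc.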
Docker covers a number of directories and files in /proc with mounts of its own: empty tmpfs directories over the directories and null files over the files. It does this because various files in /proc represent values that apply to the whole system, not just to the container. For example, /proc/kcore would let you read kernel memory from inside the container if you were root, which would surprise the many people who want to believe that containers are some kind of lightweight VM.
In the kernel (as of version 4.14 anyway), the function mnt_already_visible in fs/namespace.c checks whether you are mounting a filesystem that is already mounted. If that existing mount has child mounts with the MNT_LOCKED flag set, the new mount fails. The MNT_LOCKED flag seems to be applied (I didn't hunt down where this happens in the kernel) to all mounts whenever you create a user namespace. The purpose is to stop you from using the privileges you gain within the user namespace to unmount things and make hidden content visible again.
The command I posted uses an awk script on the contents of /proc/1/mountinfo to pull out every subdirectory and file in /proc that Docker has mounted over, and unmounts them all. This makes the /proc filesystem mountable in nested user namespaces again.
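To see what the awk step selects, here is a small sketch run against made-up mountinfo lines (the sample is illustrative, not captured from a real container). The helper name list_proc_submounts is hypothetical. Note that it is slightly stricter than the pattern in the command above: it matches only on field 5, the mount point, instead of anywhere in the line.

```shell
# Print mount points strictly below /proc from mountinfo-formatted input.
# $1 is a file name, or "-" for stdin; defaults to /proc/1/mountinfo.
# Field 5 of each mountinfo line is the mount point.
list_proc_submounts() {
    awk '$5 ~ /^\/proc\// { print $5 }' "${1:-/proc/1/mountinfo}"
}

# Illustrative sample input (assumed values, not real container output).
sample=$(mktemp)
cat > "$sample" <<'EOF'
100 90 0:50 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
101 100 0:50 /bus /proc/bus ro,relatime - proc proc rw
102 100 0:51 /null /proc/kcore rw,nosuid - tmpfs tmpfs rw
103 100 0:52 / /proc/scsi ro,relatime - tmpfs tmpfs ro
104 90 0:53 / /sys ro,nosuid - sysfs sysfs ro
EOF

# Prints /proc/bus, /proc/kcore and /proc/scsi, one per line.
list_proc_submounts "$sample"
rm -f "$sample"
```

The /proc mount itself and the /sys mount are skipped, so only the entries covering paths inside /proc would be handed to umount.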