entrypoint: fix chicken-and-egg runtime problem
If the runtime used to run the KIND container is not aware of some
cgroup subsystems, those subsystems are exposed to the container without
proper scoping (note rdma and misc below):

	kir@ubu2110:~$ sudo docker run -i --rm --privileged ubuntu sh -xc 'cat /proc/self/cgroup; grep cgroup /proc/self/mountinfo'
	+ cat /proc/self/cgroup
	13:pids:/docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b
	12:net_cls,net_prio:/docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b
	11:hugetlb:/docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b
	10:misc:/
	9:freezer:/docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b
	8:devices:/docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b
	7:cpu,cpuacct:/docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b
	6:perf_event:/docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b
	5:memory:/docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b
	4:blkio:/docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b
	3:rdma:/
	2:cpuset:/docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b
	1:name=systemd:/docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b
	0::/system.slice/containerd.service
	+ grep cgroup /proc/self/mountinfo
	666 665 0:65 / /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime - tmpfs tmpfs rw,mode=755,inode64
	667 666 0:32 /docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime master:11 - cgroup cgroup rw,xattr,name=systemd
	668 666 0:35 /docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime master:15 - cgroup cgroup rw,cpuset
	669 666 0:36 / /sys/fs/cgroup/rdma rw,nosuid,nodev,noexec,relatime master:16 - cgroup cgroup rw,rdma
	670 666 0:37 /docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime master:17 - cgroup cgroup rw,blkio
	671 666 0:38 /docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime master:18 - cgroup cgroup rw,memory
	672 666 0:39 /docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime master:19 - cgroup cgroup rw,perf_event
	673 666 0:40 /docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime master:20 - cgroup cgroup rw,cpu,cpuacct
	674 666 0:41 /docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime master:21 - cgroup cgroup rw,devices
	675 666 0:42 /docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime master:22 - cgroup cgroup rw,freezer
	676 666 0:43 / /sys/fs/cgroup/misc rw,nosuid,nodev,noexec,relatime master:23 - cgroup cgroup rw,misc
	677 666 0:44 /docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime master:24 - cgroup cgroup rw,hugetlb
	678 666 0:45 /docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime master:25 - cgroup cgroup rw,net_cls,net_prio
	679 666 0:46 /docker/c1f3fc37b0d6e5a109c62e861feb4d6fd4ef381bf5a9576e5e7c56da4eca841b /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime master:26 - cgroup cgroup rw,pids
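
The unscoped entries in the listing above can be picked out mechanically. The following is a minimal sketch, not part of this commit: the helper name `list_unscoped_cgroups` is hypothetical, and it assumes the usual `hierarchy-ID:controller-list:path` layout of `/proc/<pid>/cgroup`. It prints every cgroup v1 controller whose path is the root (`/`) and skips the cgroup v2 entry, which has an empty controller list.

```shell
# list_unscoped_cgroups (hypothetical helper, not from the commit):
# read /proc/<pid>/cgroup-formatted text on stdin and print the cgroup v1
# controller lists whose cgroup path is the root ("/").
list_unscoped_cgroups() {
  # Field 2 is the controller list (empty for the cgroup v2 entry),
  # field 3 is the cgroup path of the process in that hierarchy.
  awk -F: '$2 != "" && $3 == "/" { print $2 }'
}
```

Inside a container it could be run as `list_unscoped_cgroups < /proc/self/cgroup`. Fed the output shown above, it would print `misc` and `rdma`. A process that really lives in the root cgroup on the host reports `/` for every controller, so the check is only meaningful from inside a container.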

Now, if a newer runtime (one that is aware of, e.g., the rdma subsystem)
is used inside this container, it may create cgroups under those
subsystems. Since those subsystems are not properly scoped, the new
cgroups leak to the host and become non-removable (EBUSY on rmdir).

The workaround, as implemented here, is to hide (unmount and remove)
such unscoped subsystems.
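
The selection step of this workaround can be tried without unmounting anything. The sketch below is an illustration under assumptions rather than the committed code: `select_unscoped_mounts` is a hypothetical name, it expects `source target` pairs on stdin (the shape that `findmnt -lun -o source,target -t cgroup` prints), and it uses `grep -F` so that the cgroup path is matched as a fixed string instead of a regular expression, which is a deliberate variation.

```shell
# select_unscoped_mounts CURRENT_CGROUP (hypothetical helper, dry run only):
# read "source target" pairs on stdin and print the mount targets whose
# line does not mention CURRENT_CGROUP, i.e. the unscoped hierarchies.
select_unscoped_mounts() {
  # -F: treat the cgroup path literally; -v: keep non-matching lines;
  # --: stop option parsing in case the path starts with a dash.
  grep -vF -- "$1" | awk '{ print $2 }'
}
```

A dry run inside the container might look like `findmnt -lun -o source,target -t cgroup | select_unscoped_mounts "$current_cgroup"`, where `current_cgroup` is assumed to hold the container's own cgroup path. Only the printed mount points would then be passed to `umount` and `rmdir`.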

Fixes kubernetes/kubernetes#109182

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
kolyshkin committed Apr 13, 2022
1 parent 1282325 commit db40a9b
Showing 1 changed file with 22 additions and 0 deletions.
22 changes: 22 additions & 0 deletions images/base/files/usr/local/bin/entrypoint
@@ -217,6 +217,28 @@ fix_cgroup() {
  current_cgroup=$(grep -E '^[^:]*:([^:]*,)?cpu(,[^,:]*)?:.*' /proc/self/cgroup | cut -d: -f3)
  local cgroup_subsystems
  cgroup_subsystems=$(findmnt -lun -o source,target -t cgroup | grep "${current_cgroup}" | awk '{print $2}')
  # Unmount the cgroup subsystems that are not known to runtime used to
  # run the container we are in. Those subsystems are not properly scoped
  # (i.e. the root cgroup is exposed, rather than something like docker/xxxx).
  # In case a runtime (which is aware of more subsystems -- such as rdma,
  # misc, or unified) is used inside the container, it may create cgroups for
  # these subsystems, and as they are not scoped, they will leak to the host
  # and thus will become non-removable.
  #
  # See https://github.com/kubernetes/kubernetes/issues/109182
  local unsupported_cgroups
  unsupported_cgroups=$(findmnt -lun -o source,target -t cgroup | grep -v "${current_cgroup}" | awk '{print $2}')
  if [ -n "$unsupported_cgroups" ]; then
    local mnt
    echo "$unsupported_cgroups" |
    while IFS= read -r mnt; do
      echo "INFO: unmounting and removing $mnt"
      umount "$mnt" || true
      rmdir "$mnt" || true
    done
  fi


  # For each cgroup subsystem, Docker does a bind mount from the current
  # cgroup to the root of the cgroup subsystem. For instance:
  #   /sys/fs/cgroup/memory/docker/<cid> -> /sys/fs/cgroup/memory