cgroup2: devices filtering cleanup #2951

cyphar · 2021-05-12T14:39:37Z

This is my attempt at solving both #2797 and #2366. I'm not entirely sure if the behaviour I've implemented if a cgroup has more than one filter installed (detach all old filters, so that only the new one is effective) is quite correct, but it's necessary to fix the program leak problem (outside of switching to BPF pins entirely).

NOTE: I wanted to add some unit tests for our devicefilter code, but unfortunately BPF_PROG_TEST_RUN doesn't support cgroup programs at the moment. I would have to push some code upstream for that to happen (though I'll look into that next week).

Changelog entry:

cgroup2: devices: rework the filter generation to produce consistent results with cgroupv1, and always clobber any existing eBPF program(s) to fix `runc update` and avoid leaking eBPF programs (resulting in errors when managing containers).

Signed-off-by: Aleksa Sarai cyphar@cyphar.com

libcontainer/cgroups/ebpf/devicefilter/devicefilter.go

kolyshkin · 2021-05-12T17:02:43Z

A typo in the second commit message: s/speciifc/specific/.

libcontainer/cgroups/ebpf/ebpf.go

cyphar · 2021-05-13T08:39:58Z

We also can't test the runc update behaviour because it turns out runc update doesn't support cgroup device limits... I tried adding support for it, but the behaviour is a bit weird because of the default allowed devices... Hmmm....

kolyshkin · 2021-05-15T02:54:35Z

We also can't test the runc update behaviour because it turns out runc update doesn't support cgroup device limits... I tried adding support for it, but the behaviour is a bit weird because of the default allowed devices... Hmmm....

Yeah, update.go basically repeats specconv, a year ago I tried to reuse the specconv in update but got nowhere...

Maybe we can at least have a simple test case added to libcontainer/integration, which removes and re-adds a device allow rule say 100 times (and check that device access works as expected). Without this PR we should fail on 64th iteration due to eBPF progs limit reached.

Same for runc update -- we could have a stupid test calling runc update 100 times (without actually changing anything) but I am not sure if such test makes sense if we implement what's described in the previous paragraph.

kolyshkin

LGTM except for the missing integration test(s).

cyphar · 2021-05-15T03:09:26Z

Yeah looking at it I think we would need to rethink the "default allow" device list behaviour if we ever want to add devices support to runc update (otherwise users will accidentally brick their containers when using it). I'll cook up a test on Monday.

cyphar · 2021-05-18T07:26:38Z

Seems as though the Microsoft PPA is down?

tests/integration/update.bats

kolyshkin · 2021-05-18T20:44:27Z

CI restarted (looks like it's working today).

kolyshkin · 2021-05-18T20:46:40Z

CI restarted; GHA still has some issues :(

E: Failed to fetch https://packages.microsoft.com/ubuntu/20.04/prod/dists/focal/main/binary-amd64/Packages.bz2 File has unexpected size (73402 != 73160). Mirror sync in progress? [IP: 104.42.185.173 443]

cyphar · 2021-05-19T03:15:25Z

CI passes again now. 😸

kolyshkin

LGTM

kolyshkin · 2021-05-21T00:01:59Z

@AkihiroSuda @mrunalp PTAL

AkihiroSuda · 2021-05-21T02:07:20Z

tests/integration/update.bats

+	runc run -d --console-socket "$CONSOLE_SOCKET" test_update
+	[ "$status" -eq 0 ]
+
+	for new_limit in $(seq 300); do


seq might cause extra alloc compared to for ((i = 0; i < 300; i++)), but negligible

AkihiroSuda · 2021-05-21T02:08:25Z

tests/integration/update.bats

@@ -648,3 +648,29 @@ EOF
 	runc resume test_update
 	[ "$status" -eq 0 ]
 }
+
+@test "runc update replaces devices cgroup program" {
+	[[ "$ROOTLESS" -ne 0 ]] && requires rootless_cgroup


The test is NOP in rootless, right? Can we have a comment line to explain the expected behavior?

It's also no-op for cgroupv1... Yeah I'll add a comment.

AkihiroSuda · 2021-05-21T02:09:29Z

libcontainer/cgroups/ebpf/ebpf.go

+				retries++
+				continue
+			}
+			return nil, fmt.Errorf("bpf_prog_query(BPF_CGROUP_DEVICE) failed: %w", errno)


We may get EINTR wand want to retry?

bpf(2) is considered EINTR-safe. Can not quote any sources though :(

Seems that you can get EINTR from bpf (especially in Go thanks to green thread preemption). cilium/ebpf#24

This commit saying that it is only about BPF_PROG_TEST_RUN (and so I was wrong about bpf being EINTR-safe), and this is the only bpf call that cilium/ebpf wraps in retry-on-eintr loop.

In some other place (cilium/ebpf#207 (comment)) they deduce that bpf syscall with some other parameters could not return EINTR, so maybe this behavior is limited to BPF_PROG_TEST_RUN.

OTOH it won't hurt to add retry-on-EINTR loop, something like:

for { _, _, errno := unix.Syscall(unix.SYS_BPF, uintptr(unix.BPF_PROG_QUERY), uintptr(unsafe.Pointer(&query)), unsafe.Sizeof(query)) if errno != unix.EINTR { break } }

Found another case -- they retry on bpf(BPF_PROG_LOAD) but on EAGAIN not EINTR.

So I guess we don't need a EINTR handler here.

libcontainer/integration/exec_test.go

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

When running inside a Docker container, systemd is not available. The new TestFdLeaksSystemd forgot to include the relevant t.Skip section. Fixes: a7feb42 ("libct/int: add TestFdLeaksSystemd") Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

The devices cgroup emulator is also useful for removing unneeded rules as well as computing what the final default-allow state of the filter will be (allow-list or deny-list). Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

There were several issues with the previous cgroupv2 devices filter generator implementation, stemming from the previous implementation using a few too many tricks to implement the correct cgroup behaviour (rules were handled in reverse order, with wildcards having particularly special interpretations). As a result, some slightly odd configurations with rules in specific orders could result in incorrect filters being generated. By switching to the emulator which is already used by cgroupv1, we can guarantee that the behaviour of filters in both cgroup versions will be identical, as well as making use of the hardenings in the emulator (not allowing users to add deny rules the kernel will ignore). (Note that because the ordering of the devices emulator rules is deterministic and based on the rule value, the existing test rules had to be reordered slightly.) Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

In the normal cases (only one existing filter or no existing filters), just make use of BPF_F_REPLACE if there is one existing filter. However if there is more than one filter applied, we should probably remove all other filters since the alternative is that we will never remove our old filters. The only two other viable ways of solving this problem would be to use BPF pins to either pin the eBPF program using a predictable name (so we can always only replace *our* programs) or to switch away from custom programs and instead use eBPF maps (which are pinned) and thus we just update the map conntents to update the ruleset. Unfortunately these both would add a hard requirement of bpffs and would require at least a minor rewrite of the eBPF filtering code -- which is better left for another time. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

This is to ensure that we aren't leaking eBPF programs after "runc update". Unfortunately we cannot directly test the behaviour of cgroup program updates in an integration test because "runc update" doesn't support that behaviour at the moment. So instead we rely on the fact that each "runc update" implicitly triggers the devices rules to be updated. Without the previous patches applied, this new test will fail with errors (on cgroupv2 systems). Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

kolyshkin · 2021-05-23T20:57:51Z

libcontainer/cgroups/ebpf/devicefilter/devicefilter.go

@@ -11,6 +11,7 @@ import (
 	"strconv"

 	"github.com/cilium/ebpf/asm"
+	devicesemulator "github.com/opencontainers/runc/libcontainer/cgroups/devices"


nit: devicesEmulator

kolyshkin

LGTM

kolyshkin · 2021-05-24T02:53:05Z

libcontainer/cgroups/ebpf/ebpf.go

+				retries++
+				continue
+			}
+			return nil, fmt.Errorf("bpf_prog_query(BPF_CGROUP_DEVICE) failed: %w", errno)


This commit saying that it is only about BPF_PROG_TEST_RUN (and so I was wrong about bpf being EINTR-safe), and this is the only bpf call that cilium/ebpf wraps in retry-on-eintr loop.

In some other place (cilium/ebpf#207 (comment)) they deduce that bpf syscall with some other parameters could not return EINTR, so maybe this behavior is limited to BPF_PROG_TEST_RUN.

OTOH it won't hurt to add retry-on-EINTR loop, something like:

for { _, _, errno := unix.Syscall(unix.SYS_BPF, uintptr(unix.BPF_PROG_QUERY), uintptr(unsafe.Pointer(&query)), unsafe.Sizeof(query)) if errno != unix.EINTR { break } }

kolyshkin · 2021-05-24T03:01:29Z

libcontainer/cgroups/ebpf/ebpf.go

+				retries++
+				continue
+			}
+			return nil, fmt.Errorf("bpf_prog_query(BPF_CGROUP_DEVICE) failed: %w", errno)


Found another case -- they retry on bpf(BPF_PROG_LOAD) but on EAGAIN not EINTR.

So I guess we don't need a EINTR handler here.