Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cgroup2: devices filtering cleanup #2951

Merged
merged 6 commits into from May 24, 2021
Merged

cgroup2: devices filtering cleanup #2951

merged 6 commits into from May 24, 2021

Conversation

cyphar
Copy link
Member

@cyphar cyphar commented May 12, 2021

This is my attempt at solving both #2797 and #2366. I'm not entirely sure if the behaviour I've implemented if a cgroup has more than one filter installed (detach all old filters, so that only the new one is effective) is quite correct, but it's necessary to fix the program leak problem (outside of switching to BPF pins entirely).

NOTE: I wanted to add some unit tests for our devicefilter code, but unfortunately BPF_PROG_TEST_RUN doesn't support cgroup programs at the moment. I would have to push some code upstream for that to happen (though I'll look into that next week).


Changelog entry:

cgroup2: devices: rework the filter generation to produce consistent results with cgroupv1, and always clobber any existing eBPF program(s) to fix `runc update` and avoid leaking eBPF programs (resulting in errors when managing containers). 

Signed-off-by: Aleksa Sarai cyphar@cyphar.com

@kolyshkin
Copy link
Contributor

A typo in the second commit message: s/speciifc/specific/.

@cyphar
Copy link
Member Author

cyphar commented May 13, 2021

We also can't test the runc update behaviour because it turns out runc update doesn't support cgroup device limits... I tried adding support for it, but the behaviour is a bit weird because of the default allowed devices... Hmmm....

@cyphar cyphar requested a review from kolyshkin May 15, 2021 02:34
@kolyshkin
Copy link
Contributor

We also can't test the runc update behaviour because it turns out runc update doesn't support cgroup device limits... I tried adding support for it, but the behaviour is a bit weird because of the default allowed devices... Hmmm....

Yeah, update.go basically repeats specconv, a year ago I tried to reuse the specconv in update but got nowhere...

Maybe we can at least have a simple test case added to libcontainer/integration, which removes and re-adds a device allow rule say 100 times (and check that device access works as expected). Without this PR we should fail on 64th iteration due to eBPF progs limit reached.

Same for runc update -- we could have a stupid test calling runc update 100 times (without actually changing anything) but I am not sure if such test makes sense if we implement what's described in the previous paragraph.

kolyshkin
kolyshkin previously approved these changes May 15, 2021
Copy link
Contributor

@kolyshkin kolyshkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except for the missing integration test(s).

@cyphar
Copy link
Member Author

cyphar commented May 15, 2021

Yeah looking at it I think we would need to rethink the "default allow" device list behaviour if we ever want to add devices support to runc update (otherwise users will accidentally brick their containers when using it). I'll cook up a test on Monday.

@cyphar
Copy link
Member Author

cyphar commented May 18, 2021

Seems as though the Microsoft PPA is down?

@kolyshkin
Copy link
Contributor

CI restarted (looks like it's working today).

@kolyshkin
Copy link
Contributor

CI restarted; GHA still has some issues :(

E: Failed to fetch https://packages.microsoft.com/ubuntu/20.04/prod/dists/focal/main/binary-amd64/Packages.bz2 File has unexpected size (73402 != 73160). Mirror sync in progress? [IP: 104.42.185.173 443]

@cyphar
Copy link
Member Author

cyphar commented May 19, 2021

CI passes again now. 😸

kolyshkin
kolyshkin previously approved these changes May 19, 2021
Copy link
Contributor

@kolyshkin kolyshkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kolyshkin
Copy link
Contributor

@AkihiroSuda @mrunalp PTAL

runc run -d --console-socket "$CONSOLE_SOCKET" test_update
[ "$status" -eq 0 ]

for new_limit in $(seq 300); do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seq might cause extra alloc compared to for ((i = 0; i < 300; i++)), but negligible

@@ -648,3 +648,29 @@ EOF
runc resume test_update
[ "$status" -eq 0 ]
}

@test "runc update replaces devices cgroup program" {
[[ "$ROOTLESS" -ne 0 ]] && requires rootless_cgroup
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test is NOP in rootless, right? Can we have a comment line to explain the expected behavior?

Copy link
Member Author

@cyphar cyphar May 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also no-op for cgroupv1... Yeah I'll add a comment.

retries++
continue
}
return nil, fmt.Errorf("bpf_prog_query(BPF_CGROUP_DEVICE) failed: %w", errno)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may get EINTR wand want to retry?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bpf(2) is considered EINTR-safe. Can not quote any sources though :(

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems that you can get EINTR from bpf (especially in Go thanks to green thread preemption). cilium/ebpf#24

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This commit saying that it is only about BPF_PROG_TEST_RUN (and so I was wrong about bpf being EINTR-safe), and this is the only bpf call that cilium/ebpf wraps in retry-on-eintr loop.

In some other place (cilium/ebpf#207 (comment)) they deduce that bpf syscall with some other parameters could not return EINTR, so maybe this behavior is limited to BPF_PROG_TEST_RUN.

OTOH it won't hurt to add retry-on-EINTR loop, something like:

for {
	_, _, errno := unix.Syscall(unix.SYS_BPF,
		uintptr(unix.BPF_PROG_QUERY),
		uintptr(unsafe.Pointer(&query)),
		unsafe.Sizeof(query))
	if errno != unix.EINTR {
		break
	}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found another case -- they retry on bpf(BPF_PROG_LOAD) but on EAGAIN not EINTR.

So I guess we don't need a EINTR handler here.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
When running inside a Docker container, systemd is not available. The
new TestFdLeaksSystemd forgot to include the relevant t.Skip section.

Fixes: a7feb42 ("libct/int: add TestFdLeaksSystemd")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
The devices cgroup emulator is also useful for removing unneeded rules
as well as computing what the final default-allow state of the filter
will be (allow-list or deny-list).

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
There were several issues with the previous cgroupv2 devices filter
generator implementation, stemming from the previous implementation
using a few too many tricks to implement the correct cgroup behaviour
(rules were handled in reverse order, with wildcards having particularly
special interpretations). As a result, some slightly odd configurations
with rules in specific orders could result in incorrect filters being
generated.

By switching to the emulator which is already used by cgroupv1, we can
guarantee that the behaviour of filters in both cgroup versions will be
identical, as well as making use of the hardenings in the emulator (not
allowing users to add deny rules the kernel will ignore).

(Note that because the ordering of the devices emulator rules is
deterministic and based on the rule value, the existing test rules had
to be reordered slightly.)

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
In the normal cases (only one existing filter or no existing filters),
just make use of BPF_F_REPLACE if there is one existing filter. However
if there is more than one filter applied, we should probably remove all
other filters since the alternative is that we will never remove our old
filters.

The only two other viable ways of solving this problem would be to use
BPF pins to either pin the eBPF program using a predictable name (so we
can always only replace *our* programs) or to switch away from custom
programs and instead use eBPF maps (which are pinned) and thus we just
update the map conntents to update the ruleset. Unfortunately these both
would add a hard requirement of bpffs and would require at least a minor
rewrite of the eBPF filtering code -- which is better left for another
time.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
This is to ensure that we aren't leaking eBPF programs after "runc
update". Unfortunately we cannot directly test the behaviour of cgroup
program updates in an integration test because "runc update" doesn't
support that behaviour at the moment.

So instead we rely on the fact that each "runc update" implicitly
triggers the devices rules to be updated. Without the previous patches
applied, this new test will fail with errors (on cgroupv2 systems).

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
@@ -11,6 +11,7 @@ import (
"strconv"

"github.com/cilium/ebpf/asm"
devicesemulator "github.com/opencontainers/runc/libcontainer/cgroups/devices"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: devicesEmulator

Copy link
Contributor

@kolyshkin kolyshkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

retries++
continue
}
return nil, fmt.Errorf("bpf_prog_query(BPF_CGROUP_DEVICE) failed: %w", errno)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This commit saying that it is only about BPF_PROG_TEST_RUN (and so I was wrong about bpf being EINTR-safe), and this is the only bpf call that cilium/ebpf wraps in retry-on-eintr loop.

In some other place (cilium/ebpf#207 (comment)) they deduce that bpf syscall with some other parameters could not return EINTR, so maybe this behavior is limited to BPF_PROG_TEST_RUN.

OTOH it won't hurt to add retry-on-EINTR loop, something like:

for {
	_, _, errno := unix.Syscall(unix.SYS_BPF,
		uintptr(unix.BPF_PROG_QUERY),
		uintptr(unsafe.Pointer(&query)),
		unsafe.Sizeof(query))
	if errno != unix.EINTR {
		break
	}
}

retries++
continue
}
return nil, fmt.Errorf("bpf_prog_query(BPF_CGROUP_DEVICE) failed: %w", errno)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found another case -- they retry on bpf(BPF_PROG_LOAD) but on EAGAIN not EINTR.

So I guess we don't need a EINTR handler here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants