
cgroup: devices updates appear to be broken #2366

Closed · 1 of 2 tasks · cyphar opened this issue Apr 30, 2020 · 19 comments

@cyphar (Member) commented Apr 30, 2020

This affects both cgroup versions, but in quite different ways:

  • For cgroupv1, Don't deny all devices when update cgroup resource #2205 highlighted that on device cgroup updates, we temporarily block all devices. This results in spurious errors in the container (such as programs being unable to open /dev/null). We've seen this happen on customer systems under Kubernetes, so this is definitely a real issue.

    • This is a more complicated issue than it first appears, because runc actually implements the spec incorrectly here -- technically runc is a black-list by default, and users have to explicitly convert it to a white-list. Aside from not following the spec, this is a worrying security stance.
  • For cgroupv2, devices cgroup updates are implemented by appending a new BPF program to the cgroup. This means that only new denials have an effect, and thus it's incorrectly implemented. (EDIT: This also means that we "leak" eBPF programs and thus after 64+ applications we start getting errors -- see api, cgroupv2: skip setting the devices cgroup #2474.)

    • Unfortunately this is a bit complicated to fix, but I have figured out how to do it. We need to make an eBPF map of type BPF_MAP_TYPE_PROG_ARRAY and then tail-call into it from a small stub eBPF program which we attach to the actual cgroup. This will allow us to atomically update the devices cgroup rules (there is no way to atomically replace an eBPF program with BPF_F_ALLOW_MULTI -- and without any program, all device accesses would be permitted).
    • Ignore the above -- you cannot bpf_tail_call from cgroup programs. So we will instead need to implement it through an eBPF map, which we can atomically replace by mis-using BPF_MAP_TYPE_ARRAY_OF_MAPS (see the sketch after this list).
    • This is all slightly complicated by the fact that the entire API is fd-based (and we don't have our own monitor process, so we can't stash away the fd). Luckily there is a lookup-by-id system which we can use to get the file descriptor back, though ids can be recycled, so we'll need to be careful not to start touching the wrong eBPF map -- and unlike eBPF programs, eBPF maps don't store information about when they were created.
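
For illustration, here is a minimal Go sketch of that map-of-maps swap using the github.com/cilium/ebpf library (which runc already uses for device filters). The rule encoding, map sizes, and helper names are assumptions for the sketch, not runc's actual implementation:

```go
package devicefilter

import "github.com/cilium/ebpf"

// newRuleMap builds a fresh inner array map holding the encoded device
// rules (the 8-byte encoding is a placeholder for this sketch).
func newRuleMap(rules [][8]byte) (*ebpf.Map, error) {
	inner, err := ebpf.NewMap(&ebpf.MapSpec{
		Type:       ebpf.Array,
		KeySize:    4, // u32 index
		ValueSize:  8, // one encoded rule
		MaxEntries: 64,
	})
	if err != nil {
		return nil, err
	}
	for i, r := range rules {
		if err := inner.Put(uint32(i), r[:]); err != nil {
			inner.Close()
			return nil, err
		}
	}
	return inner, nil
}

// updateRules re-points slot 0 of the outer BPF_MAP_TYPE_ARRAY_OF_MAPS at
// a freshly built inner map. The kernel swaps the map pointer atomically,
// so the stub program attached to the cgroup never sees a half-written
// rule set, and there is never a window with no rules at all.
func updateRules(outer *ebpf.Map, rules [][8]byte) error {
	inner, err := newRuleMap(rules)
	if err != nil {
		return err
	}
	// The outer map holds its own reference once the update succeeds,
	// so our fd can be closed afterwards.
	defer inner.Close()
	return outer.Put(uint32(0), inner)
}
```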

Part of #2315.

@cyphar (Member, Author) commented Apr 30, 2020

Fixing the white-list problem actually isn't too bad -- we just need to re-calculate the device list (merging entries which overlap with each other) rather than blindly applying the provided list of devices in order. Basically, we should pre-calculate the minimal devices.allow and devices.deny lists (see the sketch below).
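
As a rough illustration of the idea (not runc's actual algorithm, which also has to handle wildcard and 'a' rules), a sketch that folds the ordered rule list so only the final decision per device/access key survives, then splits the survivors into the two lists:

```go
package devices

// deviceRule is a simplified stand-in for the spec's device cgroup entry.
type deviceRule struct {
	Type         rune   // 'a', 'b', or 'c'
	Major, Minor int64  // -1 means "all"
	Access       string // some subset of "rwm"
	Allow        bool
}

type ruleKey struct {
	Type         rune
	Major, Minor int64
	Access       string
}

// minimalRules folds the ordered list so that the last rule for each key
// wins, then splits the result into devices.allow / devices.deny lists
// that can be applied without a blanket "deny all" in between.
func minimalRules(ordered []deviceRule) (allow, deny []deviceRule) {
	last := make(map[ruleKey]deviceRule)
	var keys []ruleKey // preserve first-seen order for stable output
	for _, r := range ordered {
		k := ruleKey{r.Type, r.Major, r.Minor, r.Access}
		if _, seen := last[k]; !seen {
			keys = append(keys, k)
		}
		last[k] = r // later rules win for the same device/access
	}
	for _, k := range keys {
		if r := last[k]; r.Allow {
			allow = append(allow, r)
		} else {
			deny = append(deny, r)
		}
	}
	return allow, deny
}
```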

@cyphar (Member, Author) commented Apr 30, 2020

And actually this brings up a different question -- what does runc update mean for devices? Does it mean that we should replace the list, or update the existing list? This is a slightly open question because devices is the only cgroup with this behaviour. I think replacing the list is simpler, because updating the existing list would be harder to implement for cgroupv2 (we'd need to fetch the resource limits saved in the container state file -- which is something I want to kill off eventually, so adding a new requirement for it seems like a bad idea).

@giuseppe (Member) commented May 6, 2020

I've played with it in crun, and I think we can use pinning to the bpf file system to get back an fd to the previous eBPF program.

What do you think of the solution here: containers/crun#352 ?
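
In Go terms (runc side, rather than crun's C implementation), the pinning round-trip would be something like the following sketch with github.com/cilium/ebpf; the pin path layout is an assumption:

```go
package devicefilter

import "github.com/cilium/ebpf"

// pinProg pins an already-attached device filter program to bpffs (e.g.
// under /sys/fs/bpf/runc/<container-id>) so a later runc invocation can
// recover an fd for it without a long-running process holding one open.
func pinProg(prog *ebpf.Program, pinPath string) error {
	return prog.Pin(pinPath)
}

// loadPinnedProg reopens the previously attached program by path; the fd
// it returns can then be used as the "old program" when detaching or
// replacing the filter.
func loadPinnedProg(pinPath string) (*ebpf.Program, error) {
	return ebpf.LoadPinnedProgram(pinPath, nil)
}
```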

@cyphar (Member, Author) commented May 8, 2020

Pinning is another fine solution to the issue (as much as I'm not a fan of how ugly the pinning system is). Another choice would be to use the lookup-by-id system (see the sketch below), but then we'd have to deal with IDs being able to overflow and be recycled.

But while BPF_F_REPLACE is a good solution for 5.6+ kernels, for older kernels we need another solution -- my idea was to use a BPF_MAP_TYPE_ARRAY_OF_MAPS so that we can atomically swap the rules. That way you can actually install the same program for all cgroups, and each one just looks up the rules for its cgroup in a per-container eBPF map. A nice benefit of doing it that way is that userspace can actually read the current set of rules.
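
For comparison, a sketch of the lookup-by-id route with github.com/cilium/ebpf, assuming the kernel-assigned program ID and tag were recorded when the filter was first attached; the tag comparison is one way to guard against the ID-recycling problem mentioned above:

```go
package devicefilter

import (
	"fmt"

	"github.com/cilium/ebpf"
)

// reopenByID turns a saved kernel program ID back into an fd. Because IDs
// can be recycled once a program is unloaded, we double-check the program
// tag recorded at load time before trusting the result.
func reopenByID(savedID uint32, savedTag string) (*ebpf.Program, error) {
	prog, err := ebpf.NewProgramFromID(ebpf.ProgramID(savedID))
	if err != nil {
		return nil, err
	}
	info, err := prog.Info()
	if err != nil {
		prog.Close()
		return nil, err
	}
	if info.Tag != savedTag {
		prog.Close()
		return nil, fmt.Errorf("program ID %d was recycled", savedID)
	}
	return prog, nil
}
```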

@cyphar (Member, Author) commented Jul 3, 2020

The cgroupv1 portion of this issue was fixed by #2391.

@AkihiroSuda (Member) changed the milestone from rc92 to 1.0

@AkihiroSuda (Member) commented

@cyphar Can we postpone this to v1.1.0?

@cyphar (Member, Author) commented Feb 4, 2021

cgroupv2 device updates are probably fine for now, especially since we use systemd to manage cgroupv2 (and while systemd isn't perfect it does handle updates reasonably). cgroupv1 was the more important one and we've already fixed that.

@cyphar modified the milestones: 1.0.0, post-1.0 (Feb 4, 2021)
@kolyshkin (Contributor) commented Feb 10, 2021

> This also means that we "leak" eBPF programs

We also leak a file descriptor on every ebpf.LoadAttachCgroupDeviceFilter invocation (IOW, on every container.Run()). If runc is used as a library by a long-running daemon via libcontainer, this is problematic. I wrote a test case that currently adds this leaked fd as an exclusion (see #2802).

@kolyshkin (Contributor) commented

> what does runc update mean for devices? Does it mean that we should replace the list, or update the existing list? This is a slightly open question because devices is the only cgroup with this behaviour. I think replacing the list is simpler, because updating the existing list would be harder to implement for cgroupv2 (we'd need to fetch the resource limits saved in the container state file -- which is something I want to kill off eventually, so adding a new requirement for it seems like a bad idea).

Indeed, the update functionality is not spelled out well enough in the spec. Ideally, for devices there should be some kind of "append or replace" flag (and I also hope that one day we'll be able to skip reading all the resources from the state file on update, at least if "replace" is specified for devices).

If that's too complicated or redundant, we can say "runc only implements replace for devices" (if the spec will allow us to).

@borgerli commented

@cyphar We're planning to run k8s+containerd with cgroup v2 in production, but this work is suspended due to this issue.

Do you have an estimate for when the cgroup v2 issue will be fixed? Many thanks.

@giuseppe (Member) commented

> @cyphar We're planning to run k8s+containerd with cgroup v2 in production, but this work is suspended due to this issue.

please be aware that k8s still has some issues with cgroup v2 (e.g., kubernetes/kubernetes#99230)

@xiaoxubeii commented Apr 1, 2021

> > @cyphar We're planning to run k8s+containerd with cgroup v2 in production, but this work is suspended due to this issue.
>
> please be aware that k8s still has some issues with cgroup v2 (e.g., kubernetes/kubernetes#99230)

Thanks @giuseppe, we will keep watching this and will help out if needed.

@payall4u commented Apr 7, 2021

Hi @cyphar, sorry to bother you. I wonder why we must use the BPF_F_ALLOW_MULTI flag. How about using unix.BPF_F_ALLOW_OVERRIDE instead of BPF_F_ALLOW_MULTI?

```go
if err := prog.Attach(dirFD, ebpf.AttachCGroupDevice, unix.BPF_F_ALLOW_OVERRIDE); err != nil {
	return nilCloser, errors.Wrap(err, "failed to call BPF_PROG_ATTACH (BPF_CGROUP_DEVICE, BPF_F_ALLOW_OVERRIDE)")
}
```

  • BPF_F_ALLOW_OVERRIDE: If a sub-cgroup installs some bpf program, the program in this cgroup yields to sub-cgroup program.

  • BPF_F_ALLOW_MULTI: If a sub-cgroup installs some bpf program, that cgroup program gets run in addition to the program in this cgroup.

  • Only one program is allowed to be attached to a cgroup with the NONE or BPF_F_ALLOW_OVERRIDE flag. Attaching another program on top of NONE or BPF_F_ALLOW_OVERRIDE will release the old program and attach the new one. Attach flags have to match.

  • Multiple programs are allowed to be attached to a cgroup with the BPF_F_ALLOW_MULTI flag. They are executed in FIFO order (those that were attached first run first). The programs of the sub-cgroup are executed first, then the programs of this cgroup, and then the programs of the parent cgroup.

@cyphar (Member, Author) commented Apr 7, 2021

From memory, we need to use BPF_F_ALLOW_MULTI because of nested containers: with BPF_F_ALLOW_OVERRIDE, a program attached by a nested container would cause the outer cgroup's program to yield (per the semantics quoted above), rather than both programs being enforced.

@payall4u commented Apr 8, 2021

> From memory, we need to use BPF_F_ALLOW_MULTI because of nested containers.

Got it, thanks.

BTW, the idea of using a BPF_MAP_TYPE_ARRAY_OF_MAPS map to store the rules sounds great. Is someone working on this? :) @cyphar

@kolyshkin (Contributor) commented

Perhaps we can simplify the task by using BPF_F_REPLACE (and thus requiring kernel v5.6+), falling back to whatever we do now on older kernels (or just not allowing device updates there). A sketch follows below.

It would also help not to call setDevices when the devices have not changed. This requires some API changes, though (the cgroup manager's Set needs to be able to see that the current set and the new set are identical).
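
A minimal sketch of what that BPF_F_REPLACE path could look like via github.com/cilium/ebpf's link package, assuming an fd for the currently attached program has been recovered (via pinning or lookup-by-id, as discussed above); the field names follow that library's API at the time of this discussion, and the flags require a 5.6+ kernel:

```go
package devicefilter

import (
	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
	"golang.org/x/sys/unix"
)

// replaceDeviceFilter swaps oldProg for newProg on the cgroup directory fd
// in a single BPF_PROG_ATTACH call, so there is no window where the cgroup
// has either no filter or both filters attached.
func replaceDeviceFilter(cgroupFD int, oldProg, newProg *ebpf.Program) error {
	return link.RawAttachProgram(link.RawAttachProgramOptions{
		Target:  cgroupFD,
		Program: newProg,
		Replace: oldProg,
		Attach:  ebpf.AttachCGroupDevice,
		Flags:   unix.BPF_F_ALLOW_MULTI | unix.BPF_F_REPLACE,
	})
}
```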

@cyphar (Member, Author) commented Apr 14, 2021

BPF_F_REPLACE has the minor issue that we now need to track which program is the one we applied (either by using the metadata about the program or using the bpf pinning API) -- though the simple solution is to assume that whatever program is attached to the cgroup is the one we want to replace.

@AkihiroSuda (Member) merged #2951
