
cgroup: devices updates appear to be broken #2366

Closed · 1 of 2 tasks · cyphar opened this issue Apr 30, 2020 · 19 comments

@cyphar (Member) commented Apr 30, 2020

This affects both cgroup versions, but in quite different ways:

  • For cgroupv1, Don't deny all devices when update cgroup resource #2205 highlighted that on device cgroup updates, we temporarily block all devices. This results in spurious errors in the container (such as programs being unable to open /dev/null). We've seen this happen on customer systems under Kubernetes, so this is definitely a real issue.

    • This is a more complicated issue than it first appears, because runc actually implements the spec incorrectly here -- technically runc is a black-list by default, and users have to explicitly convert it to a white-list. Aside from not following the spec, this is a worrying security stance.
  • For cgroupv2, devices cgroup updates are implemented by appending a new BPF program to the cgroup. This means that only new denials have an effect, and thus it's incorrectly implemented. (EDIT: This also means that we "leak" eBPF programs and thus after 64+ applications we start getting errors -- see api, cgroupv2: skip setting the devices cgroup #2474.)

    • Unfortunately this is a bit complicated to fix, but I have figured out how to do it. We need to make an eBPF map of type BPF_MAP_TYPE_PROG_ARRAY and then tail-call into it from a small stub eBPF program which we attach to the actual cgroup. This will allow us to atomically update the devices cgroup rules (there is no way to atomically replace an eBPF program with BPF_F_ALLOW_MULTI -- and without any program, all device accesses would be permitted).
    • Ignore the above -- you cannot bpf_tail_call from cgroup programs. So we will instead need to implement it through an eBPF map, which we can atomically replace by mis-using BPF_MAP_TYPE_ARRAY_OF_MAPS (see the sketch after this list).
    • This is all slightly complicated by the fact that the entire API is fd-based (and we don't have our own monitor process, so we can't stash away the fd). Luckily there is a lookup-by-id system which we can use to get the file descriptor back, though ids can be recycled, so we'll need to be careful not to start touching the wrong eBPF map -- and unlike eBPF programs, eBPF maps don't store information about when they were created.
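
For illustration, here is a minimal Go sketch of that map-of-maps swap using the github.com/cilium/ebpf library (which runc already uses for device filters). The rule encoding, map sizes, and helper names are assumptions for the sketch, not runc's actual implementation:

```go
package devicefilter

import "github.com/cilium/ebpf"

// newRuleMap builds a fresh inner array map holding the encoded device
// rules (the 8-byte encoding is a placeholder for this sketch).
func newRuleMap(rules [][8]byte) (*ebpf.Map, error) {
	inner, err := ebpf.NewMap(&ebpf.MapSpec{
		Type:       ebpf.Array,
		KeySize:    4, // u32 index
		ValueSize:  8, // one encoded rule
		MaxEntries: 64,
	})
	if err != nil {
		return nil, err
	}
	for i, r := range rules {
		if err := inner.Put(uint32(i), r[:]); err != nil {
			inner.Close()
			return nil, err
		}
	}
	return inner, nil
}

// updateRules re-points slot 0 of the outer BPF_MAP_TYPE_ARRAY_OF_MAPS at
// a freshly built inner map. The kernel swaps the map pointer atomically,
// so the stub program attached to the cgroup never sees a half-written
// rule set, and there is never a window with no rules at all.
func updateRules(outer *ebpf.Map, rules [][8]byte) error {
	inner, err := newRuleMap(rules)
	if err != nil {
		return err
	}
	// The outer map holds its own reference once the update succeeds,
	// so our fd can be closed afterwards.
	defer inner.Close()
	return outer.Put(uint32(0), inner)
}
```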

Part of #2315.

@cyphar (Member, Author) commented Apr 30, 2020

Fixing the white-list problem actually isn't too bad -- we just need to re-calculate the device list (merging entries which overlap with each other) rather than blindly applying the provided list of devices in order. Basically, we should pre-calculate the minimal devices.allow and devices.deny lists (see the sketch below).
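
As a rough illustration of the idea (not runc's actual algorithm, which also has to handle wildcard and 'a' rules), a sketch that folds the ordered rule list so only the final decision per device/access key survives, then splits the survivors into the two lists:

```go
package devices

// deviceRule is a simplified stand-in for the spec's device cgroup entry.
type deviceRule struct {
	Type         rune   // 'a', 'b', or 'c'
	Major, Minor int64  // -1 means "all"
	Access       string // some subset of "rwm"
	Allow        bool
}

type ruleKey struct {
	Type         rune
	Major, Minor int64
	Access       string
}

// minimalRules folds the ordered list so that the last rule for each key
// wins, then splits the result into devices.allow / devices.deny lists
// that can be applied without a blanket "deny all" in between.
func minimalRules(ordered []deviceRule) (allow, deny []deviceRule) {
	last := make(map[ruleKey]deviceRule)
	var keys []ruleKey // preserve first-seen order for stable output
	for _, r := range ordered {
		k := ruleKey{r.Type, r.Major, r.Minor, r.Access}
		if _, seen := last[k]; !seen {
			keys = append(keys, k)
		}
		last[k] = r // later rules win for the same device/access
	}
	for _, k := range keys {
		if r := last[k]; r.Allow {
			allow = append(allow, r)
		} else {
			deny = append(deny, r)
		}
	}
	return allow, deny
}
```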

@cyphar (Member, Author) commented Apr 30, 2020

And actually this brings up a different question -- what does runc update mean for devices? Does it mean that we should replace the list, or update the existing list? This is a slightly open question because devices is the only cgroup with this behaviour. I think replacing the list is simpler, because updating the existing list would be harder to implement for cgroupv2 (we'd need to fetch the resource limits saved in the container state file -- which is something I want to kill off eventually, so adding a new requirement for it seems like a bad idea).

@giuseppe (Member) commented May 6, 2020

I've played with it in crun, and I think we can use pinning to the bpf file system to get back an fd to the previous eBPF program.

What do you think of the solution here: containers/crun#352 ?
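
In Go terms (runc side, rather than crun's C implementation), the pinning round-trip would be something like the following sketch with github.com/cilium/ebpf; the pin path layout is an assumption:

```go
package devicefilter

import "github.com/cilium/ebpf"

// pinProg pins an already-attached device filter program to bpffs (e.g.
// under /sys/fs/bpf/runc/<container-id>) so a later runc invocation can
// recover an fd for it without a long-running process holding one open.
func pinProg(prog *ebpf.Program, pinPath string) error {
	return prog.Pin(pinPath)
}

// loadPinnedProg reopens the previously attached program by path; the fd
// it returns can then be used as the "old program" when detaching or
// replacing the filter.
func loadPinnedProg(pinPath string) (*ebpf.Program, error) {
	return ebpf.LoadPinnedProgram(pinPath, nil)
}
```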

@cyphar (Member, Author) commented May 8, 2020

Pinning is another fine solution to the issue (as much as I'm not a fan of how ugly the pinning system is). Another choice would be to use the lookup-by-id system (see the sketch below), but then we'd have to deal with IDs being able to overflow and be recycled.

But while BPF_F_REPLACE is a good solution for 5.6+ kernels, for older kernels we need another solution -- my idea was to use a BPF_MAP_TYPE_ARRAY_OF_MAPS so that we can atomically swap the rules. That way you can actually install the same program for all cgroups, and each one just looks up the rules for its cgroup in a per-container eBPF map. A nice benefit of doing it that way is that userspace can actually read the current set of rules.
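
For comparison, a sketch of the lookup-by-id route with github.com/cilium/ebpf, assuming the kernel-assigned program ID and tag were recorded when the filter was first attached; the tag comparison is one way to guard against the ID-recycling problem mentioned above:

```go
package devicefilter

import (
	"fmt"

	"github.com/cilium/ebpf"
)

// reopenByID turns a saved kernel program ID back into an fd. Because IDs
// can be recycled once a program is unloaded, we double-check the program
// tag recorded at load time before trusting the result.
func reopenByID(savedID uint32, savedTag string) (*ebpf.Program, error) {
	prog, err := ebpf.NewProgramFromID(ebpf.ProgramID(savedID))
	if err != nil {
		return nil, err
	}
	info, err := prog.Info()
	if err != nil {
		prog.Close()
		return nil, err
	}
	if info.Tag != savedTag {
		prog.Close()
		return nil, fmt.Errorf("program ID %d was recycled", savedID)
	}
	return prog, nil
}
```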

@cyphar (Member, Author) commented Jul 3, 2020

The cgroupv1 portion of this issue was fixed by #2391.

@AkihiroSuda (Member) changed the milestone from rc92 to 1.0

@AkihiroSuda (Member) commented

@cyphar Can we postpone this to v1.1.0?

@cyphar (Member, Author) commented Feb 4, 2021

cgroupv2 device updates are probably fine for now, especially since we use systemd to manage cgroupv2 (and while systemd isn't perfect it does handle updates reasonably). cgroupv1 was the more important one and we've already fixed that.

@cyphar modified the milestones: 1.0.0, post-1.0 (Feb 4, 2021)
@kolyshkin (Contributor) commented Feb 10, 2021

> This also means that we "leak" eBPF programs

We also leak a file descriptor on every ebpf.LoadAttachCgroupDeviceFilter invocation (IOW, on every container.Run()). If runc is used as a library by a long-running daemon via libcontainer, this is problematic. I wrote a test case that currently adds this leaked fd as an exclusion (see #2802).

@kolyshkin (Contributor) commented

> what does runc update mean for devices? Does it mean that we should replace the list, or update the existing list? This is a slightly open question because devices is the only cgroup with this behaviour. I think replacing the list is simpler, because updating the existing list would be harder to implement for cgroupv2 (we'd need to fetch the resource limits saved in the container state file -- which is something I want to kill off eventually, so adding a new requirement for it seems like a bad idea).

Indeed, the update functionality is not spelled out well enough in the spec. Ideally, for devices there should be some kind of "append or replace" flag (and I also hope that one day we'll be able to skip reading all the resources from the state file on update, at least if "replace" is specified for devices).

If that's too complicated or redundant, we can say "runc only implements replace for devices" (if the spec will allow us to).

@borgerli commented

@cyphar We're planning to run k8s+containerd with cgroup v2 in production, but this work is suspended due to this issue.

Do you have an estimate for when the cgroup v2 issue will be fixed? Many thanks.

@giuseppe (Member) commented

> @cyphar We're planning to run k8s+containerd with cgroup v2 in production, but this work is suspended due to this issue.

please be aware that k8s still has some issues with cgroup v2 (e.g., kubernetes/kubernetes#99230)

@xiaoxubeii commented Apr 1, 2021

> > @cyphar We're planning to run k8s+containerd with cgroup v2 in production, but this work is suspended due to this issue.
>
> please be aware that k8s still has some issues with cgroup v2 (e.g., kubernetes/kubernetes#99230)

Thanks @giuseppe, we will keep watching this and will help out if needed.

@payall4u commented Apr 7, 2021

Hi @cyphar, sorry to bother you. I wonder why we must use the BPF_F_ALLOW_MULTI flag. How about using unix.BPF_F_ALLOW_OVERRIDE instead of BPF_F_ALLOW_MULTI?

```go
if err := prog.Attach(dirFD, ebpf.AttachCGroupDevice, unix.BPF_F_ALLOW_OVERRIDE); err != nil {
	return nilCloser, errors.Wrap(err, "failed to call BPF_PROG_ATTACH (BPF_CGROUP_DEVICE, BPF_F_ALLOW_OVERRIDE)")
}
```

  • BPF_F_ALLOW_OVERRIDE: If a sub-cgroup installs some bpf program, the program in this cgroup yields to sub-cgroup program.

  • BPF_F_ALLOW_MULTI: If a sub-cgroup installs some bpf program, that cgroup program gets run in addition to the program in this cgroup.

  • Only one program is allowed to be attached to a cgroup with the NONE or BPF_F_ALLOW_OVERRIDE flag. Attaching another program on top of NONE or BPF_F_ALLOW_OVERRIDE will release the old program and attach the new one. Attach flags have to match.

  • Multiple programs are allowed to be attached to a cgroup with the BPF_F_ALLOW_MULTI flag. They are executed in FIFO order (those that were attached first run first). The programs of the sub-cgroup are executed first, then the programs of this cgroup, and then the programs of the parent cgroup.

@cyphar (Member, Author) commented Apr 7, 2021

From memory, we need to use BPF_F_ALLOW_MULTI because of nested containers: with BPF_F_ALLOW_OVERRIDE, a program attached by a nested container would cause the outer cgroup's program to yield (per the semantics quoted above), rather than both programs being enforced.

@payall4u commented Apr 8, 2021

> From memory, we need to use BPF_F_ALLOW_MULTI because of nested containers.

Got it, thanks.

BTW, the idea of using a BPF_MAP_TYPE_ARRAY_OF_MAPS map to store the rules sounds great. Is someone working on this? :) @cyphar

@kolyshkin (Contributor) commented

Perhaps we can simplify the task by using BPF_F_REPLACE (and thus requiring kernel v5.6+), falling back to whatever we do now on older kernels (or just not allowing device updates there). A sketch follows below.

It would also help not to call setDevices when the devices have not changed. This requires some API changes, though (the cgroup manager's Set needs to be able to see that the current set and the new set are identical).
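
A minimal sketch of what that BPF_F_REPLACE path could look like via github.com/cilium/ebpf's link package, assuming an fd for the currently attached program has been recovered (via pinning or lookup-by-id, as discussed above); the field names follow that library's API at the time of this discussion, and the flags require a 5.6+ kernel:

```go
package devicefilter

import (
	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
	"golang.org/x/sys/unix"
)

// replaceDeviceFilter swaps oldProg for newProg on the cgroup directory fd
// in a single BPF_PROG_ATTACH call, so there is no window where the cgroup
// has either no filter or both filters attached.
func replaceDeviceFilter(cgroupFD int, oldProg, newProg *ebpf.Program) error {
	return link.RawAttachProgram(link.RawAttachProgramOptions{
		Target:  cgroupFD,
		Program: newProg,
		Replace: oldProg,
		Attach:  ebpf.AttachCGroupDevice,
		Flags:   unix.BPF_F_ALLOW_MULTI | unix.BPF_F_REPLACE,
	})
}
```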

@cyphar (Member, Author) commented Apr 14, 2021

BPF_F_REPLACE has the minor issue that we now need to track which program is the one we applied (either by using the metadata about the program or using the bpf pinning API) -- though the simple solution is to assume that whatever program is attached to the cgroup is the one we want to replace.

@AkihiroSuda (Member) merged #2951
