cgroup2: monitor OOMKill instead of OOM to prevent missing container events #6323
Conversation
Hi @jepio. Thanks for your PR. I'm waiting for a containerd member to verify that this patch is reasonable to test. Once the patch is verified, the new status will be reflected. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Build succeeded.
Force-pushed from 1c61c96 to be5fe70.
Build succeeded.
Windows flaked. /retest
@jepio: Cannot trigger testing until a trusted user reviews the PR and approves it for testing.
Hi, is anyone willing to review this?
The change looks good to me, but I don't know much about cgroups. Also, does it make sense to add some tests around this?
@AkihiroSuda - Can you take a look?
Shouldn't this be emitted as a different event type?
I don't think so - the /tasks/oom event was supposed to mean "task was killed due to OOM". An OOM event caused by a task that does not result in that task being killed is not very interesting, so I would say "oom_kill" is what should have been used all along. But right now this event is not being generated at all in the common case (single container in a pod), as described in the PR. This is a consequence of limits being evaluated from the root down, with the pod (slice) having the same limit as the container (scope): the pod is OOM, but the container gets killed as a consequence.
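The asymmetry described above can be illustrated by parsing `memory.events` for the pod slice and the container scope. The counter values below are hypothetical, but follow the layout described in this comment: the kernel charges `oom` only to the slice, while `oom_kill` shows up in both.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseMemoryEvents parses the flat key/value format of a cgroup v2
// memory.events file (e.g. "oom 1\noom_kill 1\n") into a map.
func parseMemoryEvents(contents string) map[string]uint64 {
	events := map[string]uint64{}
	sc := bufio.NewScanner(strings.NewReader(contents))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue
		}
		n, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			continue
		}
		events[fields[0]] = n
	}
	return events
}

func main() {
	// Hypothetical counters after one OOM kill, with the pod slice and
	// container scope sharing the same memory.max.
	slice := parseMemoryEvents("low 0\nhigh 0\nmax 4\noom 1\noom_kill 1\n")
	scope := parseMemoryEvents("low 0\nhigh 0\nmax 4\noom 0\noom_kill 1\n")

	// containerd monitors the container scope, so watching "oom" there
	// misses the kill; watching "oom_kill" catches it.
	fmt.Println("scope oom:", scope["oom"])           // prints 0 -> event missed
	fmt.Println("scope oom_kill:", scope["oom_kill"]) // prints 1 -> event seen
	fmt.Println("slice oom:", slice["oom"])           // prints 1
}
```

In a real system these strings would come from `/sys/fs/cgroup/.../memory.events`; the parsing shown here is a sketch, not containerd's actual code.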
…OOM events

With the cgroupv2 configuration employed by Kubernetes, the pod cgroup (slice) and container cgroup (scope) both have the same memory limit applied. In that situation, the kernel considers an OOM event to be triggered by the parent cgroup (slice) and increments 'oom' there; the child cgroup (scope) only sees an 'oom_kill' increment. Since we monitor child cgroups for oom events, check the OOMKill field so that we don't miss events.

This is not visible when running containers through docker or ctr, because they set the limits differently (only at the container level). An alternative would be to not configure limits at the pod level - that way the container limit would be hit and the OOM would be correctly generated.

An interesting consequence is that when spawning a pod with multiple containers, the oom events also work correctly, because:

a) if one of the containers has no limit, the pod has no limit, so OOM events in another container report correctly.
b) if all of the containers have limits, then the pod limit will be the sum of the container limits, so a container will be able to hit its own limit first.

Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
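The fix this commit message describes boils down to keying event generation on the `oom_kill` counter rather than `oom`. A minimal sketch of that check (the names here are illustrative, not containerd's actual identifiers):

```go
package main

import "fmt"

// memoryEvents mirrors the counters in a cgroup v2 memory.events file.
type memoryEvents struct {
	OOM     uint64
	OOMKill uint64
}

// shouldEmitOOMEvent reports whether an OOM event should be generated,
// by checking whether oom_kill advanced since the previous read. Keying
// on OOMKill instead of OOM catches kills whose "oom" event the kernel
// attributed to the parent pod slice rather than the container scope.
func shouldEmitOOMEvent(prev, cur memoryEvents) bool {
	return cur.OOMKill > prev.OOMKill
}

func main() {
	prev := memoryEvents{OOM: 0, OOMKill: 0}
	// After the kill described in the PR: the container scope sees only
	// oom_kill increment, while its oom counter stays at 0.
	cur := memoryEvents{OOM: 0, OOMKill: 1}

	fmt.Println("old check (oom):", cur.OOM > prev.OOM)                 // prints false -> missed
	fmt.Println("new check (oom_kill):", shouldEmitOOMEvent(prev, cur)) // prints true
}
```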
Force-pushed from be5fe70 to 7275411.
https://github.com/containerd/containerd/blob/main/pkg/cri/server/events.go#L333-L349
@AkihiroSuda, do you find my reasoning convincing?
What is the state of this PR? This makes the Vertical Pod Autoscaler (Kubernetes) fail to react to OOMs correctly, so it potentially has a big impact on Kubernetes users.
Will take a look tomorrow~
@jepio @AkihiroSuda This change looks good to me. This Linux patch updates the cgroupv2 oom description: https://lore.kernel.org/lkml/20181004214050.7417-1-guro@fb.com/T/. I think oom_kill would be more compatible with cgroupv1.
When running on cgroup2, in a 1-container-pod-with-memory-limit configuration, no `/tasks/oom` events are currently generated. This is because the pod cgroup and container cgroup both get the same `memory.max` setting, and the `oom` event gets counted against the pod cgroup, while containerd monitors container cgroups for oom events. Fix that by monitoring `oom_kill` instead and reporting that; `oom_kill` events are counted against both the pod and container cgroups.

My test case was the following kubernetes manifest:
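The original manifest is not preserved in this capture. Illustratively, a pod along these lines matches the scenario described (a single container with a memory limit, so kubelet applies the same `memory.max` to both the pod slice and the container scope); this is not the author's exact manifest:

```yaml
# Illustrative only: single-container pod with a memory limit.
apiVersion: v1
kind: Pod
metadata:
  name: oom-test
spec:
  restartPolicy: Never
  containers:
  - name: stress
    image: busybox
    command: ["sh", "-c", "tail /dev/zero"]  # grows memory until OOM-killed
    resources:
      limits:
        memory: "64Mi"
```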
Related issue: k3s-io/k3s#4572