
cgroup2: monitor OOMKill instead of OOM to prevent missing container events #6323

Merged: 1 commit merged into containerd:main from jepio/fix-cgroupv2-oom-event on Mar 24, 2022

Conversation

@jepio (Contributor) commented Dec 3, 2021

When running on cgroup2, in a 1-container-pod-with-memory-limit configuration, no /tasks/oom events are currently generated. This is because the pod cgroup and the container cgroup both get the same memory.max setting, and the oom event is counted against the pod cgroup, while containerd monitors container cgroups for oom events. Fix that by monitoring oom_kill instead and reporting it; oom_kill events are counted against both the pod and container cgroups (see the verification sketch below the manifest).

My test case was the following Kubernetes manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: stress
  name: stress
spec:
  replicas: 1
  selector:
    matchLabels:
      app: stress
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: stress
    spec:
      containers:
      - image: progrium/stress
        resources:
          limits:
            memory: "256Mi"
        name: stress
        args:
        - "--vm-bytes"
        - "128m"
        - "--vm"
        - "10"
status: {}

Related issue: k3s-io/k3s#4572
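For illustration (not part of this PR), here is a minimal Go sketch that prints the oom and oom_kill counters from memory.events for whatever cgroup v2 directories are passed on the command line, e.g. the pod slice and the container scope of the stress deployment above; the paths in the comment are only examples of one possible layout. Running it after the stress container is killed should show oom incremented only on the pod slice, while oom_kill is incremented on both, which is exactly why watching oom on the container cgroup misses the event.

package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// readMemoryEvents parses a cgroup v2 memory.events file into a
// counter-name -> value map (low, high, max, oom, oom_kill).
func readMemoryEvents(cgroupDir string) (map[string]uint64, error) {
	f, err := os.Open(filepath.Join(cgroupDir, "memory.events"))
	if err != nil {
		return nil, err
	}
	defer f.Close()

	counters := make(map[string]uint64)
	s := bufio.NewScanner(f)
	for s.Scan() {
		fields := strings.Fields(s.Text())
		if len(fields) != 2 {
			continue
		}
		v, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			return nil, err
		}
		counters[fields[0]] = v
	}
	return counters, s.Err()
}

func main() {
	// Example invocation (paths depend on the node's cgroup layout):
	//   go run memevents.go /sys/fs/cgroup/kubepods.slice/kubepods-pod<uid>.slice \
	//     /sys/fs/cgroup/kubepods.slice/kubepods-pod<uid>.slice/cri-containerd-<id>.scope
	for _, dir := range os.Args[1:] {
		counters, err := readMemoryEvents(dir)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		fmt.Printf("%s: oom=%d oom_kill=%d\n", dir, counters["oom"], counters["oom_kill"])
	}
}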

@k8s-ci-robot commented:

Hi @jepio. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.


theopenlab-ci bot commented Dec 3, 2021: Build succeeded.

@jepio force-pushed the jepio/fix-cgroupv2-oom-event branch from 1c61c96 to be5fe70 on December 3, 2021 14:02
theopenlab-ci bot commented Dec 3, 2021: Build succeeded.

theopenlab-ci bot commented Dec 3, 2021: Build succeeded.

@jepio (Contributor, author) commented Dec 3, 2021

Windows flaked.

/retest

@k8s-ci-robot commented:

@jepio: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

Windows flaked.

/retest


@dmcgowan added this to Ready For Review in Code Review Dec 13, 2021
@jepio (Contributor, author) commented Jan 21, 2022

Hi, is anyone willing to review this?

@kzys (Member) left a comment

The change looks good to me, but I don't know much about cgroups. Also, does it make sense to add some tests around this?

@AkihiroSuda - Can you take a look?

@AkihiroSuda (Member) commented:

Shouldn't this be emitted with another event type?

@jepio (Contributor, author) commented Feb 3, 2022

I don't think so. The /tasks/oom event was supposed to mean "the task was killed due to OOM". An OOM event caused by a task that does not result in the task being killed is not very interesting, so I would say oom_kill is what should have been used all along.

But right now this event is not generated at all in the common case (a single container in a pod), as described in the PR. This is a consequence of limits being evaluated from the root down and of the pod (slice) having the same limit as the container (scope): the pod hits OOM, but the container gets killed as a consequence.
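To make the proposed change concrete, here is a minimal, self-contained sketch of the idea (this is not the actual containerd diff, which reacts to memory.events change notifications rather than polling): watch a container scope's memory.events and report only when the oom_kill counter advances, instead of the oom counter.

package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
	"time"
)

// oomKillCount returns the oom_kill counter from a cgroup v2 memory.events file.
func oomKillCount(cgroupDir string) (uint64, error) {
	f, err := os.Open(filepath.Join(cgroupDir, "memory.events"))
	if err != nil {
		return 0, err
	}
	defer f.Close()

	s := bufio.NewScanner(f)
	for s.Scan() {
		fields := strings.Fields(s.Text())
		if len(fields) == 2 && fields[0] == "oom_kill" {
			return strconv.ParseUint(fields[1], 10, 64)
		}
	}
	return 0, s.Err()
}

func main() {
	scope := os.Args[1] // container scope directory under /sys/fs/cgroup

	var last uint64
	for {
		cur, err := oomKillCount(scope)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		// A kill in this cgroup advances oom_kill here even when the 'oom'
		// counter is charged to the parent pod slice, which is the case
		// this PR addresses.
		if cur > last {
			fmt.Printf("OOM kill observed in %s (oom_kill=%d)\n", scope, cur)
			last = cur
		}
		time.Sleep(time.Second)
	}
}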

Commit message: cgroup2: monitor OOMKill instead of OOM to prevent missing container OOM events

With the cgroupv2 configuration employed by Kubernetes, the pod cgroup (slice)
and container cgroup (scope) will both have the same memory limit applied. In
that situation, the kernel will consider an OOM event to be triggered by the
parent cgroup (slice), and increment 'oom' there. The child cgroup (scope) only
sees an oom_kill increment. Since we monitor child cgroups for oom events,
check the OOMKill field so that we don't miss events.

This is not visible when running containers through docker or ctr, because they
set the limits differently (only container level). An alternative would be to
not configure limits at the pod level - that way the container limit will be
hit and the OOM will be correctly generated. An interesting consequence is that
when spawning a pod with multiple containers, the oom events also work
correctly, because:

a) if one of the containers has no limit, the pod has no limit so OOM events in
   another container report correctly.
b) if all of the containers have limits then the pod limit will be the sum of
   the container limits, so a container will be able to hit its own limit first.

Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
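A concrete illustration of case (b) above, assuming the kubelet sets the pod limit to the sum of the container limits as the commit message describes: with two containers limited to 256Mi and 512Mi, the pod slice gets memory.max = 768Mi, so a container that over-allocates hits its own 256Mi or 512Mi scope limit before the 768Mi slice limit, and the OOM kill is attributed to that container's cgroup.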
@jepio force-pushed the jepio/fix-cgroupv2-oom-event branch from be5fe70 to 7275411 on February 3, 2022 12:39
@jepio (Contributor, author) commented Feb 3, 2022

https://github.com/containerd/containerd/blob/main/pkg/cri/server/events.go#L333-L349
This code sets status.Reason = oomExitReason and oomExitReason = "OOMKilled", so I believe looking at oom_kill will maintain the desired semantics of the event.

@jepio (Contributor, author) commented Feb 10, 2022

@AkihiroSuda, do you find my reasoning convincing?

@whites11 commented:

What is the state of this PR? It prevents the Kubernetes Vertical Pod Autoscaler from reacting to OOMs correctly, so it has a potentially big impact on Kubernetes users.

@AkihiroSuda added the impact/changelog and cherry-pick/1.6.x labels Mar 23, 2022
@AkihiroSuda requested review from dims and fuweid March 23, 2022 14:48
@fuweid (Member) commented Mar 23, 2022

Will take a look tomorrow~

@fuweid (Member) commented Mar 24, 2022

@jepio @AkihiroSuda This change looks good to me. This Linux patch updates the cgroup v2 oom description: https://lore.kernel.org/lkml/20181004214050.7417-1-guro@fb.com/T/. I think oom_kill could be more compatible with cgroup v1.

@fuweid merged commit e7cba85 into containerd:main Mar 24, 2022
@AkihiroSuda added the cherry-picked/1.6.x and cherry-pick/1.5.x labels and removed the cherry-pick/1.6.x and cherry-picked/1.6.x labels Mar 24, 2022
@AkihiroSuda added the cherry-picked/1.6.x and cherry-picked/1.5.x labels and removed the cherry-pick/1.5.x label Mar 24, 2022
@jepio deleted the jepio/fix-cgroupv2-oom-event branch March 24, 2022 10:56
Labels: cherry-picked/1.5.x, cherry-picked/1.6.x, impact/changelog, needs-ok-to-test