[release/1.5 backport] cgroup2: monitor OOMKill instead of OOM to prevent missing container events #6735

Merged
merged 1 commit into from Mar 24, 2022


AkihiroSuda
Member

Cherry-pick (clean)

When running on cgroup2 with a pod that has a single container and a memory limit, no /tasks/oom events are currently generated. This is because the pod cgroup and the container cgroup both get the same memory.max setting, so the oom events get counted against the pod cgroup, while containerd monitors container cgroups for oom events. Fix that by monitoring and reporting oom_kill instead.
oom_kill events are counted against both the pod and container cgroups.
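
To make the difference between the two counters concrete, here is a minimal sketch that parses the flat "key value" text of a cgroup v2 memory.events file and compares an oom-based check with an oom_kill-based one. The function name `parseMemoryEvents` and the sample counter values are assumptions for illustration only; this is not containerd's code.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseMemoryEvents parses the "key value" lines of a cgroup v2
// memory.events file into a map. Hypothetical helper, not containerd API.
func parseMemoryEvents(content string) (map[string]uint64, error) {
	events := make(map[string]uint64)
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue // skip blank or malformed lines
		}
		v, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			return nil, fmt.Errorf("parse %q: %w", sc.Text(), err)
		}
		events[fields[0]] = v
	}
	return events, sc.Err()
}

func main() {
	// Made-up snapshot of a container cgroup whose pod cgroup has the
	// same memory.max: the kill is recorded here, the oom is not.
	container := "low 0\nhigh 0\nmax 12\noom 0\noom_kill 1\n"

	ev, err := parseMemoryEvents(container)
	if err != nil {
		panic(err)
	}
	fmt.Println("detected via oom:     ", ev["oom"] > 0)
	fmt.Println("detected via oom_kill:", ev["oom_kill"] > 0)
}
```

With this sample, a monitor that only looks at `oom` sees nothing, while one that looks at `oom_kill` notices the kill.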

My test case was the following kubernetes manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: stress
  name: stress
spec:
  replicas: 1
  selector:
    matchLabels:
      app: stress
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: stress
    spec:
      containers:
      - image: progrium/stress
        resources:
          limits:
            memory: "256Mi"
        name: stress
        args:
        - "--vm-bytes"
        - "128m"
        - "--vm"
        - "10"
status: {}

Related issue: k3s-io/k3s#4572

…OOM events

With the cgroupv2 configuration employed by Kubernetes, the pod cgroup (slice)
and container cgroup (scope) will both have the same memory limit applied. In
that situation, the kernel will consider an OOM event to be triggered by the
parent cgroup (slice), and increment 'oom' there. The child cgroup (scope) only
sees an oom_kill increment. Since we monitor child cgroups for oom events,
check the OOMKill field so that we don't miss events.

This is not visible when running containers through docker or ctr, because they
set the limits differently (only container level). An alternative would be to
not configure limits at the pod level - that way the container limit will be
hit and the OOM will be correctly generated. An interesting consequence is that
when spawning a pod with multiple containers, the oom events also work
correctly, because:

a) if one of the containers has no limit, the pod has no limit so OOM events in
   another container report correctly.
b) if all of the containers have limits then the pod limit will be a sum of
   container limits, so a container will be able to hit its limit first.
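
The two cases above can be illustrated with a simplified model of how the pod limit follows from the container limits. This is a sketch under stated assumptions: `podMemoryLimit` is a hypothetical helper, a limit of 0 stands for "no limit", and details such as init containers are ignored.

```go
package main

import "fmt"

const mi = int64(1) << 20 // one mebibyte in bytes

// podMemoryLimit returns the pod-level memory limit in this simplified
// model: the sum of the container limits if every container has one,
// otherwise no limit. A container limit of 0 means "no limit".
func podMemoryLimit(containerLimits []int64) (limit int64, limited bool) {
	if len(containerLimits) == 0 {
		return 0, false
	}
	for _, l := range containerLimits {
		if l <= 0 {
			return 0, false // case a): one unlimited container, unlimited pod
		}
		limit += l // case b): limits add up
	}
	return limit, true
}

func main() {
	// Single container: pod limit equals the container limit, which is
	// the configuration where the oom counter lands on the pod cgroup.
	l, ok := podMemoryLimit([]int64{256 * mi})
	fmt.Printf("one container:  %dMi limited=%v\n", l/mi, ok)

	// Two limited containers: the pod limit exceeds each container limit,
	// so a container can hit its own limit first.
	l, ok = podMemoryLimit([]int64{256 * mi, 128 * mi})
	fmt.Printf("two containers: %dMi limited=%v\n", l/mi, ok)

	// One container without a limit: the pod has no limit either.
	_, ok = podMemoryLimit([]int64{256 * mi, 0})
	fmt.Printf("mixed:          limited=%v\n", ok)
}
```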

Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
(cherry picked from commit 7275411)
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>

theopenlab-ci bot commented Mar 24, 2022

Build succeeded.

Member

@estesp estesp left a comment


LGTM

@estesp estesp merged commit a553ec5 into containerd:release/1.5 Mar 24, 2022