[release/1.5 backport] cgroup2: monitor OOMKill instead of OOM to prevent missing container events #6735

AkihiroSuda · 2022-03-24T08:06:21Z

Cherry-pick (clean)

cgroup2: monitor OOMKill instead of OOM to prevent missing container events #6323

When running on cgroup2 in a configuration with one container per pod and a memory limit, no /tasks/oom events are currently generated. This is because the pod cgroup and the container cgroup both get the same memory.max setting, so the oom events are counted against the pod cgroup, while containerd monitors container cgroups for oom events. Fix that by monitoring oom_kill instead and reporting that.
oom_kill events are counted against both the pod and container cgroups.
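
To illustrate the difference between the two counters, here is a minimal Go sketch that parses the contents of a cgroup v2 memory.events file and checks whether oom_kill increased between two snapshots. This is not the code from this PR: the type memoryEvents, the helpers parseMemoryEvents and oomKilled, and the sample counter values are all hypothetical and only meant to show why watching oom alone on the container cgroup can miss the event.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// memoryEvents holds the two counters from a cgroup v2 memory.events file
// that matter here. (Hypothetical type, not containerd's.)
type memoryEvents struct {
	OOM     uint64
	OOMKill uint64
}

// parseMemoryEvents parses the "key value" lines of a memory.events file.
// Keys other than oom and oom_kill are ignored.
func parseMemoryEvents(content string) (memoryEvents, error) {
	var ev memoryEvents
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue // skip blank or malformed lines
		}
		v, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			return ev, fmt.Errorf("parse %q: %w", sc.Text(), err)
		}
		switch fields[0] {
		case "oom":
			ev.OOM = v
		case "oom_kill":
			ev.OOMKill = v
		}
	}
	return ev, sc.Err()
}

// oomKilled reports whether oom_kill increased between two snapshots.
func oomKilled(prev, cur memoryEvents) bool {
	return cur.OOMKill > prev.OOMKill
}

func main() {
	// Made-up snapshots of a container cgroup: oom stays at 0 because the
	// event is counted against the pod cgroup, but oom_kill still rises.
	before, _ := parseMemoryEvents("low 0\nhigh 0\nmax 12\noom 0\noom_kill 0\n")
	after, _ := parseMemoryEvents("low 0\nhigh 0\nmax 40\noom 0\noom_kill 1\n")
	fmt.Println("oom increased:", after.OOM > before.OOM)
	fmt.Println("oom_kill increased:", oomKilled(before, after))
}
```

A monitor that only compared the oom field would see no change in this scenario, while comparing oom_kill catches the kill.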

My test case was the following Kubernetes manifest:
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: stress
  name: stress
spec:
  replicas: 1
  selector:
    matchLabels:
      app: stress
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: stress
    spec:
      containers:
      - image: progrium/stress
        resources:
          limits:
            memory: "256Mi"
        name: stress
        args:
        - "--vm-bytes"
        - "128m"
        - "--vm"
        - "10"
status: {}
Related issue: k3s-io/k3s#4572

…OOM events

With the cgroupv2 configuration employed by Kubernetes, the pod cgroup (slice) and container cgroup (scope) will both have the same memory limit applied. In that situation, the kernel will consider an OOM event to be triggered by the parent cgroup (slice), and increment 'oom' there. The child cgroup (scope) only sees an oom_kill increment. Since we monitor child cgroups for oom events, check the OOMKill field so that we don't miss events.

This is not visible when running containers through docker or ctr, because they set the limits differently (only container level). An alternative would be to not configure limits at the pod level - that way the container limit will be hit and the OOM will be correctly generated.

An interesting consequence is that when spawning a pod with multiple containers, the oom events also work correctly, because:
a) if one of the containers has no limit, the pod has no limit, so OOM events in another container are reported correctly.
b) if all of the containers have limits, then the pod limit will be the sum of the container limits, so a container will be able to hit its limit first.

Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
(cherry picked from commit 7275411)
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
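
The reasoning in points a) and b) of the commit message can be sketched in Go as follows. This only models the rule as described there (pod limit is absent if any container is unlimited, otherwise it is the sum of the container limits); podMemoryLimit and containerHitsLimitFirst are hypothetical helpers, not Kubernetes or containerd APIs, and a limit of 0 is assumed here to mean "no limit".

```go
package main

import "fmt"

// podMemoryLimit models the pod-level limit described in the commit message.
// Limits are in bytes; 0 is assumed to mean "no limit". It returns false when
// the pod has no limit (no containers, or at least one unlimited container).
func podMemoryLimit(containerLimits []int64) (int64, bool) {
	if len(containerLimits) == 0 {
		return 0, false
	}
	var sum int64
	for _, l := range containerLimits {
		if l <= 0 {
			return 0, false // case a): one unlimited container, no pod limit
		}
		sum += l
	}
	return sum, true // case b): pod limit is the sum of container limits
}

// containerHitsLimitFirst reports whether a container with the given limit
// reaches its own limit strictly before the pod limit can be reached.
func containerHitsLimitFirst(limit int64, containerLimits []int64) bool {
	podLimit, ok := podMemoryLimit(containerLimits)
	if !ok {
		return limit > 0 // no pod limit: only the container limit applies
	}
	return limit < podLimit
}

func main() {
	const mi = int64(1) << 20

	single := []int64{256 * mi}          // the problematic 1-container pod
	mixed := []int64{256 * mi, 0}        // case a)
	both := []int64{256 * mi, 128 * mi}  // case b)

	fmt.Println(containerHitsLimitFirst(256*mi, single)) // limits are equal
	fmt.Println(containerHitsLimitFirst(256*mi, mixed))
	fmt.Println(containerHitsLimitFirst(256*mi, both))
}
```

Only the single-container case leaves the container limit equal to the pod limit, which is the situation where the 'oom' counter ends up on the pod cgroup.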

theopenlab-ci · 2022-03-24T08:13:55Z

Build succeeded.

containerd-build-arm64 : SUCCESS in 4m 22s (non-voting)
containerd-test-arm64 : RETRY_LIMIT in 22s (non-voting)
containerd-integration-test-arm64 : FAILURE in 1m 36s (non-voting)

estesp

LGTM

AkihiroSuda mentioned this pull request Mar 24, 2022

cgroup2: monitor OOMKill instead of OOM to prevent missing container events #6323

Merged

fuweid approved these changes Mar 24, 2022


estesp approved these changes Mar 24, 2022


estesp merged commit a553ec5 into containerd:release/1.5 Mar 24, 2022

voelzmo mentioned this pull request Jul 11, 2022

Use systemd as cgroup driver gardener/gardener#5325

Open

7 tasks
