cgroup2: monitor OOMKill instead of OOM to prevent missing container events #6323
Conversation
Hi @jepio. Thanks for your PR. I'm waiting for a containerd member to verify that this patch is reasonable to test. Once the patch is verified, the new status will be reflected. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Build succeeded.
Force-pushed from 1c61c96 to be5fe70.
Build succeeded.
Windows flaked. /retest
@jepio: Cannot trigger testing until a trusted user reviews the PR and approves it for testing.
Hi, is anyone willing to review this?
The change looks good to me, but I don't know much about cgroups. Also, does it make sense to add some tests around this?
@AkihiroSuda - Can you take a look?
Shouldn't this be emitted as a different event type?
I don't think so - the /tasks/oom event was supposed to mean "task was killed due to OOM". An OOM event caused by a task that does not result in that task being killed is not very interesting, so I would say "oom_kill" is what should have been used all along. But right now this event is not being generated at all in the common case (single container in a pod), as described in the PR. This is a consequence of limits being evaluated from the root down, with the pod (slice) having the same limit as the container (scope): the pod is OOM, but the container gets killed as a consequence.
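The asymmetry described above can be illustrated by parsing `memory.events` for the pod slice and the container scope. The counter values below are hypothetical, but follow the layout described in this comment: the kernel charges `oom` only to the slice, while `oom_kill` shows up in both.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseMemoryEvents parses the flat key/value format of a cgroup v2
// memory.events file (e.g. "oom 1\noom_kill 1\n") into a map.
func parseMemoryEvents(contents string) map[string]uint64 {
	events := map[string]uint64{}
	sc := bufio.NewScanner(strings.NewReader(contents))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue
		}
		n, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			continue
		}
		events[fields[0]] = n
	}
	return events
}

func main() {
	// Hypothetical counters after one OOM kill, with the pod slice and
	// container scope sharing the same memory.max.
	slice := parseMemoryEvents("low 0\nhigh 0\nmax 4\noom 1\noom_kill 1\n")
	scope := parseMemoryEvents("low 0\nhigh 0\nmax 4\noom 0\noom_kill 1\n")

	// containerd monitors the container scope, so watching "oom" there
	// misses the kill; watching "oom_kill" catches it.
	fmt.Println("scope oom:", scope["oom"])           // prints 0 -> event missed
	fmt.Println("scope oom_kill:", scope["oom_kill"]) // prints 1 -> event seen
	fmt.Println("slice oom:", slice["oom"])           // prints 1
}
```

In a real system these strings would come from `/sys/fs/cgroup/.../memory.events`; the parsing shown here is a sketch, not containerd's actual code.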
…OOM events

With the cgroupv2 configuration employed by Kubernetes, the pod cgroup (slice) and container cgroup (scope) both have the same memory limit applied. In that situation, the kernel considers an OOM event to be triggered by the parent cgroup (slice) and increments 'oom' there; the child cgroup (scope) only sees an 'oom_kill' increment. Since we monitor child cgroups for oom events, check the OOMKill field so that we don't miss events.

This is not visible when running containers through docker or ctr, because they set the limits differently (only at the container level). An alternative would be to not configure limits at the pod level - that way the container limit would be hit and the OOM would be correctly generated.

An interesting consequence is that when spawning a pod with multiple containers, the oom events also work correctly, because:

a) if one of the containers has no limit, the pod has no limit, so OOM events in another container report correctly.
b) if all of the containers have limits, then the pod limit will be the sum of the container limits, so a container will be able to hit its own limit first.

Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
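The fix this commit message describes boils down to keying event generation on the `oom_kill` counter rather than `oom`. A minimal sketch of that check (the names here are illustrative, not containerd's actual identifiers):

```go
package main

import "fmt"

// memoryEvents mirrors the counters in a cgroup v2 memory.events file.
type memoryEvents struct {
	OOM     uint64
	OOMKill uint64
}

// shouldEmitOOMEvent reports whether an OOM event should be generated,
// by checking whether oom_kill advanced since the previous read. Keying
// on OOMKill instead of OOM catches kills whose "oom" event the kernel
// attributed to the parent pod slice rather than the container scope.
func shouldEmitOOMEvent(prev, cur memoryEvents) bool {
	return cur.OOMKill > prev.OOMKill
}

func main() {
	prev := memoryEvents{OOM: 0, OOMKill: 0}
	// After the kill described in the PR: the container scope sees only
	// oom_kill increment, while its oom counter stays at 0.
	cur := memoryEvents{OOM: 0, OOMKill: 1}

	fmt.Println("old check (oom):", cur.OOM > prev.OOM)                 // prints false -> missed
	fmt.Println("new check (oom_kill):", shouldEmitOOMEvent(prev, cur)) // prints true
}
```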
Force-pushed from be5fe70 to 7275411.
https://github.com/containerd/containerd/blob/main/pkg/cri/server/events.go#L333-L349
@AkihiroSuda, do you find my reasoning convincing?
What is the state of this PR? This makes the Vertical Pod Autoscaler (Kubernetes) fail to react to OOMs correctly, so it potentially has a big impact on Kubernetes users.
Will take a look tomorrow~
@jepio @AkihiroSuda This change looks good to me. This Linux patch updates the cgroupv2 oom description: https://lore.kernel.org/lkml/20181004214050.7417-1-guro@fb.com/T/. I think oom_kill would be more compatible with cgroupv1.
When running on cgroup2, in a 1-container-pod-with-memory-limit configuration, no `/tasks/oom` events are currently generated. This is because the pod cgroup and container cgroup both get the same `memory.max` setting, and the `oom` event gets counted against the pod cgroup, while containerd monitors container cgroups for oom events. Fix that by monitoring `oom_kill` instead and reporting that; `oom_kill` events are counted against both the pod and container cgroups.

My test case was the following kubernetes manifest:
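The original manifest is not preserved in this capture. Illustratively, a pod along these lines matches the scenario described (a single container with a memory limit, so kubelet applies the same `memory.max` to both the pod slice and the container scope); this is not the author's exact manifest:

```yaml
# Illustrative only: single-container pod with a memory limit.
apiVersion: v1
kind: Pod
metadata:
  name: oom-test
spec:
  restartPolicy: Never
  containers:
  - name: stress
    image: busybox
    command: ["sh", "-c", "tail /dev/zero"]  # grows memory until OOM-killed
    resources:
      limits:
        memory: "64Mi"
```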
Related issue: k3s-io/k3s#4572