
containerd-shim processes are leaking inotify instances with cgroups v2 #5670

Closed
bharathguvvala opened this issue Jun 29, 2021 · 12 comments
Labels: cherry-picked/1.5.x (PR commits are cherry-picked into release/1.5 branch), kind/bug, priority/P1

Comments

@bharathguvvala

bharathguvvala commented Jun 29, 2021

Description

containerd-shim processes are leaking inotify instances when set up with cgroups v2. The following is the count of inotify instances held by the shim processes (other processes have been filtered out of the output). The count only goes up as time progresses. With cgroups v1, no inotify instances are held by the containerd-shim processes. I suspect this could be due to a runc PR that listens for OOM events using the inotify mechanism, but that code appears to handle and close its inotify instances cleanly. Since I don't have much insight into the containerd-shim and runc interaction, I suspect some corner case in that interaction is causing the leak.

To add more context, this is running on a Kubernetes node with containerd as the runtime. Pods that are crash-looping contribute the most to the problem.

$ find /proc/*/fd -lname anon_inode:inotify | cut -d/ -f3 | xargs -I '{}' -- ps --no-headers -o '%p %U %c %a %P' -p '{}' | uniq -c | sort -nr | grep root

COUNT PID USER COMMAND COMMAND PPID
46 3276111 root containerd-shim /usr/local/bin/containerd-s 1
15 1585532 root containerd-shim /usr/local/bin/containerd-s 1
14 19306 root containerd-shim /usr/local/bin/containerd-s 1
14 19279 root containerd-shim /usr/local/bin/containerd-s 1
7 2937840 root containerd-shim /usr/local/bin/containerd-s 1
3 28459 root containerd-shim /usr/local/bin/containerd-s 1
3 19232 root containerd-shim /usr/local/bin/containerd-s 1
2 3251811 root containerd-shim /usr/local/bin/containerd-s 1
2 19200 root containerd-shim /usr/local/bin/containerd-s 1
2 19156 root containerd-shim /usr/local/bin/containerd-s 1
1 3602946 root containerd-shim /usr/local/bin/containerd-s 1
1 3171279 root containerd-shim /usr/local/bin/containerd-s 1
1 213009 root containerd /usr/local/bin/containerd 1

Steps to reproduce the issue:

  1. Set up containerd on a Linux machine with cgroups v2
  2. List the inotify instances held by the containerd-shim processes and observe them over a period of time (for example with the sampling loop below)
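
A minimal way to watch the growth over time, reusing the find/ps pipeline from the description (the 60-second sampling interval is arbitrary; the loop simply counts inotify fds owned by containerd-shim processes):

$ while true; do date; find /proc/*/fd -lname anon_inode:inotify 2>/dev/null | cut -d/ -f3 | xargs -I '{}' -- ps --no-headers -o comm= -p '{}' 2>/dev/null | grep -c containerd-shim; sleep 60; done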

Describe the results you received:
The number of inotify instances increases in an unbounded manner, raising the concern that fs.inotify.max_user_instances will eventually be breached. I don't currently know the impact on containerd-shim once that limit is hit. On a Kubernetes node, 'kubectl logs -f' on any pod would then fail because it cannot create a new inotify instance.
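
For reference, the per-user limit referred to above can be checked with sysctl (the kernel default is typically 128):

$ sysctl fs.inotify.max_user_instances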

Describe the results you expected:
The number of instances should be bounded and should not increase/leak.

What version of containerd are you using:

$ containerd --version
containerd github.com/containerd/containerd v1.5.0 8c906ff108ac28da23f69cc7b74f8e7a470d1df0

Any other relevant information (runC version, CRI configuration, OS/Kernel version, etc.):

$ runc --version
runc version 1.0.0-rc93
commit: 12644e614e25b05da6fd08a38ffa0cfe1903fdec
spec: 1.0.2-dev
go: go1.16.3
libseccomp: 2.3.3
$ uname -a
Linux sparrow-hyd-playground-1-fk-prod-nodes-4-8340279 5.10.0-0.bpo.3-cloud-amd64 #1 SMP Debian 5.10.13-1~bpo10+1 (2021-02-11) x86_64 GNU/Linux
@bharathguvvala
Author

@AkihiroSuda @Random-Liu Can someone help me debug this?

@AkihiroSuda
Member

Thanks for reporting, this seems to be happening around here: https://github.com/containerd/cgroups/blob/616d4347a49f244931740491cac6fedf8d2dd873/v2/manager.go#L551-L640

Do you know whether the issue is specific to Kubernetes? How long does it take to hit the issue?

@bharathguvvala
Author

bharathguvvala commented Jun 29, 2021

@AkihiroSuda Thanks for the pointer.

Thanks for reporting, this seems to be happening around here: https://github.com/containerd/cgroups/blob/616d4347a49f244931740491cac6fedf8d2dd873/v2/manager.go#L551-L640

Do you know whether the issue is specific to Kubernetes? How long does it take to hit the issue?

Not sure if this is specific to Kubernetes. It doesn't take long to observe the issue since we run a very large number of pods per node. However, I have noticed that when a container crashes the count goes up by 1, suggesting that the previous inotify instance was not reaped.
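
A quick way to confirm the per-restart increment is to count the inotify fds of a single shim before and after its container crashes (the PID below is a placeholder for the shim of interest):

$ find /proc/<shim-pid>/fd -lname anon_inode:inotify | wc -l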

@mikebrow
Member

mikebrow commented Jun 29, 2021

https://github.com/containerd/cgroups/blob/616d4347a49f244931740491cac6fedf8d2dd873/v2/manager.go#L578-L579
Order of operations for the defers: close currently runs first (because Go processes deferred calls in reverse order)... I wouldn't think we can remove the watch after the fd has been closed, because the fd number can be reused once it is closed..?
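
For illustration, a minimal standalone sketch of the pattern under discussion (not the library code itself; it calls golang.org/x/sys/unix directly and the path is a placeholder): deferred calls run last-in-first-out, so whichever defer is registered last executes first when the function returns.

package main

import (
	"log"

	"golang.org/x/sys/unix"
)

func watch(path string) error {
	// Each InotifyInit1 call creates one inotify instance (one fd).
	fd, err := unix.InotifyInit1(0)
	if err != nil {
		return err
	}
	// Registered first, so it runs *last*: closing the fd releases the
	// instance and, per inotify(7), frees all of its watches.
	defer unix.Close(fd)

	wd, err := unix.InotifyAddWatch(fd, path, unix.IN_MODIFY)
	if err != nil {
		return err
	}
	// Registered second, so it runs *first*, while fd is still valid.
	// If the two defers were swapped, InotifyRmWatch would act on a
	// closed fd number, which could already refer to something else.
	defer unix.InotifyRmWatch(fd, uint32(wd))

	// ... block here reading events from fd ...
	return nil
}

func main() {
	// Placeholder path; a real caller would point at a cgroup's memory.events file.
	if err := watch("/sys/fs/cgroup/example/memory.events"); err != nil {
		log.Fatal(err)
	}
}

Note that if the watching goroutine never returns, neither defer runs at all and the inotify instance persists for the lifetime of the process.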

@bharathguvvala
Author

bharathguvvala commented Jun 29, 2021

https://github.com/containerd/cgroups/blob/616d4347a49f244931740491cac6fedf8d2dd873/v2/manager.go#L578-L579
Order of operations for the defers: close currently runs first (because Go processes deferred calls in reverse order)... I wouldn't think we can remove the watch after the fd has been closed, because the fd number can be reused once it is closed..?

Quoting from the inotify man page:

"When all file descriptors referring to an inotify instance have been closed (using close(2)), the underlying object and its resources are freed for reuse by the kernel; all
associated watches are automatically freed."

@mikebrow Good catch. But considering the above info, shouldn't the watches be freed automatically as the fd is closed thus making the first defer statement redundant?
In fact the observation is that the FDs are getting leaked.

@mikebrow
Member

"When all file descriptors referring to an inotify instance have been closed (using close(2)), the underlying object and its resources are freed for reuse by the kernel; all
associated watches are automatically freed."

nod..

@alakesh
Contributor

alakesh commented Jul 14, 2021

@bharathguvvala How are you running containerd with cgroup v2? I couldn't find good documentation on that and am trying to reproduce this issue on my setup.

@bharathguvvala
Author

@alakesh Sorry about the delayed response. You can start here. You need to enable the cgroup v2 (unified) hierarchy via the kernel command line in the operating system's GRUB configuration (Debian example); a rough sketch is below.
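
On a Debian-style system with systemd this usually comes down to something like the following (a sketch, not exact instructions; keep whatever is already in GRUB_CMDLINE_LINUX):

# /etc/default/grub: append the systemd switch to the kernel command line
GRUB_CMDLINE_LINUX="... systemd.unified_cgroup_hierarchy=1"

$ sudo update-grub && sudo reboot
$ stat -fc %T /sys/fs/cgroup    # prints "cgroup2fs" once the unified hierarchy is active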

@jepio
Contributor

jepio commented Dec 3, 2021

Hi, I tracked this issue down: it's not specific to k8s, but happens in any situation where a long-running containerd-shim restarts containers. Every container restart is one more leaked inotify instance, because the goroutine does not currently detect container exits. This does not happen on cgroup v1 because the associated eventfd is signalled when the container exits and is released here: https://github.com/containerd/containerd/blob/main/pkg/oom/v1/v1.go#L124-L130.

This is easy to reproduce on current Flatcar releases because we switched to cgroup2. Run k8s and start a crashing container that gets auto-restarted (the manifest from #6323 is enough).

I've prepared a fix in containerd/cgroups#212, which mirrors what we do for cgroup v1 (get notified when the container exits). I'll be happy to prepare a PR here to update the containerd/cgroups dependency once that's merged. We'll also need the following change to handle exits gracefully:

commit 6d2d2e7ba3eb3f25d7c4f6c10b025a7636428559 (HEAD -> jepio/fix-cgroupv2-oom-event)
Author: Jeremi Piotrowski <jpiotrowski@microsoft.com>
Date:   Fri Dec 3 13:53:19 2021 +0000

    cgroup2: handle EventChan closure gracefully

    EventChan() now terminates when the cgroup exits, and closes the errorChannel.
    This is expected so check the error and skip logging when error is nil.

    Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>

diff --git a/pkg/oom/v2/v2.go b/pkg/oom/v2/v2.go
index c0717eb99..8b3e0d387 100644
--- a/pkg/oom/v2/v2.go
+++ b/pkg/oom/v2/v2.go
@@ -102,10 +102,12 @@ func (w *watcher) Add(id string, cgx interface{}) error {
                                i.ev = ev
                                w.itemCh <- i
                        case err := <-errCh:
-                               i.err = err
-                               w.itemCh <- i
-                               // we no longer get any event/err when we got an err
-                               logrus.WithError(err).Warn("error from *cgroupsv2.Manager.EventChan")
+                               if err != nil {
+                                       i.err = err
+                                       w.itemCh <- i
+                                       // we no longer get any event/err when we got an err
+                                       logrus.WithError(err).Warn("error from *cgroupsv2.Manager.EventChan")
+                               }
                                return
                        }

ffilippopoulos added a commit to utilitywarehouse/tf_kube_ignition that referenced this issue Jan 27, 2022
Helps to mitigate the inotify fd leak issue related with containerd runtime:
containerd/containerd#5670
Created as a flag to be able to support both docker and containerd runtime
setups
@kzys
Member

kzys commented May 19, 2022

It should be fixed by #6498. Thanks @jepio! I'm going to backport that to 1.5.x.

@kzys
Member

kzys commented May 19, 2022

#6961 will fix the issue in 1.5.x.

@kzys kzys added the cherry-picked/1.5.x label May 20, 2022
@kzys
Member

kzys commented Jun 6, 2022

Resolving.

@kzys kzys closed this as completed Jun 6, 2022