
containerd-shim processes are leaking inotify instances with cgroups v2 #5670

Closed
bharathguvvala opened this issue Jun 29, 2021 · 12 comments
Labels: cherry-picked/1.5.x (PR commits are cherry-picked into release/1.5 branch), kind/bug, priority/P1

Comments

@bharathguvvala

bharathguvvala commented Jun 29, 2021

Description

containerd-shim processes are leaking inotify instances when set up with cgroups v2. The following is the count of inotify instances held by the shim processes (other processes have been filtered out of the output). The count only goes up as time progresses. With cgroups v1, no inotify instances are held by the containerd-shim processes. I suspect this could be due to a runc PR that listens for OOM events using the inotify mechanism, but that code appears to handle and close its inotify instances cleanly. Since I don't have much insight into the containerd-shim and runc interaction, I suspect some corner case in that interaction is causing the leak.

To add more context, this is running on a Kubernetes node with containerd as the runtime. Pods that are crash-looping contribute the most to the problem.

$ find /proc/*/fd -lname anon_inode:inotify | cut -d/ -f3 | xargs -I '{}' -- ps --no-headers -o '%p %U %c %a %P' -p '{}' | uniq -c | sort -nr | grep root

COUNT PID USER COMMAND COMMAND PPID
46 3276111 root containerd-shim /usr/local/bin/containerd-s 1
15 1585532 root containerd-shim /usr/local/bin/containerd-s 1
14 19306 root containerd-shim /usr/local/bin/containerd-s 1
14 19279 root containerd-shim /usr/local/bin/containerd-s 1
7 2937840 root containerd-shim /usr/local/bin/containerd-s 1
3 28459 root containerd-shim /usr/local/bin/containerd-s 1
3 19232 root containerd-shim /usr/local/bin/containerd-s 1
2 3251811 root containerd-shim /usr/local/bin/containerd-s 1
2 19200 root containerd-shim /usr/local/bin/containerd-s 1
2 19156 root containerd-shim /usr/local/bin/containerd-s 1
1 3602946 root containerd-shim /usr/local/bin/containerd-s 1
1 3171279 root containerd-shim /usr/local/bin/containerd-s 1
1 213009 root containerd /usr/local/bin/containerd 1

Steps to reproduce the issue:

  1. Set up containerd on a Linux machine with cgroups v2
  2. List the inotify instances held by the containerd-shim processes and observe them over a period of time (for example with the sampling loop below)
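
A minimal way to watch the growth over time, reusing the find/ps pipeline from the description (the 60-second sampling interval is arbitrary; the loop simply counts inotify fds owned by containerd-shim processes):

$ while true; do date; find /proc/*/fd -lname anon_inode:inotify 2>/dev/null | cut -d/ -f3 | xargs -I '{}' -- ps --no-headers -o comm= -p '{}' 2>/dev/null | grep -c containerd-shim; sleep 60; done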

Describe the results you received:
The number of inotify instances increases in an unbounded manner, raising the concern that fs.inotify.max_user_instances will eventually be breached. I don't currently know the impact on containerd-shim once that limit is hit. On a Kubernetes node, 'kubectl logs -f' on any pod would then fail because it cannot create a new inotify instance.
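
For reference, the per-user limit referred to above can be checked with sysctl (the kernel default is typically 128):

$ sysctl fs.inotify.max_user_instances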

Describe the results you expected:
The number of instances should be bounded and should not increase/leak.

What version of containerd are you using:

$ containerd --version
containerd github.com/containerd/containerd v1.5.0 8c906ff108ac28da23f69cc7b74f8e7a470d1df0

Any other relevant information (runC version, CRI configuration, OS/Kernel version, etc.):

$ runc --version
runc version 1.0.0-rc93
commit: 12644e614e25b05da6fd08a38ffa0cfe1903fdec
spec: 1.0.2-dev
go: go1.16.3
libseccomp: 2.3.3
$ uname -a
Linux sparrow-hyd-playground-1-fk-prod-nodes-4-8340279 5.10.0-0.bpo.3-cloud-amd64 #1 SMP Debian 5.10.13-1~bpo10+1 (2021-02-11) x86_64 GNU/Linux
@bharathguvvala
Author

@AkihiroSuda @Random-Liu Can someone help me debug this?

@AkihiroSuda
Member

Thanks for reporting, this seems to be happening around here: https://github.com/containerd/cgroups/blob/616d4347a49f244931740491cac6fedf8d2dd873/v2/manager.go#L551-L640

Do you know whether the issue is specific to Kubernetes? How long does it take to hit the issue?

@bharathguvvala
Author

bharathguvvala commented Jun 29, 2021

@AkihiroSuda Thanks for the pointer.

Thanks for reporting, this seems to be happening around here: https://github.com/containerd/cgroups/blob/616d4347a49f244931740491cac6fedf8d2dd873/v2/manager.go#L551-L640

Do you know whether the issue is specific to Kubernetes? How long does it take to hit the issue?

Not sure if this is specific to Kubernetes. It doesn't take long to observe the issue since we run a very large number of pods per node. However, I have noticed that when a container crashes the count goes up by 1, suggesting that the previous inotify instance was not reaped.
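
A quick way to confirm the per-restart increment is to count the inotify fds of a single shim before and after its container crashes (the PID below is a placeholder for the shim of interest):

$ find /proc/<shim-pid>/fd -lname anon_inode:inotify | wc -l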

@mikebrow
Member

mikebrow commented Jun 29, 2021

https://github.com/containerd/cgroups/blob/616d4347a49f244931740491cac6fedf8d2dd873/v2/manager.go#L578-L579
Order of operations for the defers: close currently runs first (because Go processes deferred calls in reverse order)... I wouldn't think we can remove the watch after the fd has been closed, because the fd number can be reused once it is closed..?
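
For illustration, a minimal standalone sketch of the pattern under discussion (not the library code itself; it calls golang.org/x/sys/unix directly and the path is a placeholder): deferred calls run last-in-first-out, so whichever defer is registered last executes first when the function returns.

package main

import (
	"log"

	"golang.org/x/sys/unix"
)

func watch(path string) error {
	// Each InotifyInit1 call creates one inotify instance (one fd).
	fd, err := unix.InotifyInit1(0)
	if err != nil {
		return err
	}
	// Registered first, so it runs *last*: closing the fd releases the
	// instance and, per inotify(7), frees all of its watches.
	defer unix.Close(fd)

	wd, err := unix.InotifyAddWatch(fd, path, unix.IN_MODIFY)
	if err != nil {
		return err
	}
	// Registered second, so it runs *first*, while fd is still valid.
	// If the two defers were swapped, InotifyRmWatch would act on a
	// closed fd number, which could already refer to something else.
	defer unix.InotifyRmWatch(fd, uint32(wd))

	// ... block here reading events from fd ...
	return nil
}

func main() {
	// Placeholder path; a real caller would point at a cgroup's memory.events file.
	if err := watch("/sys/fs/cgroup/example/memory.events"); err != nil {
		log.Fatal(err)
	}
}

Note that if the watching goroutine never returns, neither defer runs at all and the inotify instance persists for the lifetime of the process.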

@bharathguvvala
Author

bharathguvvala commented Jun 29, 2021

https://github.com/containerd/cgroups/blob/616d4347a49f244931740491cac6fedf8d2dd873/v2/manager.go#L578-L579
Order of operations for the defers: close currently runs first (because Go processes deferred calls in reverse order)... I wouldn't think we can remove the watch after the fd has been closed, because the fd number can be reused once it is closed..?

Quoting from the inotify man page:

"When all file descriptors referring to an inotify instance have been closed (using close(2)), the underlying object and its resources are freed for reuse by the kernel; all
associated watches are automatically freed."

@mikebrow Good catch. But considering the above info, shouldn't the watches be freed automatically as the fd is closed thus making the first defer statement redundant?
In fact the observation is that the FDs are getting leaked.

@mikebrow
Member

"When all file descriptors referring to an inotify instance have been closed (using close(2)), the underlying object and its resources are freed for reuse by the kernel; all
associated watches are automatically freed."

nod..

@alakesh
Contributor

alakesh commented Jul 14, 2021

@bharathguvvala How are you running containerd with cgroup v2? I couldn't find good documentation on that and am trying to reproduce this issue on my setup.

@bharathguvvala
Author

@alakesh Sorry about the delayed response. You can start here. You need to enable the cgroup v2 (unified) hierarchy via the kernel command line in the operating system's GRUB configuration (Debian example); a rough sketch is below.
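
On a Debian-style system with systemd this usually comes down to something like the following (a sketch, not exact instructions; keep whatever is already in GRUB_CMDLINE_LINUX):

# /etc/default/grub: append the systemd switch to the kernel command line
GRUB_CMDLINE_LINUX="... systemd.unified_cgroup_hierarchy=1"

$ sudo update-grub && sudo reboot
$ stat -fc %T /sys/fs/cgroup    # prints "cgroup2fs" once the unified hierarchy is active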

@jepio
Contributor

jepio commented Dec 3, 2021

Hi, I tracked this issue down: it's not specific to k8s, but happens in any situation where a long-running containerd-shim restarts containers. Every container restart is one more leaked inotify instance, because the goroutine does not currently detect container exits. This does not happen on cgroup v1 because the associated eventfd is signalled when the container exits and is released here: https://github.com/containerd/containerd/blob/main/pkg/oom/v1/v1.go#L124-L130.

This is easy to reproduce on current Flatcar releases because we switched to cgroup2. Run k8s and start a crashing container that gets auto-restarted (the manifest from #6323 is enough).

I've prepared a fix in containerd/cgroups#212, which mirrors what we do for cgroup v1 (get notified when the container exits). I'll be happy to prepare a PR here to update the containerd/cgroups dependency once that's merged. We'll also need the following change to handle exits gracefully:

commit 6d2d2e7ba3eb3f25d7c4f6c10b025a7636428559 (HEAD -> jepio/fix-cgroupv2-oom-event)
Author: Jeremi Piotrowski <jpiotrowski@microsoft.com>
Date:   Fri Dec 3 13:53:19 2021 +0000

    cgroup2: handle EventChan closure gracefully

    EventChan() now terminates when the cgroup exits, and closes the errorChannel.
    This is expected so check the error and skip logging when error is nil.

    Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>

diff --git a/pkg/oom/v2/v2.go b/pkg/oom/v2/v2.go
index c0717eb99..8b3e0d387 100644
--- a/pkg/oom/v2/v2.go
+++ b/pkg/oom/v2/v2.go
@@ -102,10 +102,12 @@ func (w *watcher) Add(id string, cgx interface{}) error {
                                i.ev = ev
                                w.itemCh <- i
                        case err := <-errCh:
-                               i.err = err
-                               w.itemCh <- i
-                               // we no longer get any event/err when we got an err
-                               logrus.WithError(err).Warn("error from *cgroupsv2.Manager.EventChan")
+                               if err != nil {
+                                       i.err = err
+                                       w.itemCh <- i
+                                       // we no longer get any event/err when we got an err
+                                       logrus.WithError(err).Warn("error from *cgroupsv2.Manager.EventChan")
+                               }
                                return
                        }

ffilippopoulos added a commit to utilitywarehouse/tf_kube_ignition that referenced this issue Jan 27, 2022
Helps to mitigate the inotify fd leak issue related with containerd runtime:
containerd/containerd#5670
Created as a flag to be able to support both docker and containerd runtime
setups
@kzys
Member

kzys commented May 19, 2022

It should be fixed by #6498. Thanks @jepio! I'm going to backport that to 1.5.x.

@kzys
Member

kzys commented May 19, 2022

#6961 will fix the issue in 1.5.x.

@kzys kzys added the cherry-picked/1.5.x label May 20, 2022
@kzys
Member

kzys commented Jun 6, 2022

Resolving.

@kzys kzys closed this as completed Jun 6, 2022