NRI plugin registration can trigger a deadlock #10085
/cc @bobbypage
Moved to containerd/containerd since this bug affects released versions of containerd.
A possible fix we have discussed is to move the adaptation lock acquisition so that it happens after `syncFn` is invoked. That makes the registration locking order consistent with the locking order in the `StartContainer` flow, which is also NRI lock -> adaptation lock, and should prevent the deadlock. Possible patch:
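As a rough sketch of that reordering (the function and lock names below are illustrative stand-ins, not the actual containerd/nri code): both paths take the NRI lock before the adaptation lock, so no cycle between the two locks is possible.

```go
package main

import (
	"fmt"
	"sync"
)

// Illustrative stand-ins for the two locks discussed above.
var (
	nriLock   sync.Mutex // models the lock in pkg/nri/nri.go
	adaptLock sync.Mutex // models the lock in nri/pkg/adaptation
)

// startContainer models the container lifecycle path:
// NRI lock first, then the adaptation lock.
func startContainer() {
	nriLock.Lock()
	defer nriLock.Unlock()
	adaptLock.Lock()
	defer adaptLock.Unlock()
}

// registerPlugin models the proposed registration path: syncFn (which
// takes the NRI lock internally) runs before the adaptation lock is
// acquired, so both paths respect the same NRI -> adaptation order.
func registerPlugin(syncFn func()) {
	syncFn()
	adaptLock.Lock()
	defer adaptLock.Unlock()
}

func main() {
	syncFn := func() {
		nriLock.Lock()
		defer nriLock.Unlock()
	}
	// Hammer both paths concurrently; with a consistent lock order
	// this always terminates instead of deadlocking.
	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(2)
		go func() { defer wg.Done(); startContainer() }()
		go func() { defer wg.Done(); registerPlugin(syncFn) }()
	}
	wg.Wait()
	fmt.Println("no deadlock")
}
```

With the pre-fix ordering (adaptation lock held while `syncFn` takes the NRI lock), the same stress loop can hang.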
|
I have a repro of the deadlock:
Here is a stack trace from the repro. As mentioned in the original comment, we have:
and many other containers stuck acquiring the NRI lock:
During NRI external plugin registration:

* `acceptPluginConnections()` is called
* the adaptation lock from `nri/pkg/adaptation` is acquired
* `syncFn` is invoked
* `syncFn` acquires the NRI lock in `pkg/nri/nri.go`

During container lifecycle events such as `ContainerStart`:

* the NRI lock is acquired in `pkg/nri/nri.go`
* the adaptation lock is acquired in `StateChange()` in `nri/pkg/adaptation`

As a result, the locking order during NRI plugin registration is adaptation lock -> NRI lock, while the locking order during container starts is NRI lock -> adaptation lock. Because the locking order is inverted and not consistent, it is possible to encounter a deadlock.

To fix the issue, during NRI plugin registration, first acquire the NRI lock (done via the `syncFn` call) and only afterwards acquire the adaptation lock. This makes the registration locking order NRI lock -> adaptation lock, which is consistent with the locking order during container lifecycle events.

Fixes containerd/containerd#10085
Signed-off-by: David Porter <porterdavid@google.com>
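One way to keep an invariant like this from regressing is to encode the lock hierarchy explicitly and flag out-of-order acquisitions. A toy single-goroutine sketch (the `rankedMutex` type, names, and ranks are invented for illustration and are not part of containerd):

```go
package main

import (
	"fmt"
	"sync"
)

// rankedMutex flags any acquisition of a lock while an equal- or
// higher-ranked lock is already held. For simplicity this toy tracks
// a single goroutine; real detectors track per-goroutine state.
type rankedMutex struct {
	mu   sync.Mutex
	rank int
	name string
}

var held []*rankedMutex // locks currently held, in acquisition order

func (m *rankedMutex) lock() error {
	for _, h := range held {
		if h.rank >= m.rank {
			return fmt.Errorf("lock order violation: %s (rank %d) requested while holding %s (rank %d)",
				m.name, m.rank, h.name, h.rank)
		}
	}
	m.mu.Lock()
	held = append(held, m)
	return nil
}

func (m *rankedMutex) unlock() {
	held = held[:len(held)-1] // assumes LIFO unlock order
	m.mu.Unlock()
}

func main() {
	nri := &rankedMutex{rank: 1, name: "nri"}
	adapt := &rankedMutex{rank: 2, name: "adaptation"}

	// Correct order (the fixed flows): nri -> adaptation.
	if err := nri.lock(); err == nil {
		if err := adapt.lock(); err == nil {
			adapt.unlock()
		}
		nri.unlock()
		fmt.Println("order ok")
	}

	// Inverted order (the pre-fix registration path) is flagged.
	_ = adapt.lock()
	if err := nri.lock(); err != nil {
		fmt.Println("caught:", err)
	}
	adapt.unlock()
}
```

The same idea underlies lock-order checkers such as the Linux kernel's lockdep; a fixed documented hierarchy turns a runtime deadlock into an immediate, reproducible failure.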
I applied the proposed patch in #10085 (comment) and reran the repro steps with a custom-built containerd, and have not been able to repro the deadlock. Opened containerd/nri#79 with the proposed patch to get feedback.
Let's leave this open and use a PR in this repo that updates the nri import to be the closing PR for the bug.
Thank you @samuelkarp and @estesp for the reviews! Next step -- we need a new cut of containerd/nri... Can maintainers please help to get one started? Thank you!
Fixes containerd#10085
Signed-off-by: Samuel Karp <samuelkarp@google.com>
Lost track of this issue, but yes this was done yesterday: https://github.com/containerd/nri/releases/tag/v0.6.1 Dependency bumped in main: #10089. I'll open a cherry-pick to 1.7 shortly.
Fixes containerd#10085
Signed-off-by: Samuel Karp <samuelkarp@google.com>
(cherry picked from commit a153b2c)
Signed-off-by: Samuel Karp <samuelkarp@google.com>
Thanks @samuelkarp for doing the dependency update. Since this is now fixed on main and in 1.7 with #10097, I think we can close this out. The fix will go into the next containerd 1.7 release, so 1.7.16+.
Yep, I was keeping it open until we have a new 1.7 release out.
We have a plugin that is registered to containerd externally (as opposed to being pre-registered). This plugin is deployed as a k8s DaemonSet.
We've detected a deadlock in containerd v1.7.3 (which uses containerd/nri 0.4.0). It looks to still be unfixed.
There are two involved locks: the adaptation.go lock, and the nri.go lock.
The deadlock can happen because these independent routines acquire the locks in inverse order from each other: plugin registration acquires the adaptation.go lock and then the nri.go lock, while a `StartContainer` event can occur in which the nri.go lock is acquired first and the adaptation.go lock is requested afterwards. Other events do exactly the same, so it's not limited to `StartContainer`. The stack traces that confirm this are below.
The plugin registration stack trace:
The StartContainer stack trace:
The effects of this bug are that the plugin is stuck (the `Synchronize` callback is never invoked) and containerd is unable to process certain events (such as `StartContainer`). The only remedy appears to be restarting containerd.