Implement Graceful Node Shutdown in Kubelet #96129
Conversation
@bobbypage: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the `triage/accepted` label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/sig node
This PR may require API review. If so, when the changes are ready, complete the pre-review checklist and request an API review. Status of requested reviews is tracked in the API Review project.
/assign
Force-pushed the branch from 6a32d95 to a8ad991.
Having read the KEP, I have only one real question. If I have a pod that wants 5m of gracePeriod and it lands on a preemptible VM which can only ever deliver 30 seconds grace, should the scheduler not put it there? IOW, do we need a node to declare an upper bound for grace period that we can respect when scheduling?
@thockin, thanks for taking a look at the KEP and PR. Regarding the question you're raising, this is definitely a valid point -- the node doesn't expose the "supported" grace period to the scheduler, so the scheduler isn't able to take this into account.

My thoughts on this -- today during a node shutdown, gracePeriod on PodSpec is completely ignored by both the scheduler and the kubelet; there is no graceful node shutdown at all. This PR improves on that: with the feature enabled, pods will get SOME gracePeriod to shut down gracefully upon a shutdown, so I think this is a net improvement over the current situation.

Exposing the supported gracePeriod as part of the NodeSpec up to the scheduler is something we should consider moving forward for this effort, and we should bring on board scheduling SIG folks to get their thoughts. One thing that comes to mind, though: if the scheduler suddenly started respecting the gracePeriod supported by the node, it could be a breaking change for users. E.g., if my cluster is full of PVMs that support 30 seconds and my pod spec gracePeriod is 5 minutes, then with the new scheduler behavior many pods would suddenly be unschedulable. That may or may not be a feature though :)
Ack that it can be integrated, but I do think we should come back to it.
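Purely to illustrate the idea raised above (no node-declared grace-period bound exists today, and none is part of this PR), a scheduler-side check might look like this sketch, where `nodeMaxGraceSeconds` is a hypothetical node-declared value and `fitsGracePeriod` is an invented helper:

```go
package main

import v1 "k8s.io/api/core/v1"

// fitsGracePeriod is hypothetical: it reports whether a pod's requested
// termination grace period could be honored by a node that declared
// nodeMaxGraceSeconds as the longest shutdown delay it can deliver.
// No such node field exists today.
func fitsGracePeriod(pod *v1.Pod, nodeMaxGraceSeconds int64) bool {
	if pod.Spec.TerminationGracePeriodSeconds == nil {
		// No explicit request; the API server defaults this to 30s.
		return nodeMaxGraceSeconds >= 30
	}
	return *pod.Spec.TerminationGracePeriodSeconds <= nodeMaxGraceSeconds
}
```

As noted in the reply above, enforcing such a filter would be a behavior change: pods requesting long grace periods would become unschedulable on nodes that cannot deliver them.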
/retest
/lgtm
/lgtm
/unhold
* Add a new package under nodeshutdown, "systemd"
* Package uses dbus to interface with logind to manage shutdown inhibitors
* Make github.com/godbus/dbus a new explicit dependency
* Update vendor and go modules
Implements KEP 2000, Graceful Node Shutdown: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2000-graceful-node-shutdown
* Add new FeatureGate `GracefulNodeShutdown` to control enabling/disabling the feature
* Add two new KubeletConfiguration options: `ShutdownGracePeriod` and `ShutdownGracePeriodCriticalPods`
* Add new package, `nodeshutdown`, that implements the node shutdown manager
* The node shutdown manager uses the systemd inhibit package to create a system inhibitor, monitor for node shutdown events, and gracefully terminate pods upon a node shutdown
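As a rough, standalone sketch of the mechanism these commits describe (not the PR's actual code; it assumes the `github.com/godbus/dbus/v5` module path and a systemd host, and elides error handling and the real pod-termination logic), taking a logind "delay" inhibitor lock and reacting to `PrepareForShutdown` looks roughly like:

```go
package main

import (
	"fmt"
	"os"

	"github.com/godbus/dbus/v5"
)

func main() {
	conn, err := dbus.SystemBus()
	if err != nil {
		panic(err)
	}
	logind := conn.Object("org.freedesktop.login1", "/org/freedesktop/login1")

	// Take a "delay" inhibitor lock. logind postpones shutdown until the
	// returned file descriptor is closed (bounded by InhibitDelayMaxSec
	// in logind.conf).
	var fd dbus.UnixFD
	err = logind.Call("org.freedesktop.login1.Manager.Inhibit", 0,
		"shutdown", "kubelet", "Graceful node shutdown", "delay").Store(&fd)
	if err != nil {
		panic(err)
	}
	lock := os.NewFile(uintptr(fd), "inhibitor")

	// Subscribe to PrepareForShutdown, emitted when a shutdown or reboot begins.
	conn.BusObject().Call("org.freedesktop.DBus.AddMatch", 0,
		"type='signal',interface='org.freedesktop.login1.Manager',member='PrepareForShutdown'")
	signals := make(chan *dbus.Signal, 1)
	conn.Signal(signals)

	for sig := range signals {
		if sig.Name != "org.freedesktop.login1.Manager.PrepareForShutdown" {
			continue
		}
		if down, ok := sig.Body[0].(bool); ok && down {
			fmt.Println("shutdown starting: terminate pods gracefully here")
			lock.Close() // releasing the lock lets the shutdown proceed
			return
		}
	}
}
```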
Force-pushed the branch from 216eebc to 16f71c6.
/lgtm
This reverts commit f094ddf. It didn't actually help, and it causes system shutdown to take noticeably longer, which makes the MCO tests time out. The real fix will involve backporting kubernetes/kubernetes#96129.

This mostly reverts commit f094ddf. It didn't actually help, and it causes system shutdown to take noticeably longer, which makes the MCO tests time out. The real fix will involve backporting kubernetes/kubernetes#96129. We do continue to carry the changes to update the daemonset if the readiness changes, though, because we're reverting that on upgrades in 4.7 now.
Hello. I'm following up on the documentation for this feature; there's a PR open at kubernetes/website#24918, but it's not yet ready for review.
```go
defer wg.Done()

var gracePeriodOverride int64
if kubelettypes.IsCriticalPod(pod) {
```
I commented in the KEP https://github.com/kubernetes/enhancements/pull/2001/files#r565507971 but I don't think this is appropriate. It forces an administrator deploying a system infrastructure pod to use one of these two priority classes to get this behavior, which means you can't create new priority classes for your critical infrastructure pods to control the order of eviction.
Also, I expect all static pods to be covered by this logic, because it is impossible to drain a static pod correctly outside the kubelet (since kubelet controls the lifecycle, not an outside entity).
kube-apiserver depends on SDN (for aggregation and for webhooks). If both are implemented as static pods, and SDN is able to go through a LB to reach an apiserver on another node, how can we make sure SDN stays up longer than the static kube-apiserver pod?
see thread on KEP kubernetes/enhancements#2001 (comment) regarding this discussion
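For context on what the `IsCriticalPod` branch above controls: per the KEP, critical pods get `ShutdownGracePeriodCriticalPods`, while all other pods share the remainder of `ShutdownGracePeriod`. A minimal sketch of that split (the `computeGracePeriod` helper is hypothetical, not the PR's actual function):

```go
package nodeshutdown

import (
	"time"

	v1 "k8s.io/api/core/v1"
	kubelettypes "k8s.io/kubernetes/pkg/kubelet/types"
)

// computeGracePeriod is a hypothetical helper: it returns the shutdown grace
// period, in seconds, granted to a pod during graceful node shutdown.
// Critical pods receive shutdownGracePeriodCriticalPods; every other pod
// shares whatever remains of shutdownGracePeriod.
func computeGracePeriod(pod *v1.Pod, shutdownGracePeriod, shutdownGracePeriodCriticalPods time.Duration) int64 {
	if kubelettypes.IsCriticalPod(pod) {
		return int64(shutdownGracePeriodCriticalPods.Seconds())
	}
	return int64((shutdownGracePeriod - shutdownGracePeriodCriticalPods).Seconds())
}
```

The review comments above question exactly this branch: tying the longer budget to `IsCriticalPod` means administrators must use one of the built-in critical priority classes, and cannot define new priority classes to control eviction ordering.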
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR implements KEP 2000 (Graceful Node Shutdown).
This PR makes it possible for kubelet to be aware of node shutdown events and gracefully terminate pods during a system shutdown. Refer to the KEP for more details.
This PR adds a new alpha feature gate in kubelet, `GracefulNodeShutdown`, and two new `KubeletConfiguration` options, `ShutdownGracePeriod` and `ShutdownGracePeriodCriticalPods`. With the feature gate enabled and the kubelet config options set, kubelet can delay a system shutdown by `ShutdownGracePeriod` and terminate pods gracefully prior to the node being shut down.

Which issue(s) this PR fixes:
Fixes #91472
Enhancement issue: kubernetes/enhancements#2000
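For illustration, a hedged sketch of wiring up these options programmatically, assuming the versioned `KubeletConfiguration` type in `k8s.io/kubelet/config/v1beta1` (most deployments would set the equivalent fields in the kubelet config file instead):

```go
package main

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	kubeletv1beta1 "k8s.io/kubelet/config/v1beta1"
)

func main() {
	// 30s total shutdown budget: the final 10s are reserved for critical
	// pods, so regular pods get the first 20s to terminate.
	cfg := kubeletv1beta1.KubeletConfiguration{
		FeatureGates:                    map[string]bool{"GracefulNodeShutdown": true},
		ShutdownGracePeriod:             metav1.Duration{Duration: 30 * time.Second},
		ShutdownGracePeriodCriticalPods: metav1.Duration{Duration: 10 * time.Second},
	}
	fmt.Printf("%+v\n", cfg)
}
```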
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2000-graceful-node-shutdown/README.md