Implement Graceful Node Shutdown in Kubelet #96129

bobbypage · 2020-11-02T23:53:53Z

What type of PR is this?
/kind feature

What this PR does / why we need it:

This PR implements KEP 2000 (Graceful Node Shutdown).

This PR makes it possible for kubelet to be aware of node shutdown events and gracefully terminate pods during a system shutdown. Refer to the KEP for more details.

This PR adds a new alpha feature gate in kubelet, GracefulNodeShutdown, and two new KubeletConfiguration options, ShutdownGracePeriod and ShutdownGracePeriodCriticalPods.

With the feature gate enabled and the kubelet config options set, kubelet can delay a system shutdown by ShutdownGracePeriod and terminate pods gracefully before the node shuts down.

Which issue(s) this PR fixes:

Fixes #91472
Enhancement issue: kubernetes/enhancements#2000

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Adds kubelet alpha feature `GracefulNodeShutdown`, which makes kubelet aware of node system shutdowns and results in graceful termination of pods during a system shutdown.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2000-graceful-node-shutdown/README.md


k8s-ci-robot · 2020-11-02T23:54:01Z

@bobbypage: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

bobbypage · 2020-11-02T23:54:19Z

/sig node

fejta-bot · 2020-11-03T00:00:20Z

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

bobbypage · 2020-11-03T00:00:55Z

/cc @SergeyKanzhelev @mrunalp

SergeyKanzhelev · 2020-11-03T00:04:23Z

/assign

thockin · 2020-11-03T21:54:18Z

Having read the KEP, I have only one real question. If I have a pod that wants 5m of gracePeriod and it lands on a preemptible VM which can only ever deliver 30 seconds grace, should the scheduler not put it there?

IOW, do we need a node to declare an upper-bound for grace period that we can respect when scheduling?

bobbypage · 2020-11-03T22:43:23Z

@thockin, thanks for taking a look at the KEP and PR.

Regarding the question you're raising, this is definitely a valid point -- the node doesn't expose the "supported" grace period to the scheduler, so the scheduler isn't able to take it into account.

My thoughts on this -- today, during a node shutdown, gracePeriod on PodSpec is completely ignored by both scheduler and kubelet. There is no graceful node shutdown. This PR improves on that situation: with this feature enabled, the pod will get SOME gracePeriod to shut down gracefully when the node shuts down. I think that is a net improvement over the current situation.

Exposing the supported gracePeriod as part of the NodeSpec to the scheduler is something we should consider as this effort moves forward, and we should bring scheduling SIG folks on board to get their thoughts.

One thing that comes to mind, though: if the scheduler suddenly started respecting the gracePeriod supported by the node, it could be a breaking change for users. E.g., if my cluster is full of PVMs that support 30 seconds and my pod spec gracePeriod is 5 minutes, then with the new scheduler behavior many pods would suddenly be unschedulable. That may or may not be a feature though :)

thockin · 2020-11-04T00:45:09Z

ACK that it can be integrated, but I do think we should come back to it.

bobbypage · 2020-11-12T18:07:47Z

/retest

mrunalp · 2020-11-12T20:22:46Z

/lgtm

dchen1107 · 2020-11-12T20:31:23Z

/lgtm

bobbypage · 2020-11-12T20:39:11Z

/unhold

* Add a new package under nodeshutdown "systemd"
* Package uses dbus to interface with logind to manage shutdown inhibitors
* Make github.com/godbus/dbus a new explicit dependency
* Update vendor and go modules

Implements KEP 2000, Graceful Node Shutdown: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2000-graceful-node-shutdown

* Add new FeatureGate `GracefulNodeShutdown` to control enabling/disabling the feature
* Add two new KubeletConfiguration options
  * `ShutdownGracePeriod` and `ShutdownGracePeriodCriticalPods`
* Add new package, `nodeshutdown`, that implements the Node shutdown manager
  * The node shutdown manager uses the systemd inhibit package to create a system inhibitor, monitor for node shutdown events, and gracefully terminate pods upon a node shutdown.

mrunalp · 2020-11-12T21:49:16Z

/lgtm

This reverts commit f094ddf. It didn't actually help, and causes system shutdown to take noticeably longer which makes the MCO tests time out. The real fix will involve backporting kubernetes/kubernetes#96129

This mostly reverts commit f094ddf. It didn't actually help, and causes system shutdown to take noticeably longer which makes the MCO tests time out. The real fix will involve backporting kubernetes/kubernetes#96129. We do continue to carry the changes, though, to update the daemonset if the readiness changes, because we're reverting that on upgrades in 4.7 now.

sftim · 2020-11-19T12:50:21Z

Hello. I'm following up on the documentation for this feature; there's a PR open at kubernetes/website#24918 but it's not yet ready for review.

smarterclayton · 2021-01-27T17:48:06Z

pkg/kubelet/nodeshutdown/nodeshutdown_manager_linux.go

+			defer wg.Done()
+
+			var gracePeriodOverride int64
+			if kubelettypes.IsCriticalPod(pod) {


I commented in the KEP https://github.com/kubernetes/enhancements/pull/2001/files#r565507971 but I don't think this is appropriate. It forces an administrator deploying a system infrastructure pod to use one of these two priority classes to get this behavior, which means you can't create new priority classes for your critical infrastructure pods to control the order of eviction.

Also, I expect all static pods to be covered by this logic, because it is impossible to drain a static pod correctly outside the kubelet (since kubelet controls the lifecycle, not an outside entity).

kube-apiserver depends on SDN (for aggregation and for webhooks). If both are implemented as static pods, and SDN is able to go through a LB to reach an apiserver on another node, how can we make sure SDN stays up longer than the static kube-apiserver pod?

see thread on KEP kubernetes/enhancements#2001 (comment) regarding this discussion

k8s-ci-robot added the needs-triage and needs-priority labels Nov 2, 2020

k8s-ci-robot requested review from dchen1107, mtaufen and a team November 2, 2020 23:55

k8s-ci-robot requested review from mrunalp and SergeyKanzhelev November 3, 2020 00:00

k8s-ci-robot assigned SergeyKanzhelev Nov 3, 2020

bobbypage force-pushed the graceful-node-shutdown branch 2 times, most recently from 6a32d95 to a8ad991 November 3, 2020 04:32

mrunalp reviewed Nov 3, 2020

pkg/kubelet/apis/config/validation/validation.go (outdated, resolved)

mrunalp reviewed Nov 3, 2020

pkg/kubelet/kubelet.go (outdated, resolved)

k8s-ci-robot added the lgtm label Nov 12, 2020

k8s-ci-robot removed the do-not-merge/hold label Nov 12, 2020

bobbypage added 2 commits November 12, 2020 21:46

Add systemd package to interface with dbus

2343689


bobbypage force-pushed the graceful-node-shutdown branch from 216eebc to 16f71c6 November 12, 2020 21:48

k8s-ci-robot removed the lgtm label Nov 12, 2020

k8s-ci-robot added the lgtm label Nov 12, 2020

k8s-ci-robot merged commit b2dc35d into kubernetes:master Nov 12, 2020

cgwalters mentioned this pull request Nov 13, 2020

Revert "Configure CoreDNS to shut down gracefully" openshift/cluster-dns-operator#213

Merged

github-actions bot mentioned this pull request Nov 18, 2020

Week Ending November 15, 2020 dev-obs/actus#273

Open

umialpha mentioned this pull request Jan 13, 2021

add DaemonSet eviction option for empty nodes kubernetes/autoscaler#3778

Closed

smarterclayton reviewed Jan 27, 2021

cgwalters mentioned this pull request Feb 1, 2021

daemon: block reboots during upgrades openshift/machine-config-operator#2381

Closed

yboaron mentioned this pull request Feb 3, 2021

New method for providing configurable self-hosted LB/DNS/VIP for on-prem openshift/enhancements#524

Closed

openshift-ci-robot mentioned this pull request Mar 15, 2021

Cherry pick gracefulshutdown openshift/kubernetes#617

Closed

rphillips mentioned this pull request Apr 13, 2021

kubelet: add graceful shutdown events #101081

Merged

mszostok mentioned this pull request Aug 3, 2021

Periodic cluster integration tests fail on testing that proper policy was injected capactio/capact#420

Closed

aclevername mentioned this pull request Sep 22, 2021

Delete daemonset(s) as part of cluster deletion eksctl-io/eksctl#4214

Closed
