Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Graceful Node Shutdown in Kubelet #96129

Merged
merged 2 commits into from
Nov 12, 2020

Conversation

bobbypage
Copy link
Member

@bobbypage bobbypage commented Nov 2, 2020

What type of PR is this?
/kind feature

What this PR does / why we need it:

This PR implements KEP 2000 (Graceful Node Shutdown).

This PR makes it possible for kubelet to be aware of node shutdown events and gracefully terminate pods during a system shutdown. Refer to the KEP for more details.

This PR adds a new alpha feature gate in kubelet, GracefulNodeShutdown and two new KubeletConfiguration options, ShutdownGracePeriod and ShutdownGracePeriodCriticalPods.

With the feature gate enabled and the kubelet config options set, kubelet can delay a system shutdown by ShutdownGracePeriod and terminate pods gracefully prior to the node being shutdown.

Which issue(s) this PR fixes:

Fixes #91472
Enhancement issue: kubernetes/enhancements#2000

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Adds kubelet alpha feature, `GracefulNodeShutdown` which makes kubelet aware of node system shutdowns and result in graceful termination of pods during a system shutdown.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2000-graceful-node-shutdown/README.md

- [KEP]: https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2000-graceful-node-shutdown/README.md

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 2, 2020
@k8s-ci-robot
Copy link
Contributor

@bobbypage: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 2, 2020
@bobbypage
Copy link
Member Author

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. area/dependency Issues or PRs related to dependency changes area/kubelet kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 2, 2020
@k8s-ci-robot k8s-ci-robot requested review from dchen1107, mtaufen and a team November 2, 2020 23:55
@fejta-bot
Copy link

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@bobbypage
Copy link
Member Author

/cc @SergeyKanzhelev @mrunalp

@SergeyKanzhelev
Copy link
Member

/assign

@bobbypage bobbypage force-pushed the graceful-node-shutdown branch 2 times, most recently from 6a32d95 to a8ad991 Compare November 3, 2020 04:32
@thockin
Copy link
Member

thockin commented Nov 3, 2020

Having read the KEP, I have only one real question. If I have a pod that wants 5m of gracePeriod and it lands on a preemptible VM which can only ever deliver 30 seconds grace, should the scheduler not put it there?

IOW, do we need a node to declare an upper-bound for grace period that we can respect when scheduling?

pkg/kubelet/kubelet.go Outdated Show resolved Hide resolved
@bobbypage
Copy link
Member Author

bobbypage commented Nov 3, 2020

@thockin, thanks for taking a look at the KEP and PR.

Regarding the question you’re raising, this is a definitely a completely valid point -- the node doesn’t expose the "supported" grace period to the scheduler, so the scheduler isn’t able to take this into account.

My thoughts on this -- today during a node shutdown, gracePeriod on PodSpec is completely ignored by both scheduler and kubelet. There is no graceful node shutdown. This PR improves upon that situation as upon a shutdown with this feature enabled, the pod will get SOME gracePeriod to shutdown gracefully. As a result, I think the result is a net improvement over the current situation.

Exposing the supported gracePeriod as part of the NodeSpec up to the scheduler, is something we should consider moving forward for this effort and bring on board scheduling SIG folks to get their thoughts.

One thing that comes to mind though, is if the scheduler would suddenly start respecting gracePeriod supported by the node, it could be a breaking change for users. E.g. if my cluster is full of PVMs that support 30 seconds and my pod spec gracePeriod is 5 minutes, with new scheduler behavior, suddenly many pods will be unschedulable. That may or may not be a feature though :)

@thockin
Copy link
Member

thockin commented Nov 4, 2020 via email

@bobbypage
Copy link
Member Author

/retest

@mrunalp
Copy link
Contributor

mrunalp commented Nov 12, 2020

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 12, 2020
@dchen1107
Copy link
Member

/lgtm

@bobbypage
Copy link
Member Author

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 12, 2020
* Add a new package under nodeshutdown "systemd"
  * Package uses dbus to interface with logind to manage shutdown
  inhibitors
* Make github.com/godbus/dbus a new explicit dependency
  * Update vendor and go modules
Implements KEP 2000, Graceful Node Shutdown:
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2000-graceful-node-shutdown

* Add new FeatureGate `GracefulNodeShutdown` to control
enabling/disabling the feature
* Add two new KubeletConfiguration options
  * `ShutdownGracePeriod` and `ShutdownGracePeriodCriticalPods`
* Add new package, `nodeshutdown` that implements the Node shutdown
manager
  * The node shutdown manager uses the systemd inhibit package, to
  create an system inhibitor, monitor for node shutdown events, and
  gracefully terminate pods upon a node shutdown.
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 12, 2020
@mrunalp
Copy link
Contributor

mrunalp commented Nov 12, 2020

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 12, 2020
@k8s-ci-robot k8s-ci-robot merged commit b2dc35d into kubernetes:master Nov 12, 2020
cgwalters added a commit to cgwalters/cluster-dns-operator that referenced this pull request Nov 13, 2020
This reverts commit f094ddf.
It didn't actually help, and causes system shutdown to take
noticeably longer which makes the MCO tests time out.

The real fix will involve backporting
kubernetes/kubernetes#96129
cgwalters added a commit to cgwalters/cluster-dns-operator that referenced this pull request Nov 18, 2020
This mostly reverts commit f094ddf.
It didn't actually help, and causes system shutdown to take
noticeably longer which makes the MCO tests time out.

The real fix will involve backporting
kubernetes/kubernetes#96129

We do continue carry the changes though to update the daemonset
if the readiness changes because we're reverting that on upgrades
in 4.7 now.
@sftim
Copy link
Contributor

sftim commented Nov 19, 2020

Hello. I'm following up the documentation for this feature; there's a PR open at kubernetes/website#24918 but it's not yet ready for review.

defer wg.Done()

var gracePeriodOverride int64
if kubelettypes.IsCriticalPod(pod) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I commented in the KEP https://github.com/kubernetes/enhancements/pull/2001/files#r565507971 but I don't think this is appropriate. It forces an administrator deploying a system infrastructure pod to use one of these two priority classes to get this behavior, which means you can't create new priority classes for your critical infrastructure pods to control the order of eviction.

Also, I expect all static pods to be covered by this logic, because it is impossible to drain a static pod correctly outside the kubelet (since kubelet controls the lifecycle, not an outside entity).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kube-apiserver depends on SDN (for aggregation and for webhooks). If both are implemented as static pods, and SDN is able to go through a LB to reach an apiserver on another node, how can we make sure SDN stays up longer than the static kube-apiserver pod?

Copy link
Member Author

@bobbypage bobbypage Jan 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see thread on KEP kubernetes/enhancements#2001 (comment) regarding this discussion

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/dependency Issues or PRs related to dependency changes area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Kubelet should handle node shutdown