
Add node shutdown KEP #2001

Merged 3 commits into kubernetes:master on Oct 2, 2020

Conversation

@bobbypage (Member) commented Sep 21, 2020

Enhancement issue: #2000

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Sep 21, 2020
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. labels Sep 21, 2020
@bobbypage bobbypage force-pushed the node-shutdown-kep branch 2 times, most recently from 7f8e11e to fa57c03 Compare September 21, 2020 21:49
@bobbypage bobbypage force-pushed the node-shutdown-kep branch 4 times, most recently from 9112cdf to 7dde60e Compare September 21, 2020 22:37
@karan commented Sep 22, 2020

/cc @karan

/cc @SergeyKanzhelev

@derekwaynecarr (Member)

@bobbypage @mrunalp thanks for putting this together!

I have no major issues with the proposal as-is other than mechanical questions about testing.

I am happy if we want to merge now and iterate on just that section of the KEP, as the other parts of the KEP all lgtm.

I will let @dchen1107 take a final pass.

/assign @dchen1107

@bobbypage (Member, Author)

Thanks @derekwaynecarr for taking a look! I'll follow up with @dchen1107 for a final pass. I put this KEP on the agenda for the upcoming SIG Node meeting; @mrunalp and I are happy to discuss it in more detail there.

@dchen1107 (Member)

We discussed this KEP at today's SIG Node meeting. A couple more pieces of feedback were raised at the meeting:

  • Suggested by @marosset: Windows has similar APIs to register for shutdown events and delay shutdown. I'd love to see Windows support added here as well, but I'm not sure we'll have resources to work on it at the same time as the Linux implementation. I'll add links to the Windows APIs to the doc PR, though.

We discussed it briefly; we should include Windows Node support in the KEP, but it is not an alpha blocker.

  • Suggested by me, in addition to the two issues I raised above (1) the 2s for critical system pods; 2) resetting the node config through the kubelet's config): instead of introducing a new taint and a new node condition, the kubelet could report node status by marking the node NotReady with a detail message such as "the node is shutting down".
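
As a rough illustration of that last suggestion, a kubelet reporting this state could populate the existing Ready condition along these lines (a minimal sketch; the Reason and Message strings are placeholders, not a settled API):

```go
package nodeshutdown

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// shutdownReadyCondition builds a Ready=False node condition that a kubelet
// could report while the node is shutting down, instead of adding a new
// taint or a new condition type.
func shutdownReadyCondition() v1.NodeCondition {
	return v1.NodeCondition{
		Type:               v1.NodeReady,
		Status:             v1.ConditionFalse,
		Reason:             "NodeShutdown",           // placeholder reason
		Message:            "node is shutting down",  // placeholder message
		LastTransitionTime: metav1.Now(),
	}
}
```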

@dchen1107 (Member)

@bobbypage I approved your KEP for now. Please address all the comments above and ping me for another review.

/approve

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 30, 2020
* Don’t handle node shutdown events at all, and have users drain nodes before
shutting them down.
* This is not always possible, for example if the shutdown is controlled by
some external system (e.g. Preemptible VMs).
Review comment (Member):

Does "Preemptible VMs" in all cloud providers trigger the shutdown event on termination?

Review comment (Member):

(assuming the OS for the VM is running a compatible systemd version)

@bobbypage (Member, Author) Sep 30, 2020:

That's a good question -- every cloud provider will eventually have to shut down the VM and terminate it, so at some point the VM shutdown event should be sent.

For example, on GCE, when a Preemptible VM is terminated it gets 30 seconds to shut down, and the shutdown event is delivered at t-30s, where t is when it will be forcibly shut down. On AWS spot instances, the period is 2 minutes, but I'm unclear whether the shutdown event is delivered at t-2min or at t itself.

In addition to the systemd shutdown event, each cloud provider usually has a metadata server local to the VM that can be polled for preemption events. If a specific cloud provider doesn't actually trigger a shutdown before the VM is preempted, one workaround would be to deploy a cloud-specific daemonset that polls the VM's metadata server for preemption events and, upon receiving a termination notice, simply triggers a node shutdown, i.e., systemctl poweroff, which will initiate kubelet graceful shutdown.
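
To make that workaround concrete, here is a minimal sketch of such a daemonset's main loop, using GCE's preemption metadata endpoint as an example; the endpoint, polling interval, and overall flow are illustrative assumptions, not part of the KEP:

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os/exec"
	"strings"
	"time"
)

// GCE exposes a per-instance preemption flag on the local metadata server.
const preemptedURL = "http://metadata.google.internal/computeMetadata/v1/instance/preempted"

// preempted polls the metadata server and reports whether a preemption
// notice has been delivered to this VM.
func preempted() bool {
	req, err := http.NewRequest(http.MethodGet, preemptedURL, nil)
	if err != nil {
		return false
	}
	req.Header.Set("Metadata-Flavor", "Google")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return false
	}
	return strings.TrimSpace(string(body)) == "TRUE"
}

func main() {
	for range time.Tick(5 * time.Second) {
		if !preempted() {
			continue
		}
		log.Println("preemption notice received; initiating node shutdown")
		// Starting a normal systemd shutdown lets the kubelet's graceful
		// node shutdown logic run via its systemd inhibitor lock.
		if err := exec.Command("systemctl", "poweroff").Run(); err != nil {
			log.Printf("failed to trigger shutdown: %v", err)
		}
		return
	}
}
```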

@bobbypage (Member, Author)

Thanks @dchen1107, @derekwaynecarr, and the rest of the folks from SIG Node today for your feedback on this proposal. I will follow up to address the comments here and would like to merge this and mark it implementable, so we can get the KEP in before the enhancements freeze for 1.20, which is October 6.

@kikisdeliveryservice (Member) left a comment

> I will follow up to address the comments here and would like to merge this and mark it implementable, so we can get the KEP in before the enhancements freeze for 1.20, which is October 6.

Hi there! If you want this in 1.20, you need to: update the related issue to get it into the milestone, add graduation criteria (alpha, beta, etc.), and mark this as implementable.

@rata (Member) left a comment

@bobbypage Thanks for the KEP, LGTM. Added some simple comments and suggestions, but nothing major :)

* Change ready status to false during node shutdown
* Add note about new KubeletConfig option,
`ShutdownGracePeriodCriticalPods`, to configure shutdown
gracePeriod for critical pods
* Update status to implementable
@bobbypage (Member, Author)

/retest

@bobbypage (Member, Author) commented Oct 2, 2020

I've updated the KEP based on the feedback so far (changed it to use the Ready status and added an option to configure the grace period for critical pods), as mentioned in #2001 (comment).

I've also updated the KEP and the corresponding enhancement issue (#2000) to implementable status targeting 1.20, as discussed during the SIG Node meeting. Please let me know if there are any other concerns.

Pinging @dchen1107 for final approval.
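
For readers following along, a minimal sketch of how the two knobs mentioned above could look as kubelet configuration fields; the field names follow the KEP and the commit message, but the exact names, types, and defaults in the eventual KubeletConfiguration may differ:

```go
package config

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// GracefulShutdownSettings sketches the configuration surface discussed in
// this thread; it is not the actual KubeletConfiguration type.
type GracefulShutdownSettings struct {
	// ShutdownGracePeriod is the total time the node reserves for gracefully
	// terminating pods during a node shutdown.
	ShutdownGracePeriod metav1.Duration `json:"shutdownGracePeriod,omitempty"`
	// ShutdownGracePeriodCriticalPods is the portion of ShutdownGracePeriod
	// reserved for terminating critical pods after regular pods have been
	// stopped.
	ShutdownGracePeriodCriticalPods metav1.Duration `json:"shutdownGracePeriodCriticalPods,omitempty"`
}
```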

@kikisdeliveryservice (Member) left a comment

Noted 2 things from an enhancements team POV

Review comment on keps/sig-node/2000-graceful-node-shutdown/kep.yaml (outdated, resolved)
@bobbypage (Member, Author)

/retest

@dchen1107 (Member)

@bobbypage thanks for addressing our comments except the Windows-specific ones. Mark agreed to send follow-up PRs to the KEP later.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 2, 2020
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bobbypage, dchen1107

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit cd1f616 into kubernetes:master Oct 2, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.20 milestone Oct 2, 2020
“critical system pods”, and regular pods. Critical system pods should be
terminated last, because for example, if the logging pod is terminated first,
logs from the other workloads will not be captured. Critical system pods are
identified as those that are in the `system-cluster-critical` or
Review comment (Contributor):

I missed this when it came out, but it is super concerning to me. I don't think these priority classes are "special" in a way that the Kubelet should hardcode their use. By hardcoding them, we FORCE workloads that interact with the kubelet or other system infra pods to be in these two priority classes, which breaks the orthogonality of scheduling and resource behavior from the kubelet.

Either there needs to be something that selects which priority classes to treat "specially", or these need to be configurable at kubelet startup time. The former is more flexible; the latter may be more acceptable, but would not allow a service provider to let users keep that orthogonal.

Review comment (Contributor):

@bobbypage while reviewing the KEP (I was reading through it and thinking about the implications of the intersection with grace period when I noticed this), I think this has to be addressed before we go to beta.

@bobbypage (Member, Author) Jan 27, 2021:

Hi @smarterclayton

Thanks so much for providing your feedback and comments.

@mrunalp and I discussed this topic at length as part of the KEP design, and we decided to use the system-cluster-critical and system-node-critical priority classes to separate "core system workloads" (e.g., logging daemonsets) from regular pods, and to use that information to determine shutdown ordering.

Unfortunately, as I'm sure you're aware, there is no existing declarative mechanism to describe pod shutdown ordering, so we decided to use pod priority as a signal instead. This is similar to, for example, pod admission/preemption and OOM score adjustment, which also use IsCriticalPod() as a signal today.

@mrunalp and I definitely agree this is not perfect, and we are happy to discuss alternative ideas on how to improve it, and a potentially better signal to use for the pod shutdown ordering logic for beta moving forward.
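
As an illustration of the signal described above, here is a simplified sketch of how pods could be partitioned into the two shutdown phases; it approximates IsCriticalPod() by priority class name only and is not the kubelet's actual implementation:

```go
package nodeshutdown

import v1 "k8s.io/api/core/v1"

const (
	systemClusterCritical = "system-cluster-critical"
	systemNodeCritical    = "system-node-critical"
)

// isCriticalForShutdown approximates the IsCriticalPod() signal using only
// the priority class name; the real helper also considers priority values
// and mirror (static) pods.
func isCriticalForShutdown(pod *v1.Pod) bool {
	return pod.Spec.PriorityClassName == systemClusterCritical ||
		pod.Spec.PriorityClassName == systemNodeCritical
}

// splitByCriticality partitions pods into the two shutdown phases: regular
// pods are terminated first, critical system pods last.
func splitByCriticality(pods []*v1.Pod) (regular, critical []*v1.Pod) {
	for _, p := range pods {
		if isCriticalForShutdown(p) {
			critical = append(critical, p)
		} else {
			regular = append(regular, p)
		}
	}
	return regular, critical
}
```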

@bobbypage (Member, Author) commented Jan 29, 2021

We had a chat today with @smarterclayton, @mrunalp, and @SergeyKanzhelev regarding some of the questions in #2001 (comment).

The main item we discussed was the current design of having two shutdown phases (first shutting down user workloads, then "critical node system workloads") and the current pattern of using system-cluster-critical and system-node-critical to separate system workloads from user workloads.

Some notes from our discussion:

  • Currently, users who want to get their workload into the later "critical" shutdown phase need to set their pod spec's priority class to system-cluster-critical or system-node-critical. However, some pods may want to get into the "critical" shutdown phase but are using some other priority class.

    • Does it make sense to make the set of priority classes considered critical configurable, instead of hardcoding system-cluster-critical and system-node-critical as today?
  • Node graceful shutdown currently splits the total shutdown time into two phases: first user workloads, then critical pods.

    • Does it make sense to give users the ability to partition the shutdown time into more than the existing two phases?
      • Perhaps provide the ability to map a list of priority classes (or maybe a range of priorities) to a shutdown phase, with an associated amount of time for each phase?

@SergeyKanzhelev (Member)

I think one more small piece of feedback was to allow the leftover time from one phase to extend another, i.e., if user pods terminate very quickly, let critical pods use the rest of the time.
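
A small sketch of how that could work together with the two configured grace periods; the names and the leftover-rollover behavior are illustrative of the suggestion above, not the implemented design:

```go
package nodeshutdown

import "time"

// shutdownBudget mirrors the two configured durations: the total shutdown
// grace period and the slice of it reserved for critical pods.
type shutdownBudget struct {
	total    time.Duration // e.g. ShutdownGracePeriod
	critical time.Duration // e.g. ShutdownGracePeriodCriticalPods
}

// regularPhaseBudget is the time reserved for terminating regular pods.
func regularPhaseBudget(b shutdownBudget) time.Duration {
	if b.total <= b.critical {
		return 0
	}
	return b.total - b.critical
}

// criticalPhaseBudget adds any unused time from the regular phase to the
// critical phase, per the feedback above. regularElapsed is how long the
// regular phase actually took.
func criticalPhaseBudget(b shutdownBudget, regularElapsed time.Duration) time.Duration {
	leftover := regularPhaseBudget(b) - regularElapsed
	if leftover < 0 {
		leftover = 0
	}
	return b.critical + leftover
}
```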

@bobbypage (Member, Author)

To circle back on #2001 (comment) regarding supporting "custom" priority classes for node shutdown other than system-cluster-critical and system-node-critical: we discussed this point in SIG Node with @mrunalp and some other folks.

Ultimately, we decided that it's not clear how many users actually use custom priority classes and would want to partition the shutdown time per priority class, making it a bit of a niche requirement and more complicated.

We decided to proceed to beta with the current design. If we get more feedback or data showing that a configurable shutdown time per "custom" priority class is needed, we can always add that capability as a follow-up post-beta.
