
KEP-3085: promote to beta in 1.29 #4139

Merged: 1 commit into kubernetes:master on Oct 4, 2023

Conversation

kannon92
Contributor

  • One-line PR description: Promote KEP 3085 to beta in 1.29.
  • Other comments:

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 28, 2023
@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Jul 28, 2023
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jul 28, 2023
@bart0sh bart0sh added this to Triage in SIG Node PR Triage Jul 31, 2023
@bart0sh
Contributor

bart0sh commented Jul 31, 2023

/lgtm
/assign @mrunalp @dchen1107 @derekwaynecarr

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 31, 2023
@bart0sh bart0sh moved this from Triage to Needs Approver in SIG Node PR Triage Jul 31, 2023
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Aug 1, 2023
@kannon92
Contributor Author

/assign @ddebroy

@SergeyKanzhelev SergeyKanzhelev moved this from Needs Approver to Waiting on Author in SIG Node PR Triage Sep 15, 2023
@kannon92 kannon92 moved this from Waiting on Author to Needs Approver in SIG Node PR Triage Sep 21, 2023
@kannon92
Contributor Author

/cc @rphillips @haircommander

@BenTheElder
Member

/assign

@BenTheElder BenTheElder removed their assignment Sep 29, 2023
@jeremyrickard
Contributor

The PRR for this needs to be updated a little bit for beta. There is one section defined under scalability that needs to be added to the KEP (https://github.com/kubernetes/enhancements/blob/master/keps/NNNN-kep-template/README.md#can-enabling--using-this-feature-result-in-resource-exhaustion-of-some-node-resources-pids-sockets-inodes-etc).

While reviewing the previous PRR template, I had a few questions. I can't comment inline since you didn't update the PRR template, so I'll ask below:

How can a rollout or rollback fail? Can it impact already running workloads?

This flag is only relevant for the Kubelet. Therefore, the new condition will be
reported for pods scheduled on nodes that have the feature enabled.
A controller or service that consumes the new pod condition should be enabled
only after rollout of the new condition has succeeded on all nodes. Similarly,
the controller or service that consumes the new pod condition should be disabled
before the rollback. This helps prevent a controller/service consuming the
condition getting the data from pods running in a subset of the nodes in the
middle of a rollout or rollback.

What happens if someone doesn't do the "should" statements above? When controllers that consume the condition get data in the middle of a rollout or rollback, is there a way to detect that, or anything that helps determine it?

What specific metrics should inform a rollback?

A sharp increase in the number of PATCH requests to the API Server from Kubelets
after enabling this feature is a sign of a potential problem and can inform a
rollback. A cluster operator may monitor

apiserver_request_total{verb="PATCH", resource="pods", subresource="status"}

for this.
This may be the case in clusters that use a special runtime environment like
microVM/Kata, where the sandbox may crash repeatedly (without ever getting a
chance to start containers) resulting in lots of potential updates due to the
new condition "flapping". However, in such environments, this may already be the
case with existing pod conditions like ContainersReady and Ready (unless the
sandbox environment/VM crashes very early before a single container is run).
Batching of pod status updates from the Kubelet status manager will also help.

Could we better define a sharp increase? This seems like a bad metric for those other environments. For people operating clusters with those special runtime environments, is there any way to know that this feature is functioning properly?

How can someone using this feature know that it is working for their instance?

The check of PodReadyToStartContainers seems good for the case where things are working correctly. If this is not set, is this only because there are failures updating etcd? The troubleshooting section below doesn't call out any additional failure modes; are there any other failure modes that were identified while this was in the alpha state?

@kannon92
Contributor Author

kannon92 commented Oct 2, 2023

The PRR for this needs to be updated a little bit for beta. There is one section defined under scalability that needs to be added to the KEP (https://github.com/kubernetes/enhancements/blob/master/keps/NNNN-kep-template/README.md#can-enabling--using-this-feature-result-in-resource-exhaustion-of-some-node-resources-pids-sockets-inodes-etc).

Ah. I will add that.

While reviewing the previous PRR template, I had a few questions. I can't comment inline since you didn't update the PRR template, so I'll ask below:

What happens if someone doesn't do the "should" statements above? When controllers that consume the condition get data in the middle of a rollout or rollback, is there a way to detect that, or anything that helps determine it?

I'll add an update for this. Basically, new pods will get this condition if the feature gate is on. If the feature gate is turned off, the condition will no longer be updated.

If a service/controller relies on this condition and it doesn't exist or is not being updated, then it will most likely not report anything related to this condition.
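As a hedged illustration of how a consumer might notice a frozen condition after the gate is turned off (the pod name is a placeholder, and the timestamp heuristic is an assumption, not something the KEP prescribes): a lastTransitionTime that stops advancing while the pod's other conditions keep changing suggests the Kubelet is no longer updating it.

  # Print the condition's status and when it last changed; the pod name
  # "my-pod" is a placeholder.
  kubectl get pod my-pod -o jsonpath='{range .status.conditions[?(@.type=="PodReadyToStartContainers")]}{.status}{" "}{.lastTransitionTime}{"\n"}{end}'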

What specific metrics should inform a rollback?

A sharp increase in the number of PATCH requests to the API Server from Kubelets
after enabling this feature is a sign of a potential problem and can inform a
rollback. A cluster operator may monitor

apiserver_request_total{verb="PATCH", resource="pods", subresource="status"}

for this.
This may be the case in clusters that use a special runtime environment like
microVM/Kata, where the sandbox may crash repeatedly (without ever getting a
chance to start containers) resulting in lots of potential updates due to the
new condition "flapping". However, in such environments, this may already be the
case with existing pod conditions like ContainersReady and Ready (unless the
sandbox environment/VM crashes very early before a single container is run).
Batching of pod status updates from the Kubelet status manager will also help.

Could we better define a sharp increase? This seems like a bad metric for those other environments. For people operating clusters with those special runtime environments, is there any way to know that this feature is functioning properly?

I don't know the answer to this one. I think it depends on the baseline for this metric in their cluster. The main idea of this question/metric is to flag that there could be far more patches than expected. On a chatty cluster, I could see the baseline being large. So I'd expect problems in cases where the sandbox condition is repeatedly flipping between true and false.
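To make "sharp increase" slightly more concrete, a hedged sketch of a query an operator could watch (assuming Prometheus scrapes the API server; the 5m window and the comparison against a pre-enablement baseline are illustrative, not prescribed by the KEP):

  # Per-second rate of pod status PATCHes, averaged over 5 minutes; watch for
  # a sustained jump relative to the rate observed before enabling the feature.
  sum(rate(apiserver_request_total{verb="PATCH", resource="pods", subresource="status"}[5m]))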

How can someone using this feature know that it is working for their instance?

The check of PodReadyToStartContainers seems good for the case where things are working correctly. If this is not set, is this only because there are failures updating etcd? The troubleshooting section below doesn't call out any additional failure modes; are there any other failure modes that were identified while this was in the alpha state?

Yes. We didn't see any other failures in the alpha state.
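As a concrete spot check, a minimal sketch of verifying the condition is being set on a node with the feature enabled (the pod name is a placeholder):

  # With the feature gate enabled, a healthy pod should report "True" here;
  # an empty result on a gate-enabled node suggests the feature is not active.
  kubectl get pod my-pod -o jsonpath='{.status.conditions[?(@.type=="PodReadyToStartContainers")].status}'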

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 2, 2023
@haircommander
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 2, 2023
Member

@SergeyKanzhelev SergeyKanzhelev left a comment


/lgtm
/approve

@kannon92
Contributor Author

kannon92 commented Oct 3, 2023

@jeremyrickard You bring up an interesting point about mid rollouts.

If a user disables a feature, should we remove the condition entirely from the pod?

I can add that as something we address in beta if need be. I hope this doesn't block the promotion.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 3, 2023
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 3, 2023
@kannon92 kannon92 force-pushed the 3085-beta-promotion branch 2 times, most recently from bfc3b53 to d4333ce on October 4, 2023 17:48
@k8s-ci-robot k8s-ci-robot added the sig/apps Categorizes an issue or PR as relevant to SIG Apps. label Oct 4, 2023
@johnbelamaric
Member

@jeremyrickard You bring up an interesting point about mid rollouts.

If a user disables a feature, should we remove the condition entirely from the pod?

I can add that as something we address in beta if need be. I hope this doesn't block the promotion.

So, you have already documented that the conditions will be "frozen" after disablement. We obviously do have a timestamp on conditions. We can also patch them manually. What I suggest is that you document how you would "clear" the conditions after disablement. We should not have special cleanup code that runs automatically on disablement (because it might be broken too!). Instead, just document how to do the cleanup. If it seems really necessary, a script or small client-side binary could be provided to do the cleanup.

Please do this in a follow-up PR (I won't hold up the merge for this).

/approve
/lgtm
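For reference, a minimal sketch of the manual cleanup johnbelamaric describes (assuming a kubectl recent enough to support --subresource; the pod name and the list index are placeholders, since the condition's position in .status.conditions varies):

  # Find the index of the stale PodReadyToStartContainers condition in
  # .status.conditions, then remove it; index 0 below is only a placeholder.
  kubectl patch pod my-pod --subresource=status --type=json \
    -p='[{"op":"remove","path":"/status/conditions/0"}]'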

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 4, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnbelamaric, kannon92, mrunalp, SergeyKanzhelev

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 4, 2023
@k8s-ci-robot k8s-ci-robot merged commit fd85a2f into kubernetes:master Oct 4, 2023
4 checks passed
SIG Node PR Triage automation moved this from Needs Approver to Done Oct 4, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.29 milestone Oct 4, 2023