
restarting a kubelet should never affect the running workload #123980

Open
davshen opened this issue Mar 19, 2024 · 27 comments · May be fixed by #124367
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@davshen

davshen commented Mar 19, 2024

What happened?

1. A node label changed.
The node label changed because an operations engineer tidied up the node's labels, e.g. removed a label such as hpc=true. When a pod has affinity to this label, restarting the node's kubelet causes the pod to be rebuilt. This should not happen, as it disrupts the normal operation of the business.
2.#123971
3.#123816
In all three scenarios above, restarting the kubelet causes the pod to be evicted. This is obviously not expected, and it is an undesirable result for online business workloads.

What did you expect to happen?

Restarting a kubelet should not cause any disruption to the running workload (which likely means skipping admission for already-running pods, but let's not get ahead of ourselves). Backlinking related cases to this issue is probably still the best way forward.

How can we reproduce it (as minimally and precisely as possible)?

Set up any of the scenarios described above (for example, remove a node label that a running pod's affinity requires), then restart the kubelet.

Anything else we need to know?

No response

Kubernetes version

Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.17", GitCommit:"a7736eaf34d823d7652415337ac0ad06db9167fc", GitTreeState:"clean", BuildDate:"2022-12-08T11:47:36Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.17", GitCommit:"a7736eaf34d823d7652415337ac0ad06db9167fc", GitTreeState:"clean", BuildDate:"2022-12-08T11:42:04Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@davshen davshen added the kind/bug Categorizes issue or PR as related to a bug. label Mar 19, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 19, 2024
@davshen
Author

davshen commented Mar 19, 2024

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 19, 2024
@davshen
Author

davshen commented Mar 19, 2024

#123859

@davshen
Author

davshen commented Mar 19, 2024

I think that when admission fails, the kubelet should check the pod's creation time and current status: if the pod is currently in the Running state and was created before the kubelet started, it should skip rejecting the pod.

@bart0sh bart0sh added this to Triage in SIG Node CI/Test Board Mar 19, 2024
@ffromani
Contributor

/triage accepted
/priority important-longterm

We are aware of cases in which a kubelet restart perturbs the workload when it shouldn't. We will collect those cases in this issue.
I'm torn between backlog and important-longterm. Setting the latter because otherwise this issue will likely slip through the cracks.

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 20, 2024
@ffromani
Contributor

I think that when admission fails, the kubelet should check the pod's creation time and current status: if the pod is currently in the Running state and was created before the kubelet started, it should skip rejecting the pod.

I tend to agree but with more emphasis on the information from the runtime. We probably need to integrate the reported runtime state into the pod creation loop, which is likely a nontrivial effort.

@ffromani
Contributor

/cc @bobbypage @smarterclayton

@davshen
Author

davshen commented Mar 20, 2024

#118559 is another case: "Running pods with devices are terminated if kubelet is restarted".
The fix proposed there was #119432.

@Mihawk-lf

Mihawk-lf commented Mar 22, 2024

How long does the kubelet restart take?

In my testing this works normally.
If the kubelet restart takes too long, the pods will be evicted, and the pods pinned to this node will be restarted.

@davshen
Author

davshen commented Mar 24, 2024

How long does the kubelet restart take?

In my testing this works normally. If the kubelet restart takes too long, the pods will be evicted, and the pods pinned to this node will be restarted.

You need to label the node first, e.g. hpc=true, then create a pod with affinity to this label. After it is created successfully, remove the hpc=true label and restart the kubelet; a "pod failed" error will appear.

@pacoxu
Member

pacoxu commented Apr 18, 2024

I tend to think this is designed behavior.

If the node label changed and the pod's affinity no longer matches, the pod can be rejected during a kubelet restart (especially when the node label is set by kubelet flags).

--node-labels <key=value pairs>
<Warning: Alpha feature>Labels to add when registering the node in the cluster. Labels must be key=value pairs separated by ','. Labels in the 'kubernetes.io' namespace must begin with an allowed prefix ('kubelet.kubernetes.io', 'node.kubernetes.io') or be in the specifically allowed set ('beta.kubernetes.io/arch', 'beta.kubernetes.io/instance-type', 'beta.kubernetes.io/os', 'failure-domain.beta.kubernetes.io/region', 'failure-domain.beta.kubernetes.io/zone', 'kubernetes.io/arch', 'kubernetes.io/hostname', 'kubernetes.io/os', 'node.kubernetes.io/instance-type', 'topology.kubernetes.io/region', 'topology.kubernetes.io/zone')

/cc @yujuhong @dchen1107

@ffromani
Contributor

After some extra conversations, I think we can narrow the scope down quite significantly. "Restarting never affects the running workload" is likely too broad. There are legitimate cases in which a kubelet restart should perturb the running workload. A simple example is when --reserved-cpus changes value.

But I still very much believe that if the resource availability does not change (i.e. neither the configuration nor the hardware changes), then a kubelet restart should not perturb the running Guaranteed QoS workload, which is the only workload that can get exclusive resource assignment.
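For context, the Guaranteed QoS class mentioned here requires every container in the pod to set resource requests equal to limits. A simplified sketch of that classification follows; the `isGuaranteed` helper, the integer units, and the two-resource model are assumptions made for illustration (the real logic lives in the kubelet's QoS helpers and handles more resource types and edge cases).

```go
package main

import "fmt"

// Resources is a simplified view of one container's CPU and memory
// requests and limits, in arbitrary integer units.
type Resources struct {
	CPURequest, CPULimit int64
	MemRequest, MemLimit int64
}

// isGuaranteed sketches the Guaranteed QoS rule: every container
// must set requests, and requests must equal limits, for both
// CPU and memory.
func isGuaranteed(containers []Resources) bool {
	for _, c := range containers {
		if c.CPURequest == 0 || c.MemRequest == 0 {
			return false // missing requests: BestEffort or Burstable
		}
		if c.CPURequest != c.CPULimit || c.MemRequest != c.MemLimit {
			return false // requests != limits: Burstable
		}
	}
	return true
}

func main() {
	guaranteed := []Resources{{CPURequest: 2, CPULimit: 2, MemRequest: 1024, MemLimit: 1024}}
	burstable := []Resources{{CPURequest: 1, CPULimit: 2, MemRequest: 512, MemLimit: 1024}}
	fmt.Println(isGuaranteed(guaranteed)) // true
	fmt.Println(isGuaranteed(burstable))  // false
}
```

Only pods in this class can receive exclusive CPU or device assignment, which is why they are the focus of the "resources unchanged" argument above.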

@yujuhong
Contributor

I agree that it's worth revisiting the pod admission logic in kubelet, but the particular case of rejecting pods due to changed labels seems to be working as intended. I don't think I fully understand the use case of this. Maybe someone can help elaborate more?

@Homura222

Homura222 commented Apr 19, 2024

I tend to think this is designed behavior.

If the node label changed and the pod's affinity no longer matches, the pod can be rejected during a kubelet restart (especially when the node label is set by kubelet flags).

--node-labels <key=value pairs>
<Warning: Alpha feature>Labels to add when registering the node in the cluster. Labels must be key=value pairs separated by ','. Labels in the 'kubernetes.io' namespace must begin with an allowed prefix ('kubelet.kubernetes.io', 'node.kubernetes.io') or be in the specifically allowed set ('beta.kubernetes.io/arch', 'beta.kubernetes.io/instance-type', 'beta.kubernetes.io/os', 'failure-domain.beta.kubernetes.io/region', 'failure-domain.beta.kubernetes.io/zone', 'kubernetes.io/arch', 'kubernetes.io/hostname', 'kubernetes.io/os', 'node.kubernetes.io/instance-type', 'topology.kubernetes.io/region', 'topology.kubernetes.io/zone')

/cc @yujuhong @dchen1107

I believe that if this behavior is by design, then the pod should be killed when the node label changes and the pod's affinity no longer matches, rather than waiting until the kubelet restarts to reject the pod.

@gjkim42
Member

gjkim42 commented Apr 19, 2024

I don't think I fully understand the use case of this. Maybe someone can help elaborate more?

+1


If this issue is all about node affinity,
the kubelet should not check node affinity or pod affinity at all; those should be considered only by the scheduler.

refs:

Note: In the preceding types, IgnoredDuringExecution means that if the node labels change after Kubernetes schedules the Pod, the Pod continues to run.

@chengjoey
Contributor

I agree that it's worth revisiting the pod admission logic in kubelet, but the particular case of rejecting pods due to changed labels seems to be working as intended. I don't think I fully understand the use case of this. Maybe someone can help elaborate more?

For example, some pods have an affinity set: kubernetes.io/region In asia. These pods run normally on the asia nodes. Then the kubernetes.io/region label is removed from those nodes; after the removal, the pods on them keep running normally. However, after the kubelet on these nodes is restarted, the normally running pods are killed and new nodes that meet the requirements are selected.

Either the kubelet should kill these pods as soon as the asia nodes' label is removed, or it should let them continue to run instead of rescheduling them after a kubelet restart.

@gjkim42
Member

gjkim42 commented Apr 19, 2024

However, after restarting the kubelet on these nodes, the pods running normally are killed and new nodes that meet the requirements are selected.

I think this is the bug. The kubelet should let them continue to run after its restart, as node affinity should be considered only by the kube-scheduler.

@aojea
Member

aojea commented Apr 19, 2024

This has a lot of edge cases with a very large blast radius. The last time we made a small change to the lifecycle of Pods, it took several releases to stabilize again. I think this requires a KEP to deeply discuss the scope, per @ffromani's comment in #123980 (comment), and, if accepted, to roll out this change behind a feature gate so we have the capacity to evaluate the impact in production and roll back if necessary.

@ffromani
Contributor

ffromani commented Apr 19, 2024

I agree with @aojea that a resolution would definitely need a KEP and close oversight. It's still very worthwhile to discuss the issue, which has merit. Perhaps it is time to split this issue further and differentiate between the labelling issue (where I tend to agree with @gjkim42) and the exclusive resource allocation issue.

@davshen
Author

davshen commented Apr 23, 2024

I agree that it's worth revisiting the pod admission logic in kubelet, but the particular case of rejecting pods due to changed labels seems to be working as intended. I don't think I fully understand the use case of this. Maybe someone can help elaborate more?

For example, I have many different types of nodes. Initially they were labeled with many kinds of labels, including hpc=true. Much later, the cluster administrator wanted to tidy up the labels of all nodes in the cluster and deleted the hpc=true label by mistake. At that point, restarting the kubelet evicts the pods with hpc=true affinity; no suitable node can be found, causing an online business outage.

@aojea
Member

aojea commented Apr 23, 2024

But the system should not compensate for human mistakes... what if the change was done on purpose and the admin wanted the pod to be evicted?

@davshen
Author

davshen commented Apr 23, 2024

I agree that it's worth revisiting the pod admission logic in kubelet, but the particular case of rejecting pods due to changed labels seems to be working as intended. I don't think I fully understand the use case of this. Maybe someone can help elaborate more?

For example, some pods have an affinity set: kubernetes.io/region In asia. These pods run normally on the asia nodes. Then the kubernetes.io/region label is removed from those nodes; after the removal, the pods on them keep running normally. However, after the kubelet on these nodes is restarted, the normally running pods are killed and new nodes that meet the requirements are selected.

Either the kubelet should kill these pods as soon as the asia nodes' label is removed, or it should let them continue to run instead of rescheduling them after a kubelet restart.

@aojea Perhaps the former behavior is more appropriate: the eviction should happen when the label is deleted, not when the kubelet is restarted. That is more in line with Kubernetes' declarative, desired-state design.

@gjkim42
Member

gjkim42 commented Apr 23, 2024

I think the issue of evicting pods based on node affinity should be separated from this issue.
The documented behavior is that such pods continue to run, so this looks like an obvious bug to me; at the very least we should update the documentation instead.

https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#node-affinity

[Screenshot of the IgnoredDuringExecution note from the node-affinity documentation]

@AllenXu93

I think the point is that restarting the kubelet should not affect any running workload.
I have met a similar scenario: when restarting the kubelet, some running pods failed due to an init-container NUMA calculation error.
#123816 (comment)
Neither this case nor a node label change should make pods fail.

@aojea
Member

aojea commented Apr 23, 2024

I think the point is that restarting the kubelet should not affect any running workload

What do we mean by "restart"? Killing the kubelet process to start a new one, no? Does a "restart" have a bounded duration, or can it take an undetermined amount of time?
What if I stop the kubelet, do some maintenance, and start it again? Is that considered a "restart"?
What if something has changed on the node that impacts the workloads?

@AllenXu93

I think the point is that restarting the kubelet should not affect any running workload

What do we mean by "restart"? Killing the kubelet process to start a new one, no? Does a "restart" have a bounded duration, or can it take an undetermined amount of time? What if I stop the kubelet, do some maintenance, and start it again? Is that considered a "restart"? What if something has changed on the node that impacts the workloads?

You are right. Maybe we can add a feature gate; the operator should decide whether a given restart needs to evict pods.
For the first type of restart (killing the kubelet process to start a new one), the operator can turn on the feature gate to make sure the operation is safe.

@gjkim42
Member

gjkim42 commented Apr 28, 2024

I think the issue of evicting pods based on node affinity should be separated from this issue. The documented behavior is that such pods continue to run, so this looks like an obvious bug to me; at the very least we should update the documentation instead.

https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#node-affinity

[Screenshot of the IgnoredDuringExecution note from the node-affinity documentation]

opened an issue for this.

@rmohr
Contributor

rmohr commented May 15, 2024

but the system should not compensate for human mistakes ...

@aojea I think this is an issue for long-running operations. Examples would be AI jobs.

With the current implementation, the contract is basically: every cluster admin has to ensure that pods are safely drained or done executing before the kubelet gets restarted, if a restart is planned (and a restart can even happen involuntarily, e.g. via NPD). So relabeling nodes would become a super-hard task, and not necessarily because of mistakes.

what if the change was done on purpose and the admin wanted the pod to be evicted?

Relying on a kubelet restart to make this happen does not sound like something anyone is using today. To do this properly, there would have to be both RequiredDuringSchedulingIgnoredDuringExecution and RequiredDuringSchedulingRequiredDuringExecution?
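A hypothetical RequiredDuringSchedulingRequiredDuringExecution check would re-evaluate a pod's required node-selector terms against the node's current labels whenever they change, roughly as in the following sketch. The `Requirement` type and `stillSatisfied` helper are invented for illustration and cover only the In and Exists operators; the real nodeSelector supports more.

```go
package main

import "fmt"

// Requirement is a simplified nodeSelector requirement:
// "key In values", or "key Exists" when Values is empty.
type Requirement struct {
	Key    string
	Values []string
}

// stillSatisfied re-evaluates required terms against the node's
// current labels; a false result would mark the pod for eviction
// under a RequiredDuringExecution policy.
func stillSatisfied(nodeLabels map[string]string, reqs []Requirement) bool {
	for _, r := range reqs {
		v, ok := nodeLabels[r.Key]
		if !ok {
			return false // label removed: requirement broken
		}
		if len(r.Values) == 0 {
			continue // Exists operator is satisfied
		}
		match := false
		for _, want := range r.Values {
			if v == want {
				match = true
				break
			}
		}
		if !match {
			return false
		}
	}
	return true
}

func main() {
	reqs := []Requirement{{Key: "hpc", Values: []string{"true"}}}
	fmt.Println(stillSatisfied(map[string]string{"hpc": "true"}, reqs)) // true
	fmt.Println(stillSatisfied(map[string]string{}, reqs))             // false
}
```

Driving eviction from this kind of label-change evaluation, rather than from an incidental kubelet restart, is the behavior several commenters above argue for.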
