
restarting a kubelet should never affect the running workload #123980

Open
davshen opened this issue Mar 19, 2024 · 27 comments · May be fixed by #124367
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@davshen

davshen commented Mar 19, 2024

What happened?

1. A node label changed.
The node label changed because an operations engineer tidied up the node's labels, e.g. removed a label such as hpc=true. When a pod has affinity to this label, restarting the node's kubelet causes the pod to be rebuilt. This should not happen, as it disrupts the normal operation of the business.
2.#123971
3.#123816
In all three scenarios above, restarting the kubelet causes the pod to be evicted. This is obviously not expected, and it is an undesirable result for online business workloads.

What did you expect to happen?

Restarting a kubelet should not cause any disruption to the running workload (which likely means skipping admission for already-running pods, but let's not get ahead of ourselves). Backlinking related cases to this issue is probably still the best way forward.

How can we reproduce it (as minimally and precisely as possible)?

Set up any of the scenarios described above (for example, remove a node label that a running pod's affinity requires), then restart the kubelet.

Anything else we need to know?

No response

Kubernetes version

Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.17", GitCommit:"a7736eaf34d823d7652415337ac0ad06db9167fc", GitTreeState:"clean", BuildDate:"2022-12-08T11:47:36Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.17", GitCommit:"a7736eaf34d823d7652415337ac0ad06db9167fc", GitTreeState:"clean", BuildDate:"2022-12-08T11:42:04Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@davshen davshen added the kind/bug Categorizes issue or PR as related to a bug. label Mar 19, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 19, 2024
@davshen
Author

davshen commented Mar 19, 2024

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 19, 2024
@davshen
Author

davshen commented Mar 19, 2024

#123859

@davshen
Author

davshen commented Mar 19, 2024

I think that when admission fails, the kubelet should check the pod's creation time and current status: if the pod is currently in the Running state and was created before the kubelet started, it should skip rejecting the pod.

@bart0sh bart0sh added this to Triage in SIG Node CI/Test Board Mar 19, 2024
@ffromani
Contributor

/triage accepted
/priority important-longterm

We are aware of cases in which a kubelet restart perturbs the workload when it shouldn't. We will collect those cases in this issue.
I'm torn between backlog and important-longterm. Setting the latter because otherwise this issue will likely slip through the cracks.

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 20, 2024
@ffromani
Contributor

I think that when admission fails, the kubelet should check the pod's creation time and current status: if the pod is currently in the Running state and was created before the kubelet started, it should skip rejecting the pod.

I tend to agree but with more emphasis on the information from the runtime. We probably need to integrate the reported runtime state into the pod creation loop, which is likely a nontrivial effort.

@ffromani
Contributor

/cc @bobbypage @smarterclayton

@davshen
Author

davshen commented Mar 20, 2024

#118559 is another case: "Running pods with devices are terminated if kubelet is restarted".
The fix proposed there was #119432.

@Mihawk-lf

Mihawk-lf commented Mar 22, 2024

How long does the kubelet restart take?

In my testing this works normally.
If the kubelet restart takes too long, the pods will be evicted, and the pods pinned to this node will be restarted.

@davshen
Author

davshen commented Mar 24, 2024

How long does the kubelet restart take?

In my testing this works normally. If the kubelet restart takes too long, the pods will be evicted, and the pods pinned to this node will be restarted.

You need to label the node first, e.g. hpc=true, then create a pod with affinity to this label. After it is created successfully, remove the hpc=true label and restart the kubelet; a "pod failed" error will appear.

@pacoxu
Member

pacoxu commented Apr 18, 2024

I tend to think this is designed behavior.

If the node label changed and the pod's affinity no longer matches, the pod can be rejected during a kubelet restart (especially when the node label is set by kubelet flags).

--node-labels <key=value pairs>
<Warning: Alpha feature>Labels to add when registering the node in the cluster. Labels must be key=value pairs separated by ','. Labels in the 'kubernetes.io' namespace must begin with an allowed prefix ('kubelet.kubernetes.io', 'node.kubernetes.io') or be in the specifically allowed set ('beta.kubernetes.io/arch', 'beta.kubernetes.io/instance-type', 'beta.kubernetes.io/os', 'failure-domain.beta.kubernetes.io/region', 'failure-domain.beta.kubernetes.io/zone', 'kubernetes.io/arch', 'kubernetes.io/hostname', 'kubernetes.io/os', 'node.kubernetes.io/instance-type', 'topology.kubernetes.io/region', 'topology.kubernetes.io/zone')

/cc @yujuhong @dchen1107

@ffromani
Contributor

After some extra conversations, I think we can narrow the scope down quite significantly. "Restarting never affects the running workload" is likely too broad. There are legitimate cases in which a kubelet restart should perturb the running workload. A simple example is when --reserved-cpus changes value.

But I still very much believe that if the resource availability does not change (i.e. neither the configuration nor the hardware changes), then a kubelet restart should not perturb the running Guaranteed QoS workload, which is the only workload that can get exclusive resource assignment.
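For context, the Guaranteed QoS class mentioned here requires every container in the pod to set resource requests equal to limits. A simplified sketch of that classification follows; the `isGuaranteed` helper, the integer units, and the two-resource model are assumptions made for illustration (the real logic lives in the kubelet's QoS helpers and handles more resource types and edge cases).

```go
package main

import "fmt"

// Resources is a simplified view of one container's CPU and memory
// requests and limits, in arbitrary integer units.
type Resources struct {
	CPURequest, CPULimit int64
	MemRequest, MemLimit int64
}

// isGuaranteed sketches the Guaranteed QoS rule: every container
// must set requests, and requests must equal limits, for both
// CPU and memory.
func isGuaranteed(containers []Resources) bool {
	for _, c := range containers {
		if c.CPURequest == 0 || c.MemRequest == 0 {
			return false // missing requests: BestEffort or Burstable
		}
		if c.CPURequest != c.CPULimit || c.MemRequest != c.MemLimit {
			return false // requests != limits: Burstable
		}
	}
	return true
}

func main() {
	guaranteed := []Resources{{CPURequest: 2, CPULimit: 2, MemRequest: 1024, MemLimit: 1024}}
	burstable := []Resources{{CPURequest: 1, CPULimit: 2, MemRequest: 512, MemLimit: 1024}}
	fmt.Println(isGuaranteed(guaranteed)) // true
	fmt.Println(isGuaranteed(burstable))  // false
}
```

Only pods in this class can receive exclusive CPU or device assignment, which is why they are the focus of the "resources unchanged" argument above.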

@yujuhong
Contributor

I agree that it's worth revisiting the pod admission logic in kubelet, but the particular case of rejecting pods due to changed labels seems to be working as intended. I don't think I fully understand the use case of this. Maybe someone can help elaborate more?

@Homura222

Homura222 commented Apr 19, 2024

I tend to think this is designed behavior.

If the node label changed and the pod's affinity no longer matches, the pod can be rejected during a kubelet restart (especially when the node label is set by kubelet flags).

--node-labels <key=value pairs>
<Warning: Alpha feature>Labels to add when registering the node in the cluster. Labels must be key=value pairs separated by ','. Labels in the 'kubernetes.io' namespace must begin with an allowed prefix ('kubelet.kubernetes.io', 'node.kubernetes.io') or be in the specifically allowed set ('beta.kubernetes.io/arch', 'beta.kubernetes.io/instance-type', 'beta.kubernetes.io/os', 'failure-domain.beta.kubernetes.io/region', 'failure-domain.beta.kubernetes.io/zone', 'kubernetes.io/arch', 'kubernetes.io/hostname', 'kubernetes.io/os', 'node.kubernetes.io/instance-type', 'topology.kubernetes.io/region', 'topology.kubernetes.io/zone')

/cc @yujuhong @dchen1107

I believe that if this behavior is by design, then the pod should be killed when the node label changes and the pod's affinity no longer matches, rather than waiting until the kubelet restarts to reject the pod.

@gjkim42
Member

gjkim42 commented Apr 19, 2024

I don't think I fully understand the use case of this. Maybe someone can help elaborate more?

+1


If this issue is all about node affinity,
the kubelet should not check node affinity or pod affinity at all; those should be considered only by the scheduler.

refs:

Note: In the preceding types, IgnoredDuringExecution means that if the node labels change after Kubernetes schedules the Pod, the Pod continues to run.

@chengjoey
Contributor

I agree that it's worth revisiting the pod admission logic in kubelet, but the particular case of rejecting pods due to changed labels seems to be working as intended. I don't think I fully understand the use case of this. Maybe someone can help elaborate more?

For example, some pods have an affinity set: kubernetes.io/region In asia. These pods run normally on the asia nodes. Then the kubernetes.io/region label is removed from those nodes; after the removal, the pods on them keep running normally. However, after the kubelet on these nodes is restarted, the normally running pods are killed and new nodes that meet the requirements are selected.

Either the kubelet should kill these pods as soon as the asia nodes' label is removed, or it should let them continue to run instead of rescheduling them after a kubelet restart.

@gjkim42
Member

gjkim42 commented Apr 19, 2024

However, after restarting the kubelet on these nodes, the pods running normally are killed and new nodes that meet the requirements are selected.

I think this is the bug. The kubelet should let them continue to run after its restart, as node affinity should be considered only by the kube-scheduler.

@aojea
Member

aojea commented Apr 19, 2024

This has a lot of edge cases with a very large blast radius. The last time we made a small change to the lifecycle of Pods, it took several releases to stabilize again. I think this requires a KEP to deeply discuss the scope, per @ffromani's comment in #123980 (comment), and, if accepted, to roll out this change behind a feature gate so we have the capacity to evaluate the impact in production and roll back if necessary.

@ffromani
Contributor

ffromani commented Apr 19, 2024

I agree with @aojea that a resolution would definitely need a KEP and close oversight. It's still very worthwhile to discuss the issue, which has merit. Perhaps it is time to split this issue further and differentiate between the labelling issue (where I tend to agree with @gjkim42) and the exclusive resource allocation issue.

@davshen
Author

davshen commented Apr 23, 2024

I agree that it's worth revisiting the pod admission logic in kubelet, but the particular case of rejecting pods due to changed labels seems to be working as intended. I don't think I fully understand the use case of this. Maybe someone can help elaborate more?

For example, I have many different types of nodes. Initially they were labeled with many kinds of labels, including hpc=true. Much later, the cluster administrator wanted to tidy up the labels of all nodes in the cluster and deleted the hpc=true label by mistake. At that point, restarting the kubelet evicts the pods with hpc=true affinity; no suitable node can be found, causing an online business outage.

@aojea
Member

aojea commented Apr 23, 2024

But the system should not compensate for human mistakes... what if the change was done on purpose and the admin wanted the pod to be evicted?

@davshen
Author

davshen commented Apr 23, 2024

I agree that it's worth revisiting the pod admission logic in kubelet, but the particular case of rejecting pods due to changed labels seems to be working as intended. I don't think I fully understand the use case of this. Maybe someone can help elaborate more?

For example, some pods have an affinity set: kubernetes.io/region In asia. These pods run normally on the asia nodes. Then the kubernetes.io/region label is removed from those nodes; after the removal, the pods on them keep running normally. However, after the kubelet on these nodes is restarted, the normally running pods are killed and new nodes that meet the requirements are selected.

Either the kubelet should kill these pods as soon as the asia nodes' label is removed, or it should let them continue to run instead of rescheduling them after a kubelet restart.

@aojea Perhaps the former behavior is more appropriate: the eviction should happen when the label is deleted, not when the kubelet is restarted. That is more in line with Kubernetes' declarative, desired-state design.

@gjkim42
Member

gjkim42 commented Apr 23, 2024

I think the issue of evicting pods based on node affinity should be separated from this issue.
The documented behavior is that such pods continue to run, so this looks like an obvious bug to me; at the very least we should update the documentation instead.

https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#node-affinity

[Screenshot of the IgnoredDuringExecution note from the node-affinity documentation]

@AllenXu93

I think the point is that restarting the kubelet should not affect any running workload.
I have met a similar scenario: when restarting the kubelet, some running pods failed due to an init-container NUMA calculation error.
#123816 (comment)
Neither this case nor a node label change should make pods fail.

@aojea
Member

aojea commented Apr 23, 2024

I think the point is that restarting the kubelet should not affect any running workload

What do we mean by "restart"? Killing the kubelet process to start a new one, no? Does a "restart" have a bounded duration, or can it take an undetermined amount of time?
What if I stop the kubelet, do some maintenance, and start it again? Is that considered a "restart"?
What if something has changed on the node that impacts the workloads?

@AllenXu93

I think the point is that restarting the kubelet should not affect any running workload

What do we mean by "restart"? Killing the kubelet process to start a new one, no? Does a "restart" have a bounded duration, or can it take an undetermined amount of time? What if I stop the kubelet, do some maintenance, and start it again? Is that considered a "restart"? What if something has changed on the node that impacts the workloads?

You are right. Maybe we can add a feature gate; the operator should decide whether a given restart needs to evict pods.
For the first type of restart (killing the kubelet process to start a new one), the operator can turn on the feature gate to make sure the operation is safe.

@gjkim42
Member

gjkim42 commented Apr 28, 2024

I think the issue of evicting pods based on node affinity should be separated from this issue. The documented behavior is that such pods continue to run, so this looks like an obvious bug to me; at the very least we should update the documentation instead.

https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#node-affinity

[Screenshot of the IgnoredDuringExecution note from the node-affinity documentation]

opened an issue for this.

@rmohr
Contributor

rmohr commented May 15, 2024

but the system should not compensate for human mistakes ...

@aojea I think this is an issue for long-running operations. Examples would be AI jobs.

With the current implementation, the contract is basically: every cluster admin has to ensure that pods are safely drained or done executing before the kubelet gets restarted, if a restart is planned (and a restart can even happen involuntarily, e.g. via NPD). So relabeling nodes would become a super-hard task, and not necessarily because of mistakes.

what if the change was done on purpose and the admin wanted the pod to be evicted?

Relying on a kubelet restart to make this happen does not sound like something anyone is using today. To do this properly, there would have to be both RequiredDuringSchedulingIgnoredDuringExecution and RequiredDuringSchedulingRequiredDuringExecution?
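A hypothetical RequiredDuringSchedulingRequiredDuringExecution check would re-evaluate a pod's required node-selector terms against the node's current labels whenever they change, roughly as in the following sketch. The `Requirement` type and `stillSatisfied` helper are invented for illustration and cover only the In and Exists operators; the real nodeSelector supports more.

```go
package main

import "fmt"

// Requirement is a simplified nodeSelector requirement:
// "key In values", or "key Exists" when Values is empty.
type Requirement struct {
	Key    string
	Values []string
}

// stillSatisfied re-evaluates required terms against the node's
// current labels; a false result would mark the pod for eviction
// under a RequiredDuringExecution policy.
func stillSatisfied(nodeLabels map[string]string, reqs []Requirement) bool {
	for _, r := range reqs {
		v, ok := nodeLabels[r.Key]
		if !ok {
			return false // label removed: requirement broken
		}
		if len(r.Values) == 0 {
			continue // Exists operator is satisfied
		}
		match := false
		for _, want := range r.Values {
			if v == want {
				match = true
				break
			}
		}
		if !match {
			return false
		}
	}
	return true
}

func main() {
	reqs := []Requirement{{Key: "hpc", Values: []string{"true"}}}
	fmt.Println(stillSatisfied(map[string]string{"hpc": "true"}, reqs)) // true
	fmt.Println(stillSatisfied(map[string]string{}, reqs))             // false
}
```

Driving eviction from this kind of label-change evaluation, rather than from an incidental kubelet restart, is the behavior several commenters above argue for.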
