restarting a kubelet should never affect the running workload #123980
Comments
/sig node
I think when admission fails, the kubelet should take the pod's creation time and current status into account. If the pod is currently running and was created before the kubelet started, skip rejecting the pod.
/triage accepted We are aware of cases in which a kubelet restart perturbs the workload when it shouldn't. We will collect those cases on this issue.
I tend to agree, but with more emphasis on the information from the runtime. We probably need to integrate the reported runtime state into the pod creation loop, which is likely a nontrivial effort.
How long does it take for the kubelet to restart? In my test everything looks normal.
You need to label the node first, e.g. hpc=true, and then create a pod whose node affinity matches this label. After the pod is created successfully, remove the hpc=true label and restart the kubelet; the pod then fails admission.
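For anyone trying to follow along, here is a minimal sketch of those steps, assuming a single node named node-1, the default namespace, and a systemd-managed kubelet; the node name, pod name, and label are placeholders:

```shell
# Label the node so the pod's node affinity can be satisfied.
kubectl label node node-1 hpc=true

# Create a pod that requires the hpc=true label via node affinity.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: hpc-test
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: hpc
            operator: In
            values: ["true"]
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
EOF

# Wait until the pod is running, then remove the label again.
kubectl wait --for=condition=Ready pod/hpc-test --timeout=120s
kubectl label node node-1 hpc-

# On the node itself: restart the kubelet, then watch the pod get rejected.
systemctl restart kubelet
kubectl get pod hpc-test -o wide
```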
I prefer to think this is designed behavior: if the node label is changed and the pod affinity no longer matches, the pod can be rejected during a kubelet restart (especially when the node label is changed by kubelet flags).
/cc @yujuhong @dchen1107
After some extra conversations, I think we can narrow this down quite significantly. "Restarting never affects the running workload" is likely too broad: there are legitimate cases in which a kubelet restart should perturb the running workload, a simple example being when the configuration (or the hardware) changes across the restart. But I still very much believe that if the resource availability does not change (= the configuration doesn't, the hardware doesn't), then yes, a kubelet restart should not perturb the running guaranteed QoS workload, which is the only workload that can get exclusive resource assignment.
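To make that last point concrete, this is the kind of workload meant here: a Guaranteed QoS pod (requests equal to limits for every container) with integer CPU requests, which is what can receive exclusive CPUs when the CPU manager static policy is enabled. A minimal sketch; the name, image, and resource sizes are only illustrative:

```shell
# A Guaranteed QoS pod: requests == limits for all containers, integer CPUs.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-example
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
    resources:
      requests:
        cpu: "2"
        memory: 1Gi
      limits:
        cpu: "2"
        memory: 1Gi
EOF
```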
I agree that it's worth revisiting the pod admission logic in kubelet, but the particular case of rejecting pods due to changed labels seems to be working as intended. I don't think I fully understand the use case here. Maybe someone can help elaborate?
I believe that if this behavior is by design, then the pod should be killed when the node label is changed and the pod affinity doesn't match, rather than waiting until the kubelet restarts to reject the pod.
+1 If this issue is all about node affinity, refs:
For example, some pods have node affinity set. Either the kubelet should kill these pods as soon as the node's label is removed, or it should let them continue to run, instead of rejecting them only after a kubelet restart.
I think this is the bug. The kubelet should let them continue to run after its restart, as node affinity should be considered only in the kube-scheduler.
This has a lot of edge cases with a very large blast radius. Last time we made a small change to the lifecycle of Pods, it took several releases to stabilize again. I think this requires a KEP to deeply discuss the scope, as per @ffromani's comment here #123980 (comment), and, if accepted, to roll out this change behind a feature gate so we have the capacity to evaluate the impact in production and roll back if necessary.
I agree with @aojea that a resolution would definitely need a KEP and close oversight. The issue is still well worth discussing, and it has merit. Perhaps it is time to further split this issue and differentiate between the labelling issue (where I tend to agree with @gjkim42) and the exclusive resource allocation issue.
For example, I have many different types of nodes, each carrying many kinds of labels, including hpc=true. After a long time, the cluster administrator wanted to tidy up the labels of all nodes in the cluster and mistakenly deleted the hpc=true label. At this point, if you restart the kubelet, the pod with hpc=true affinity is evicted, no suitable node can be found, and an online business outage follows.
but the system should not compensate for human mistakes ... what if the change was done on purpose and the admin wanted the pod to be evicted?
@aojea I think this behavior would be more appropriate: the eviction should not happen when the kubelet is restarted, but when the label is deleted. That is more in line with the desired-state-oriented design of Kubernetes.
I think the issue of evicting pods based on node affinity should be separated from this issue. https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#node-affinity
I think the point is that restarting the kubelet should not affect any running workload.
What do we mean by restart? It is killing the kubelet process to start a new one, no? Does a "restart" have a bounded duration, or can the time a restart takes be undetermined?
You are right. Maybe we can add a feature gate; the operator should know whether this restart needs to evict pods.
Opened an issue for this.
@aojea I think this is an issue for long-running operations, for example AI training jobs. With the current implementation, the message is basically: every cluster admin has to ensure that pods are safely drained (sketched below) or done executing before the kubelet gets restarted, if the restart is planned at all (it can also happen involuntarily, e.g. via NPD). So relabeling nodes would become a very hard task, and it is not necessarily about mistakes.
Relying here on a kubelet restart to make this happen does not sound like something which anyone is using today? To actually properly do this, there would be
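For reference, the manual workaround implied above, draining the node before a planned kubelet restart, looks roughly like this; the node name is a placeholder:

```shell
# Cordon the node and evict its pods before the planned kubelet restart.
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data

# ...restart the kubelet on the node, then make it schedulable again.
kubectl uncordon node-1
```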
What happened?
1. Node label changed.
The node label changed because an operations engineer tidied up node labels, such as hpc=true, and removed this label. When a pod's affinity matches that label, restarting the node's kubelet causes the pod to be rebuilt. This should not be the case; it affects the normal operation of the business.
2. #123971
3. #123816
In any of the above three scenarios, as soon as the kubelet is restarted the pod is evicted. Obviously this is not expected, and it is an undesirable result for online business.
What did you expect to happen?
Restarting a kubelet should not cause any disruption to the running workload (which will likely mean skipping admission for already-running pods, but let's not get ahead of ourselves); backlinking to this issue is probably still the best way forward.
How can we reproduce it (as minimally and precisely as possible)?
Set up any of the scenarios mentioned above, then restart the kubelet.
Anything else we need to know?
No response
Kubernetes version
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)