Graceful Node Shutdown does not update endpoints for terminating pods #116965
/sig node
/assign
I don't know your backend service implementation, but routing only to healthy backends is a significant feature. The terminating pod in the EndpointSlice is not ready, as described below, so new traffic shouldn't be routed to it. Still, it seems unreasonable to keep the terminating pod in the EndpointSlice; it should behave like Endpoints.
It will be important to check the logs of the endpointslice controller in the controller-manager and correlate with the pod and node states.
Terminating pods appearing in the EndpointSlice might be intentional due to KEP-1672, though that's just a guess from the name. However, I'm not seeing what you posted above. I updated the repro a little bit to print the preStop hook sequence in the log and to increase the grace period; it doesn't materially change anything. Here's the video of what happens from the repo I linked, on my computer, with controller logs: https://drive.google.com/file/d/1-_hsvkJq3cOtLxEzQ2R8Fy4MEXka2Z82/view?usp=sharing
Do you see the terminating flag value on the EndpointSlice (https://kubernetes.io/docs/concepts/services-networking/endpoint-slices/#terminating) changed for the terminating pod?
/triage needs-information
Also, if you can share kubelet.log - and @aojea requested the logs of the endpointslice controller - it will be helpful.
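(For reference, one way to inspect those endpoint condition values is a small client-go program like the sketch below. The my-service name and default namespace are assumptions taken from the repro, and the program expects a kubeconfig at the default path.)

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func deref(b *bool) bool { return b != nil && *b }

func main() {
	// Build a client from the default kubeconfig path (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// EndpointSlices carry a label pointing back at the owning Service.
	slices, err := client.DiscoveryV1().EndpointSlices("default").List(context.TODO(),
		metav1.ListOptions{LabelSelector: "kubernetes.io/service-name=my-service"})
	if err != nil {
		panic(err)
	}
	for _, slice := range slices.Items {
		for _, ep := range slice.Endpoints {
			fmt.Printf("%v ready=%t serving=%t terminating=%t\n", ep.Addresses,
				deref(ep.Conditions.Ready), deref(ep.Conditions.Serving),
				deref(ep.Conditions.Terminating))
		}
	}
}
```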
I captured the logs in the video I posted above as well. Here are the logs pasted (from a second attempt), from around the time I shut down the node:
kube-controller-manager.log
kubelet.log
All the components can increase the verbosity level with the --v flag.
The endpointslice reflects the state of the Pod - it literally checks the pod state, the same as you are doing with kubectl. You'll need to check the pod at that time: it should be ready, and if not, there is a problem we need to investigate more.
The pod also reports that everything is great while shutting down; AFAICT the pod never even gets marked as not ready.
Seems the code around pod status merging is at least partially responsible for the dropped API server update: the shutdown manager code only sets a terminal phase. Not sure what the right fix is here, but I think that's the root cause? If I'm understanding the code right, this would affect pod eviction and any other kubelet-led graceful pod shutdown.
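(A minimal sketch of the behavior described above, assuming the k8s.io/api types; markShutdown is a hypothetical stand-in for the kubelet's status mutation, not the actual source. It sets only a terminal phase/reason and never touches the PodReady condition, so the API server keeps seeing a "ready" pod.)

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// markShutdown stands in for the status mutation applied during graceful
// node shutdown: a terminal phase and reason are set, but status.Conditions
// (including PodReady) is left untouched.
func markShutdown(status *v1.PodStatus) {
	status.Phase = v1.PodFailed
	status.Reason = "Terminated"
	status.Message = "Pod was terminated in response to imminent node shutdown."
}

func main() {
	status := v1.PodStatus{
		Phase: v1.PodRunning,
		Conditions: []v1.PodCondition{
			{Type: v1.PodReady, Status: v1.ConditionTrue},
		},
	}
	markShutdown(&status)
	// Prints: phase=Failed ready=True - the pod still looks "ready" while failed.
	fmt.Printf("phase=%s ready=%s\n", status.Phase, status.Conditions[0].Status)
}
```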
@SergeyKanzhelev would you need anything else from me to triage this? This may be a bit presumptuous but I noticed @mimowo & @smarterclayton worked on this area recently. Would either of you have the time to see if my diagnosis in the previous comment is correct/useful?
/triage accepted
Xref to the KEP: kubernetes/enhancements#2000 - to consider when going GA. Will this problem be the same when we try to delete a Pod for some eviction reason, then? The PR is changing the existing tests - we need to understand that the proposed fix is not regressing some other behavior.
/remove-triage needs-information
Yes - I believe so. All kubelet-led evictions only set a terminal phase and expect that to propagate to the API server to update readiness conditions. However, that currently doesn't happen, since that update is squelched until the containers finish terminating. The overall issue comes from the fact that "terminating" in the endpoints world doesn't match "terminating" in the pod world (i.e. the kubelet can be terminating a pod's containers while, in the API, the pod's DeletionTimestamp is not set).
Can you elaborate? Is the pod ready during that time?
I'm not sure that I know the code well enough, but the core problem is the title of this ticket. Basically you'll have to define what you mean by "ready" - the readiness probe might be succeeding during this time just fine, but in endpoints-land "ready" is computed from the pod object (see kubernetes/pkg/controller/endpointslice/utils.go, lines 42 to 46 at cb7acfd).
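(Roughly, that code derives all three endpoint condition booleans from the Pod object alone. The sketch below is a simplified reconstruction, not the verbatim source - the real controller uses podutil.IsPodReady and the Service's publishNotReadyAddresses field. The key point: "terminating" comes solely from DeletionTimestamp, which graceful node shutdown never sets.)

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// isPodReady inlines the PodReady condition check (the real controller uses
// podutil.IsPodReady from k8s.io/kubernetes/pkg/api/v1/pod).
func isPodReady(pod *v1.Pod) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type == v1.PodReady {
			return c.Status == v1.ConditionTrue
		}
	}
	return false
}

// endpointConditions approximates how the endpointslice controller derives
// the per-endpoint booleans from the Pod alone.
func endpointConditions(pod *v1.Pod, publishNotReadyAddresses bool) (ready, serving, terminating bool) {
	serving = isPodReady(pod)
	terminating = pod.DeletionTimestamp != nil
	// "ready" must never be true for a terminating endpoint.
	ready = publishNotReadyAddresses || (serving && !terminating)
	return
}

func main() {
	// A pod mid graceful-node-shutdown: readiness still passing, and no
	// DeletionTimestamp, because the kubelet never deletes the API object.
	pod := &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "my-service-pod"},
		Status: v1.PodStatus{
			Conditions: []v1.PodCondition{{Type: v1.PodReady, Status: v1.ConditionTrue}},
		},
	}
	ready, serving, terminating := endpointConditions(pod, false)
	fmt.Println(ready, serving, terminating) // true true false - traffic keeps flowing
}
```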
There is probably a larger discussion to be had about the assumption encoded in kubernetes/pkg/controller/controller_utils.go, lines 987 to 991 at 2c6c456.
When executing a shutdown, if the connection between the kubelet and kube-apiserver is lost, any attempt to delete pods or mark services as unhealthy will be ineffective, because the kube-apiserver cannot apply the update. The kube-apiserver is typically deployed as a static pod or a systemd service, and it is crucial to ensure that it is the last component to stop, so that pod status can still be updated in the cluster; the kube-apiserver static pod should be given the highest shutdown priority.
And when the kube-apiserver cannot be reached but the node has already shut down, the nodelifecycle controller will act on this and should mark the service as unhealthy - that is what happened in #109998.
@yylt You might be discussing a different problem there. This issue is not about shutting down the API server - the "graceful node shutdown" here is of a worker node, not an API server node. Furthermore, the network is fine in my case - this is "normal operation" of the graceful node shutdown feature, not related to network conditions or anything else happening to the API server or really anywhere else.
@mikekap could you check if this issue reproduces on 1.27? There were pretty important changes for the pod lifecycle. In particular, in #115331 we make sure kubelet transitions a terminating pod to a terminal phase before actual deletion (on 1.26 the pod might have been deleted in the Running phase). This was required for Jobs, to make sure the Job controller observes the pod as terminated, as it uses finalizers. However, I'm not sure about the endpointslice controller.
"Ready to serve traffic" means the pod is ready and not terminating: it keeps serving existing traffic, but we don't send any new traffic to a pod that we know will eventually disappear and blackhole traffic, creating disruption.
What discussion? DeletionTimestamp IS the marker for "pod is terminating"; if not, we have a big problem. Components communicate through the API objects that represent state, and objects have a lifecycle (kubernetes/community#6787). If the pod is being deleted, that has to be reflected in the API object so all the components act accordingly.
The kubelet evicts pods using a function that triggers termination of the containers, so the pods are terminating in a sense, but I don't think that function sets DeletionTimestamp. I suppose it might be a bug in the kubelet, but quite an old one; if so, it might be worth considering changes in other controllers. cc @bobbypage @smarterclayton @alculquicondor for insights.
EDIT: still, it would be good to check whether this remains an issue e2e on 1.27, as suggested in #116965 (comment).
Just reproduced the same behavior in 1.27.1. The repro repo in the first comment is updated to 1.27.1 now. When sending an ACPI shutdown, the pod begins shutdown (the preStop hook runs), but while the hook is running the pod continues to be reported as ready in the endpointslice.
Re: the discussion - before this issue, I was under the same impression, that DeletionTimestamp was the marker for a terminating pod.
@mikekap thanks for reproducing - at least we know the new version does not help. I'm not sure what the "right approach" is. IIUC the comment suggests the kubelet should set the DeletionTimestamp. Maybe some sig-node folks can give more context on whether this is a bug, or whether there are good reasons for the current behavior.
Hi, we also encounter this issue on 1.27.6. What we see is that after the node shutdown is triggered, the pod is still Running during the preStop lifecycle, and also when SIGTERM is sent; then the pod transitions to Completed. The endpoint is updated when a pod passes into the terminating state, but because during shutdown it never transitions to Terminating, traffic is still sent to this pod until it is Completed, which causes errors.
/remove-triage accepted
Can someone from SIG node re-triage this? IIUC, the question is how/when the kubelet reports to the apiserver that the Pod is terminating when doing a graceful node shutdown, so that other controllers can react.
Double-check that your service has a SIGTERM handler that sets the readiness endpoint to false. Some of the behavior described on this ticket is expected: if the readiness endpoint is never set to false, the pod is expected to stay ready until the entire lifecycle has terminated. The kubelet also needs to be set up so that, when running under systemd, the dependency chain has the kubelet terminate before networking does.
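(For anyone hitting this, a minimal Go sketch of such a SIGTERM handler; the port and /readyz path are illustrative assumptions, not from this thread.)

```go
package main

import (
	"log"
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
)

func main() {
	var shuttingDown atomic.Bool

	// On SIGTERM, flip readiness to false so the readiness probe starts
	// failing and endpoint controllers can drain this pod while in-flight
	// requests keep being served.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)
	go func() {
		<-sigs
		shuttingDown.Store(true)
	}()

	http.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if shuttingDown.Load() {
			http.Error(w, "shutting down", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

The pod spec's readinessProbe would then point at /readyz, and the pod drops out of rotation once the probe fails (subject to the probe's period and failure threshold).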
Also note, the preStop hook blocks the pod termination until it has completed.
https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/#hook-handler-execution
@rphillips I don't think this is the specced behavior: readiness probes auto-fail when pods start shutting down normally. See this page: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-probes
Because of a bug (#124648), the readiness probe doesn't work during pod termination triggered inside the kubelet, such as eviction and node shutdown. This may be one of the reasons why an endpoint is not updated early.
@rphillips @smarterclayton it seems the main problem here is the lack of an update on the status: once the status on eviction is propagated (Terminating or Failing), Services will work correctly. We had a similar issue in the past: #110255.
So zooming out: GracefulShutdown by itself does not say whether the node is coming back or not - it is the responsibility of the agent that "owns" the node (cloud provider, human admin, VM orchestrator, whatever) to determine that. So graceful shutdown by itself can make no decision on whether to delete the pods; it can only correctly represent that the pod is shutting down and not ready. GracefulShutdown was not intended to replace a controlled drain operation on the node. If the use case described here assumes that graceful shutdown will guarantee that pods are terminated, that is not generally true and not the design we intended. Any "set delete timestamp" action would have to be opt-in and probably not the responsibility of the kubelet generally - that responsibility today belongs to the initiator of the "node going away permanently" action (cloud provider, orchestrator, human), and I could argue it should stay there.

I think we can say that GracefulShutdown is intended to end the processes of pods as best effort within a specific timeframe and report the status of those pods appropriately. At minimum, the pods should be described as not ready, and the containers should be described as stopped with terminal exit codes and/or phase (RestartAlways pods should be left in Running phase, because there was no signal that the node is not coming back). But it is best effort - being part of a distributed system, it cannot guarantee that those status updates will propagate.

The remaining question is whether the node is doing a good enough job during graceful shutdown. I think the answer, given this thread, is "no" - I would expect the following (as I commented on the other bug):

Anything I'm missing? Does this match everyone's experience and expectations?
One piece I might want to clarify is point 2: this isn't entirely complete - the endpoint slices that include these pods need to be updated correctly. Specifically, there are 3 booleans currently exposed: "ready", "serving", and "terminating". For pods on nodes undergoing graceful node shutdown these should be: "ready" false, "terminating" true, and "serving" tracking the readiness probe result (sketched after this comment).

Unfortunately, doing this in the endpoint controller is impossible today, since the kubelet doesn't expose any indicator that the pod is shutting down. There may be a disruption condition now - so maybe this bug is about having the endpoint controller use that.

Separately - and this is just a non-advanced end-user opinion that you can definitely ignore - IMO kubelet eviction should use pod deletion. The problem is that doing anything else ultimately violates the principle of least surprise. During eviction the pod is "Terminating", but unless everyone remembers that there are 2 ways of checking that - deletion timestamp and a disruption condition (or something similar) - there will be bugs everywhere. Even kubectl doesn't do this right now (to show "Terminating"). Nobody expects "kubectl drain" and "sudo halt" to do anything different - it's a nasty surprise.

I'm not totally sure I understand what's gained by leaving the undead pods around. In theory you might be able to restart them if the node comes back up, but even if that doesn't happen, the parent controller should be recreating them, right? Then is this done to protect "fully uncontrolled" pods (i.e. non-static and non-daemonset/replicaset/statefulset/etc.)? Those are super weird - you can't even use "kubectl delete" to restart them, right? I readily admit I have not run kube on bare metal or with pet nodes (only cattle nodes), so I'd love to understand how this feature helps folks. Ultimately, if allowing kubelets to actually delete pods is a breaking change in the kube API guarantees, IMO it's worth doing; it likely fixes more bugs than it creates.

All that said, the endpoint slice controller behavior is what this bug report is about. So long as that readiness starts reporting false when termination begins, that works to fix this issue.
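(In discovery/v1 terms, the condition values proposed above would look roughly like this sketch; the helper name is made up.)

```go
package main

import (
	"fmt"

	discovery "k8s.io/api/discovery/v1"
)

// desiredShutdownConditions sketches the endpoint conditions proposed above
// for a pod on a node undergoing graceful shutdown: not ready, terminating,
// and serving only while the readiness probe still passes.
func desiredShutdownConditions(readinessProbePassing bool) discovery.EndpointConditions {
	ready, terminating := false, true
	serving := readinessProbePassing
	return discovery.EndpointConditions{
		Ready:       &ready,
		Serving:     &serving,
		Terminating: &terminating,
	}
}

func main() {
	c := desiredShutdownConditions(true)
	fmt.Printf("ready=%t serving=%t terminating=%t\n", *c.Ready, *c.Serving, *c.Terminating)
}
```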
What happened?
When graceful node shutdown is enabled, I expect that services are drained appropriately during pod shutdown (i.e. the same way that kubectl delete pod would work). However this doesn't seem to be the case - the pod never enters the Terminating state according to kubectl and doesn't get removed from the service endpoints until it enters the Completed or Failed state. This causes connection errors and lost requests in Cluster-routed services.

What did you expect to happen?

After pod termination starts, the EndpointSlice should be updated to have the pod that is terminating removed or marked terminating, even before the pod is fully shut down. This is what happens when a pod gets removed for any other reason (e.g. kubectl delete pod).

How can we reproduce it (as minimally and precisely as possible)?
You can see the repro at https://github.com/mikekap/vagrant-kubeadm-kubernetes/tree/graceful-shutdown-repro. The repro sets up a 2-node cluster with graceful shutdown enabled and a simple deployment/service.
Here is the deployment/service yaml, pasted again:
To reproduce:
1. Run watch -n 1 'kubectl describe endpointslice my-service-5j2nw' in a separate terminal (you'll need to figure out the right name unfortunately).
2. Shut down the worker node (e.g. send an ACPI shutdown).

Within 20 seconds (the preStop hook duration), the EndpointSlice should be updated to have the pod that is terminating removed. This is what happens when a pod gets removed for any other reason (e.g. kubectl delete pod).

To ensure that everything else worked, you can reboot the node and see the journal logs (via something like journalctl -b -1) of the shutdown sequence triggering the kubelet graceful shutdown logic.

Anything else we need to know?
No response
Kubernetes version
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)