Pods fail with "NodeAffinity failed" after kubelet restarts #100467
Comments
@ruiwen-zhao: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance. The triage/accepted label can be added by org members by writing /triage accepted in a comment. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
cc @neolit123 @ehashman
/assign
thank you for reporting this @ruiwen-zhao
from my understanding this should be resolved by #99336 in its current state?
the issue still reproduces in GKE
same
GKE builds are proprietary. for upstream Kubernetes, once these cherry picks merge, you should check the latest respective PATCH releases:
@neolit123 Still, they are based on open source, aren't they?
my point is that I cannot give you a timeline for when the GKE builds will be available; you should consult GKE support. k8s patch releases should be up on the 12th of May:
Oh, ok. Thank you =)
FYI this is also present in GKE. Luckily, for us, this is only an issue with… Will wait for 1.18.19 impatiently. 🤞
We were first affected by this issue after upgrading our GKE cluster from…
Same here. As far as I know this is fixed in 1.18.19: the fix in #99336 (comment) was cherry-picked to 1.18 in #101343. It also affects versions up to 1.21 btw; check that PR to see the commit for each version.
@pacoxu: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
After upgrading GKE to v1.18.19-gke.1700 I experienced the same issue: after node preemption, some of the pods moved to NodeAffinity status.
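Pods rejected this way are left behind in phase Failed, with NodeAffinity shown in the STATUS column. A quick way to spot the leftovers and then clean them up by hand, assuming nothing has garbage-collected them yet:

```shell
# List failed pods across all namespaces; pods rejected by kubelet
# admission show NodeAffinity as their status reason.
kubectl get pods --all-namespaces --field-selector=status.phase=Failed | grep NodeAffinity

# Once confirmed, the leftover Failed pods can be deleted; the owning
# controller has already created replacements.
kubectl delete pod -n <namespace> <pod-name>
```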
On GKE I also tested with…
Tested with the latest… on GKE.
/reopen
@alculquicondor: Reopened this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
sounds like something that regressed after the node sync changes, but the second one (that I did) did not fix it. the change that I did was technically a refactor of what was already established by the previous change:
this seems racy and should be brought to discussion at the SIG Node meeting. kubelet maintainers that are more savvy should be able to reproduce it:
we have a lot of GKE reporters in this ticket. has anyone seen the problem on non-GKE clusters?
/remove-triage duplicate
For affected GKE users, the graceful node termination feature fixes the issue and is enabled on clusters running node pools on 1.20+.

Note that this issue has little to no impact on workloads: as long as the pod is backed by a controller (Deployment, StatefulSet, etc.), when a pod runs into the NodeAffinity issue a new pod is immediately created and rescheduled.

In case this issue comes back, to reproduce it simply create a cluster with a node pool that runs preemptible VMs and deploy this simple deployment [1] that uses nodeSelector.

I think we should close this one as we've only seen this occur on GKE and a fix is out now for GKE.

[1]
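The link behind [1] wasn't preserved in this thread. A minimal sketch of such a deployment, assuming the standard GKE preemptible-node label and placeholder names, could look like this:

```shell
# Deploys pods pinned to preemptible nodes via nodeSelector, which the
# kubelet checks as required node affinity at admission time.
# The deployment name and image are placeholders;
# cloud.google.com/gke-preemptible is the label GKE sets on preemptible nodes.
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nodeaffinity-repro
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nodeaffinity-repro
  template:
    metadata:
      labels:
        app: nodeaffinity-repro
    spec:
      nodeSelector:
        cloud.google.com/gke-preemptible: "true"
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
EOF
```

With this in place, waiting for (or triggering) preemptions on the node pool should eventually surface pods stuck in NodeAffinity status.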
/close
@Tfmenard: You can't close an active issue/PR unless you authored it or you are a collaborator. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/close
@SergeyKanzhelev: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What happened:
The issue is basically the same as #92067.
With the fix #94087 in place, the kubelet waits for the node lister to sync in GetNode().
However, in the case of a kubelet restart, pods scheduled on the node before the restart might still fail with "NodeAffinity failed" after the restart. Looking at the code, this is probably because the admit-pod check (canAdmitPod()) can happen before GetNode(), so admission runs against a node object that has not been synced yet.
What you expected to happen:
After a kubelet restart, old pods (pods scheduled on the node before the restart) should not fail with "NodeAffinity failed".
How to reproduce it (as minimally and precisely as possible):
This issue does not happen all the time. To reproduce it, you will need to keep restarting the kubelet, and you might see a previously running Pod start to fail with "Predicate NodeAffinity failed".
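For example, on a node with a systemd-managed kubelet (an assumption; adjust to however your nodes run the kubelet), a loop like this, run over SSH on the node hosting the test pods, can widen the window for the race:

```shell
# Repeatedly restart the kubelet to widen the window for the race between
# pod admission (canAdmitPod()) and the node lister sync in GetNode().
while true; do
  sudo systemctl restart kubelet
  sleep 30  # let the kubelet come back up and re-admit the pods
done
```

In another terminal, `kubectl get pods --watch` can be left running to catch previously Running pods flipping to NodeAffinity status.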
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version):
- OS (e.g. cat /etc/os-release):
- Kernel (e.g. uname -a):