Pods fail with "NodeAffinity failed" after kubelet restarts #100467

Closed
ruiwen-zhao opened this issue Mar 23, 2021 · 28 comments
Labels: kind/bug, needs-triage, sig/node, triage/needs-information

Comments

@ruiwen-zhao (Contributor)

What happened:

The issue is basically the same as #92067.

With the fix #94087 in place, kubelet waits for the node lister to sync in GetNode().

However, in the case of kubelet restart, the pods scheduled on the node before the restart might still fail with "NodeAffinity failed" after the restart. Looking at the code, this is probably because the admit pod check (canAdmitPod()) might happen before GetNode().
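
A minimal Go sketch of the suspected ordering problem (hypothetical toy types; not the actual kubelet code, though the method names mirror kubelet's GetNode() and canAdmitPod()):

package main

import (
	"fmt"
	"time"
)

// Hypothetical stand-ins for kubelet internals; an illustration of the
// suspected ordering problem, not the real kubelet implementation.
type node struct{ labels map[string]string }

type nodeLister struct {
	synced  chan struct{} // closed once the informer cache has synced
	initial *node         // placeholder node visible right after restart
	full    *node         // fully-labeled node, available after sync
}

// getNode models the fixed GetNode(): it blocks until the lister syncs.
func (l *nodeLister) getNode() *node {
	<-l.synced
	return l.full
}

// canAdmitPod models the suspected bug: it does not wait for the sync,
// so right after a restart it can observe a node without its labels and
// reject a previously running pod with "Predicate NodeAffinity failed".
func (l *nodeLister) canAdmitPod(requiredLabel string) bool {
	select {
	case <-l.synced:
		return l.full.labels[requiredLabel] != ""
	default:
		return l.initial.labels[requiredLabel] != "" // stale, unlabeled view
	}
}

func main() {
	l := &nodeLister{
		synced:  make(chan struct{}),
		initial: &node{labels: map[string]string{}},
		full:    &node{labels: map[string]string{"kubernetes.io/hostname": "node-1"}},
	}

	// Simulate the informer finishing its sync shortly after the restart.
	go func() {
		time.Sleep(100 * time.Millisecond)
		close(l.synced)
	}()

	// Admission that runs before the sync fails the affinity check...
	fmt.Println("admitted before sync:", l.canAdmitPod("kubernetes.io/hostname")) // false

	// ...while a check gated on getNode() sees the labels and would pass.
	n := l.getNode()
	fmt.Println("admitted after sync:", n.labels["kubernetes.io/hostname"] != "") // true
}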

What you expected to happen:

After a kubelet restart, old pods (pods scheduled on the node before the restart) do not see "NodeAffinity failed".

How to reproduce it (as minimally and precisely as possible):

This issue does not happen all the time. To reproduce it, you will need to keep restarting the kubelet, and you might see a previously running Pod start to fail with "Predicate NodeAffinity failed".
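
For example, on a node where kubelet is managed by systemd, a loop along these lines will eventually trigger it (illustrative commands; adjust the service name and sleep interval for your setup):

# Run on the node: keep bouncing kubelet.
while true; do
  sudo systemctl restart kubelet
  sleep 30
done

# Run from a machine with cluster access: watch for failing pods.
kubectl get pods --all-namespaces -o wide -w | grep NodeAffinity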

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:
@ruiwen-zhao ruiwen-zhao added the kind/bug Categorizes issue or PR as related to a bug. label Mar 23, 2021
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 23, 2021
@k8s-ci-robot (Contributor)

@ruiwen-zhao: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

@BenTheElder (Member)

cc @neolit123 @ehashman
/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 23, 2021
@phantooom (Contributor)

/assign

@neolit123 (Member)

thank you for reporting this @ruiwen-zhao

@neolit123 (Member)

neolit123 commented Mar 23, 2021

However, in the case of kubelet restart, the pods scheduled on the node before the restart might still fail with "NodeAffinity failed" after the restart. Looking at the code, this is probably because the admit pod check (canAdmitPod()) might happen before GetNode().

from my understanding this should be resolved by #99336 in its current state?

@lwsanty

lwsanty commented Apr 27, 2021

The issue still reproduces in GKE 1.19.8-gke.1600.

@Keramblock

The issue still reproduces in GKE 1.19.8-gke.1600.

same

@neolit123 (Member)

GKE builds are proprietary.

for upstream Kubernetes, once these cherry-picks merge you should check the latest respective PATCH releases:
#99336 (comment)

@Keramblock

@neolit123 Still, they are based on open source, aren't they?

@neolit123 (Member)

my point is that i cannot give you a timeline of when the GKE builds will be available and you should consult with GKE support.

k8s patch releases should be out on the 12th of May:
https://groups.google.com/g/kubernetes-dev/c/H06vjjSzX44/m/rZcBO0_rAAAJ

@Keramblock

Oh, ok. Thank you =)

@primeroz

FYI this is also present in GKE 1.18.17-gke.700. I had hoped they would backport the patch, since the .700 release was promoted to the stable channel yesterday, but that is not the case.

Luckily for us, this is only an issue with preemptible nodes, since a preemption is effectively a node restart.

Will wait for 1.18.19 impatiently. 🤞

@sbocinec

We were first affected by this issue after upgrading our GKE cluster from v1.17.17-gke.2800 to v1.18.17-gke.700, for pods running on preemptible nodes. Is this specific to k8s 1.18+?

@primeroz

primeroz commented May 26, 2021

Same here; as far as I know this is fixed in 1.18.19.

The fix in #99336 (comment) was cherry-picked to 1.18 in #101343.

It also affects versions up to 1.21, btw; check that PR to see the commit for each version.

@pacoxu (Member)

pacoxu commented Jun 25, 2021

It should be fixed in v1.18.19, v1.19.10, v1.20.7 and v1.21.1.

As for upgrading GKE, I think that should be asked of GKE support.
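
Since the fix is release-dependent, a quick way to confirm what your nodes are actually running (it is the kubelet version that matters here):

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.kubeletVersion}{"\n"}{end}'
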
/triage duplicate
/close

@k8s-ci-robot k8s-ci-robot added the triage/duplicate Indicates an issue is a duplicate of other open issue. label Jun 25, 2021
@k8s-ci-robot (Contributor)

@pacoxu: Closing this issue.

@username1366

After upgrading GKE to v1.18.19-gke.1700 I experienced the same issue: after node preemption, some of the pods moved to NodeAffinity status.

kubectl get pods -o wide --all-namespaces | grep NodeAffinity
app              app-cd5d5595f-tkw9p                          0/5     NodeAffinity

@primeroz

On GKE I also tested with v1.19.10-gke.1600 and I am getting plenty of NodeAffinity pods.

@sbocinec

sbocinec commented Jul 26, 2021

Tested with the latest v1.18.20-gke.900 and the issue is still present. @phantooom is it possible to reopen the issue, as the fix appears not to resolve it?

@miguepintor

On GKE v1.19.11-gke.2101 it is reproducible as well; please @phantooom consider reopening.

@alculquicondor (Member)

/reopen
@neolit123 any ideas?

@k8s-ci-robot k8s-ci-robot reopened this Aug 4, 2021
@k8s-ci-robot (Contributor)

@alculquicondor: Reopened this issue.

@neolit123 (Member)

sounds like something that regressed after the node sync changes, but the second change (the one i did) did not fix it.

the change that i did:
#99336

was technically a refactor on what was already established by the previous change:
#94087

However, in the case of kubelet restart, the pods scheduled on the node before the restart might still fail with "NodeAffinity failed" after the restart. Looking at the code, this is probably because the admit pod check (canAdmitPod()) might happen before GetNode().

this seems racy and should be brought to discussion at the SIG Node meeting
https://github.com/kubernetes/community/tree/master/sig-node#meetings

kubelet maintainers who are more savvy should be able to reproduce it:

This issue does not happen all the time. To reproduce it, you will need to keep restarting the kubelet, and you might see a previously running Pod started to fail with "Predicate NodeAffinity failed".

we have a lot of GKE reporters in this ticket. has anyone seen the problem on non-GKE clusters?

@ehashman (Member)

ehashman commented Aug 4, 2021

/remove-triage duplicate
/triage needs-information

@k8s-ci-robot k8s-ci-robot added triage/needs-information Indicates an issue needs more information in order to work on it. and removed triage/duplicate Indicates an issue is a duplicate of other open issue. labels Aug 4, 2021
@ehashman ehashman added this to Needs Information in SIG Node Bugs Aug 5, 2021
@Tfmenard

Tfmenard commented Aug 12, 2021

For affected GKE users, the graceful node termination feature fixes the issue; it is enabled on clusters running node pools on 1.20+.

Note that this issue has little to no impact on workloads. As long as the pod is backed by a controller (Deployment/StatefulSet, etc.), when a pod runs into the NodeAffinity issue a replacement pod is immediately created and rescheduled.
See https://issuetracker.google.com/185362914 for details.
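
The stale failed pods do stick around, though; a hedged cleanup, assuming the affected pods report phase Failed (which pods shown with a NodeAffinity status do):

# Remove failed pods in a namespace; their controllers have already
# created replacements, so only the stale objects are deleted.
kubectl delete pods -n <namespace> --field-selector=status.phase=Failed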

In case this issue comes back, to reproduce it simply create a cluster with a node pool that runs preemptible VMs and deploy the simple deployment [1] below, which uses a nodeSelector.
Then, to simulate a preemption, run:
gcloud compute instances simulate-maintenance-event <node-name> --zone <zone name>
If you're lucky the issue will occur on the first preemption, but it may only occur on the 10th one.
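
While the preemption plays out, a watch like this (assuming the na-test deployment below) surfaces the failures as they appear:

# Failed pods from the test deployment show STATUS "NodeAffinity".
kubectl get pods -l role=na-test -w | grep NodeAffinity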

I think we should close this one as we've only seen this occur on GKE and a fix is out now for GKE.
Please reopen if you experience the issue on GKE node pools running 1.20+ or on non-GKE clusters.

[1]

apiVersion: apps/v1
kind: Deployment
metadata:
  name: na-test
spec:
  replicas: 5
  selector:
    matchLabels:
      role: na-test
  strategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1000
    type: RollingUpdate
  template:
    metadata:
      labels:
        role: na-test
    spec:
      containers:
      - image: busybox
        command:
          - sh
          - -c
          - 'echo "NodeAffinity test"; sleep 300;'
        imagePullPolicy: IfNotPresent
        name: busybox
      nodeSelector:
        cloud.google.com/gke-nodepool: <node pool name>

/close

@k8s-ci-robot (Contributor)

@Tfmenard: You can't close an active issue/PR unless you authored it or you are a collaborator.

@SergeyKanzhelev (Member)

/close

@k8s-ci-robot (Contributor)

@SergeyKanzhelev: Closing this issue.

SIG Node Bugs automation moved this from Needs Information to Done Aug 12, 2021