
Restarted kubelet should not evict pods due to node affinity #124586

Open
gjkim42 opened this issue Apr 28, 2024 · 18 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@gjkim42
Member

gjkim42 commented Apr 28, 2024

What happened?

Currently, a running pod is evicted when the kubelet on its assigned node restarts, if that node no longer meets the pod's node affinity requirements.

xref: #123980 (comment)

What did you expect to happen?

https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#node-affinity

(Screenshot of the node affinity section of the linked documentation.)

According to the documentation, node affinity should only affect the scheduling phase. Therefore, the running pod should continue to run without interruption.

How can we reproduce it (as minimally and precisely as possible)?

  1. Create a deployment with node affinity.
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: test
  name: test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      labels:
        app: test
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: foo
                operator: In
                values:
                - bar
      containers:
      - image: nginx
        imagePullPolicy: Always
        name: nginx
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
  2. kubectl label node TARGET_NODE foo=bar --overwrite so that the pod is scheduled on TARGET_NODE.
  3. Wait for the pod to be scheduled and running.
  4. kubectl label node TARGET_NODE foo=nonbar --overwrite
  5. Restart the kubelet of TARGET_NODE (see the command summary below).
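
For convenience, here are the steps above condensed into commands. This is a rough sketch, not an exact script: it assumes the Deployment manifest above is saved as test-deployment.yaml, that TARGET_NODE holds the target node name, and that the kubelet on that node is managed by systemd.

$ kubectl apply -f test-deployment.yaml
$ kubectl label node "$TARGET_NODE" foo=bar --overwrite
$ kubectl wait --for=condition=Ready pod -l app=test --timeout=120s
$ kubectl label node "$TARGET_NODE" foo=nonbar --overwrite
# on TARGET_NODE, restart the kubelet (systemd assumed)
$ sudo systemctl restart kubelet
# the previously Running pod is rejected by the restarted kubelet and replaced
$ kubectl get pods -l app=test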

Anything else we need to know?

previous discussion: #101218 (comment)

Kubernetes version

$ kubectl version
Client Version: v1.28.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.31.0-alpha.0.470+7d880fd489d1e5-dirty

Cloud provider

kind cluster

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@gjkim42 gjkim42 added the kind/bug Categorizes issue or PR as related to a bug. label Apr 28, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 28, 2024
@gjkim42
Member Author

gjkim42 commented Apr 28, 2024

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 28, 2024
@gjkim42
Member Author

gjkim42 commented Apr 28, 2024

// If the affinity requirements specified by this field are not met at
// scheduling time, the pod will not be scheduled onto the node.
// If the affinity requirements specified by this field cease to be met
// at some point during pod execution (e.g. due to an update), the system
// may or may not try to eventually evict the pod from its node.
// +optional
RequiredDuringSchedulingIgnoredDuringExecution *NodeSelector `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty" protobuf:"bytes,1,opt,name=requiredDuringSchedulingIgnoredDuringExecution"`

^
The API reference above, on the other hand, says that the system may or may not try to evict the pod due to node affinity.

@gjkim42
Member Author

gjkim42 commented Apr 28, 2024

We can fix either the behavior or the documentation.

The fix itself seems simple, but it may break many existing use cases, so we may need some discussion about this.

@HirazawaUi
Contributor

/cc

@ffromani
Contributor

/triage accepted

In hindsight, the labelling part (xref: #123980 (comment)) deserves its own issue.

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 29, 2024
@AxeZhan
Member

AxeZhan commented Apr 29, 2024

/cc

@Homura222

/cc

@SergeyKanzhelev SergeyKanzhelev added this to Triage in SIG Node Bugs May 1, 2024
@AnishShah AnishShah moved this from Triage to Triaged in SIG Node Bugs May 1, 2024
@AnishShah
Contributor

/assign @gjkim42

@gjkim42
Member Author

gjkim42 commented May 1, 2024

/cc @liggitt @alculquicondor

since we had a similar discussion before in #101218 (comment)

What do you think about this?

@alculquicondor
Member

I guess the definition of scheduling time is a bit fuzzy.

The verification is happening both in the scheduler and in kubelet, at startup, but not during execution.

Yes, we can update the documentation to be more clear, but we can't change the behavior.

@rmohr
Contributor

rmohr commented May 6, 2024

The verification is happening both in the scheduler and in kubelet, at startup, but not during execution.

Yes, we can update the documentation to be more clear, but we can't change the behavior.

Do you mean you cannot change the behaviour to support IgnoredDuringExecution as documented right now? If so, could you elaborate on that a little bit?

@alculquicondor
Member

We are ignoring during execution.
In the case reported, the pod is marked as Failed before execution.

In other words, we are honoring what the API says. But there is an intermediate stage that the API doesn't say anything about: the time between pod scheduled and before it starts executing.

That's what needs to be clarified in the documentation, and it should match the existing behavior that has been there since this feature first launched. We cannot change that behavior or it would be backwards incompatible.

@rmohr
Contributor

rmohr commented May 6, 2024

In the case reported, the pod is marked as Failed before execution.

Ok, trying to rephrase in my own words to be sure I understand:

  1. kubelet stops
  2. pod gets scheduled (but not started yet)
  3. kubelet starts and sees that the pod should not be started here (and is not yet executing)
  4. kubelet stops it

That sounds fair to keep.

@Homura222

Homura222 commented May 9, 2024

If the pod uses preferredDuringSchedulingIgnoredDuringExecution nodeAffinity, the pod will not be killed in this case. Is this expected?

@alculquicondor
Member

Yes

@rmohr
Contributor

rmohr commented May 9, 2024

@alculquicondor, @Homura222 ran some experiments (kubevirt/kubevirt#11843 (comment)) and it turns out that requiredDuringSchedulingIgnoredDuringExecution does, however, evict already-running pods when the kubelet restarts. Is this also expected? This does not seem to fall into this category:

In other words, we are honoring what the API says. But there is an intermediate stage that the API doesn't say anything about: the time between pod scheduled and before it starts executing.

and I would expect that it "ignores" the label changes during the execution phase of the pod. Any thoughts on that?
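
For reference, one way to confirm what the restarted kubelet did to the pod. This is a hedged sketch: it assumes the app=test label from the reproduction above, and that the rejection is recorded in the pod's status.phase/status.reason/status.message fields.

$ kubectl get pods -l app=test
# inspect the rejected pod's status (replace FAILED_POD with the actual pod name)
$ kubectl get pod FAILED_POD -o jsonpath='{.status.phase}{"  "}{.status.reason}{"  "}{.status.message}{"\n"}'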

@gjkim42
Member Author

gjkim42 commented May 10, 2024

In other words, we are honoring what the API says. But there is an intermediate stage that the API doesn't say anything about: the time between pod scheduled and before it starts executing.

Then, is it OK to evict the executing (running) pod only after restarting the kubelet?
This seems like odd behavior to me. If we want to apply the validation right before starting the pod, we should not validate an already-running pod again just because the kubelet restarted.

If we have to document the behavior as it works now: requiredDuringSchedulingIgnoredDuringExecution affects the pod at scheduling time, at pod start time, AND at kubelet restart time. (Kubelet restart time? Why?)

@alculquicondor
Member

@SergeyKanzhelev can you chime in on the kubelet details?
