
Restarted kubelet should not evict pods due to node affinity #124586

Open
gjkim42 opened this issue Apr 28, 2024 · 18 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@gjkim42
Member

gjkim42 commented Apr 28, 2024

What happened?

Currently, a running pod is evicted when the kubelet on its assigned node restarts, if that node no longer meets the pod's node affinity requirements.

xref: #123980 (comment)

What did you expect to happen?

https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#node-affinity

(Screenshot of the node affinity section of the linked documentation.)

According to the documentation, node affinity should only affect the scheduling phase. Therefore, the running pod should continue to run without interruption.

How can we reproduce it (as minimally and precisely as possible)?

  1. Create a deployment with node affinity.
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: test
  name: test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      labels:
        app: test
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: foo
                operator: In
                values:
                - bar
      containers:
      - image: nginx
        imagePullPolicy: Always
        name: nginx
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
  2. kubectl label node TARGET_NODE foo=bar --overwrite so that the pod is scheduled on TARGET_NODE.
  3. Wait for the pod to be scheduled and running.
  4. kubectl label node TARGET_NODE foo=nonbar --overwrite
  5. Restart the kubelet of TARGET_NODE (see the command summary below).
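
For convenience, here are the steps above condensed into commands. This is a rough sketch, not an exact script: it assumes the Deployment manifest above is saved as test-deployment.yaml, that TARGET_NODE holds the target node name, and that the kubelet on that node is managed by systemd.

$ kubectl apply -f test-deployment.yaml
$ kubectl label node "$TARGET_NODE" foo=bar --overwrite
$ kubectl wait --for=condition=Ready pod -l app=test --timeout=120s
$ kubectl label node "$TARGET_NODE" foo=nonbar --overwrite
# on TARGET_NODE, restart the kubelet (systemd assumed)
$ sudo systemctl restart kubelet
# the previously Running pod is rejected by the restarted kubelet and replaced
$ kubectl get pods -l app=test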

Anything else we need to know?

previous discussion: #101218 (comment)

Kubernetes version

$ kubectl version
Client Version: v1.28.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.31.0-alpha.0.470+7d880fd489d1e5-dirty

Cloud provider

kind cluster

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@gjkim42 gjkim42 added the kind/bug Categorizes issue or PR as related to a bug. label Apr 28, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 28, 2024
@gjkim42
Member Author

gjkim42 commented Apr 28, 2024

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 28, 2024
@gjkim42
Member Author

gjkim42 commented Apr 28, 2024

// If the affinity requirements specified by this field are not met at
// scheduling time, the pod will not be scheduled onto the node.
// If the affinity requirements specified by this field cease to be met
// at some point during pod execution (e.g. due to an update), the system
// may or may not try to eventually evict the pod from its node.
// +optional
RequiredDuringSchedulingIgnoredDuringExecution *NodeSelector `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty" protobuf:"bytes,1,opt,name=requiredDuringSchedulingIgnoredDuringExecution"`

^
The API reference above, on the other hand, says that the system may or may not try to evict the pod due to node affinity.

@gjkim42
Member Author

gjkim42 commented Apr 28, 2024

We can fix either the behavior or the documentation.

The fix itself seems simple, but it may break many existing use cases, so we may need some discussion about this.

@HirazawaUi
Contributor

/cc

@ffromani
Contributor

/triage accepted

In hindsight, the labelling part (xref: #123980 (comment)) deserves its own issue.

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 29, 2024
@AxeZhan
Member

AxeZhan commented Apr 29, 2024

/cc

@Homura222

/cc

@SergeyKanzhelev SergeyKanzhelev added this to Triage in SIG Node Bugs May 1, 2024
@AnishShah AnishShah moved this from Triage to Triaged in SIG Node Bugs May 1, 2024
@AnishShah
Contributor

/assign @gjkim42

@gjkim42
Member Author

gjkim42 commented May 1, 2024

/cc @liggitt @alculquicondor

since we had a similar discussion before in #101218 (comment)

What do you think about this?

@alculquicondor
Member

I guess the definition of scheduling time is a bit fuzzy.

The verification is happening both in the scheduler and in kubelet, at startup, but not during execution.

Yes, we can update the documentation to be more clear, but we can't change the behavior.

@rmohr
Contributor

rmohr commented May 6, 2024

The verification is happening both in the scheduler and in kubelet, at startup, but not during execution.

Yes, we can update the documentation to be more clear, but we can't change the behavior.

Do you mean you cannot change the behaviour to support IgnoredDuringExecution as documented right now? If so, could you elaborate on that a little bit?

@alculquicondor
Member

We are ignoring during execution.
In the case reported, the pod is marked as Failed before execution.

In other words, we are honoring what the API says. But there is an intermediate stage that the API doesn't say anything about: the time between pod scheduled and before it starts executing.

That's what needs to be clarified in the documentation, and it should match the existing behavior that has been there since this feature first launched. We cannot change that behavior or it would be backwards incompatible.

@rmohr
Contributor

rmohr commented May 6, 2024

In the case reported, the pod is marked as Failed before execution.

Ok, trying to rephrase in my own words to be sure I understand:

  1. kubelet stops
  2. pod gets scheduled (but not started yet)
  3. kubelet starts and sees that the pod should not be started here (and is not yet executing)
  4. kubelet stops it

That sounds fair to keep.

@Homura222

Homura222 commented May 9, 2024

If the pod uses preferredDuringSchedulingIgnoredDuringExecution nodeAffinity, the pod will not be killed in this case. Is this expected?

@alculquicondor
Member

Yes

@rmohr
Contributor

rmohr commented May 9, 2024

@alculquicondor, @Homura222 ran some experiments (kubevirt/kubevirt#11843 (comment)) and it turns out that requiredDuringSchedulingIgnoredDuringExecution does, however, evict already-running pods when the kubelet restarts. Is this also expected? This does not seem to fall into this category:

In other words, we are honoring what the API says. But there is an intermediate stage that the API doesn't say anything about: the time between pod scheduled and before it starts executing.

and I would expect that it "ignores" the label changes during the execution phase of the pod. Any thoughts on that?
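
For reference, one way to confirm what the restarted kubelet did to the pod. This is a hedged sketch: it assumes the app=test label from the reproduction above, and that the rejection is recorded in the pod's status.phase/status.reason/status.message fields.

$ kubectl get pods -l app=test
# inspect the rejected pod's status (replace FAILED_POD with the actual pod name)
$ kubectl get pod FAILED_POD -o jsonpath='{.status.phase}{"  "}{.status.reason}{"  "}{.status.message}{"\n"}'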

@gjkim42
Member Author

gjkim42 commented May 10, 2024

In other words, we are honoring what the API says. But there is an intermediate stage that the API doesn't say anything about: the time between pod scheduled and before it starts executing.

Then, is it OK to evict the executing (running) pod only after restarting the kubelet?
This seems like odd behavior to me. If we want to apply the validation right before starting the pod, we should not validate an already-running pod again just because the kubelet restarted.

If we have to document the behavior as it works now: requiredDuringSchedulingIgnoredDuringExecution affects the pod at scheduling time, at pod start time, AND at kubelet restart time. (Kubelet restart time? Why?)

@alculquicondor
Member

@SergeyKanzhelev can you chime in on the kubelet details?
