PV nodeAffinity matchExpressions problem with array items and in operator since 1.27.0 #123465

Open
dsgnr opened this issue Feb 23, 2024 · 18 comments · Fixed by #124567
Labels
kind/bug: Categorizes issue or PR as related to a bug.
needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
sig/scheduling: Categorizes an issue or PR as relevant to SIG Scheduling.
sig/storage: Categorizes an issue or PR as relevant to SIG Storage.

Comments

@dsgnr

dsgnr commented Feb 23, 2024

What happened?

Since v1.27.0, the nodeAffinity for PersistentVolumes appears to apply extra validation on whether the listed values exist, despite using the In operator. This works as expected on 1.26.14 and no longer works from 1.27.0 onwards (tested up to 1.29.2).

I am unsure whether this is an intentional breaking change in 1.27.0 or something wrong in our manifests that we've simply got away with for a few years. The only thing that stands out to me in the changelog for 1.27.0 is some changes to the scheduler's Filter plugin, but that may well be unrelated.

What did you expect to happen?

When using the In operator, I'd expect the expression to only apply to nodes where the match is truthy.

How can we reproduce it (as minimally and precisely as possible)?

Note: using kind for easy reproduction.

v1.26.14 - working

Creating a kind cluster:

cat <<EOF > kind_conf
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  image: kindest/node:v1.26.14
- role: worker
  image: kindest/node:v1.26.14
- role: worker
  image: kindest/node:v1.26.14
EOF

$ kind create cluster --name 12614-test --config kind_conf
$ k get no
NAME                       STATUS   ROLES           AGE   VERSION
12614-test-control-plane   Ready    control-plane   8h    v1.26.14
12614-test-worker          Ready    <none>          8h    v1.26.14
12614-test-worker2         Ready    <none>          8h    v1.26.14
# labelling a node to ensure pod is scheduled on specific worker
$ k label node 12614-test-worker nginx=nginx
cat <<EOF > manifest.yaml
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app: nginx
  name: nginx
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  serviceName: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: nginx
                operator: Exists
      containers:
      - name: nginx
        image: nginx:latest
        imagePullPolicy: IfNotPresent
        volumeMounts:
        - mountPath: /var/lib/nginx
          name: default-nginx-data
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: default-nginx-data
    spec:
      accessModes:
      - ReadWriteMany
      resources:
        requests:
          storage: 5Gi
      storageClassName: my_sc
      volumeMode: Filesystem
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: default-nginx-data-nginx-0
  namespace: default
  labels:
    app: default-nginx-data-nginx-0
spec:
  storageClassName: my_sc
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteMany
  hostPath:
    path: /srv/k8s/default/nginx-0
  claimRef:
    namespace: default
    name: default-nginx-data-nginx-0
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
            - 12614-test-worker
            - 12614-prod-worker
EOF
# Apply the manifest
$ k apply -f manifest.yaml

Since 12614-test-worker exists among the hostnames of this cluster, the pod schedules as expected. The 12614-prod-worker node does not exist in this environment and so is ignored.

$ k get po
NAME      READY   STATUS    RESTARTS   AGE
nginx-0   1/1     Running   0          6m2s
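As an additional sanity check (not part of the original steps), the wide output shows which node the pod landed on:

# the NODE column should show 12614-test-worker, the labelled node that also satisfies the PV nodeAffinity
$ k get po nginx-0 -o wide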

v1.27.0 - not working

Creating a kind cluster:

cat <<EOF > kind_conf
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  image: kindest/node:v1.27.0
- role: worker
  image: kindest/node:v1.27.0
- role: worker
  image: kindest/node:v1.27.0
EOF

$ kind create cluster --name 1270-test --config kind_conf
# labelling a node to ensure pod is scheduled on specific worker
$ k label node 1270-test-worker nginx=nginx
cat <<EOF > manifest.yaml
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app: nginx
  name: nginx
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  serviceName: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: nginx
                operator: Exists
      containers:
      - name: nginx
        image: nginx:latest
        imagePullPolicy: IfNotPresent
        volumeMounts:
        - mountPath: /var/lib/nginx
          name: default-nginx-data
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: default-nginx-data
    spec:
      accessModes:
      - ReadWriteMany
      resources:
        requests:
          storage: 5Gi
      storageClassName: my_sc
      volumeMode: Filesystem
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: default-nginx-data-nginx-0
  namespace: default
  labels:
    app: default-nginx-data-nginx-0
spec:
  storageClassName: my_sc
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteMany
  hostPath:
    path: /srv/k8s/default/nginx-0
  claimRef:
    namespace: default
    name: default-nginx-data-nginx-0
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
            - 1270-test-worker
            - 1270-prod-worker
EOF
# Apply the manifest
$ k apply -f manifest.yaml

Same expectations as on 1.26.14, but in this case the pod stays stuck in the Pending state due to the following:

$ k get no
NAME                      STATUS   ROLES           AGE     VERSION
1270-test-control-plane   Ready    control-plane   5m1s    v1.27.0
1270-test-worker          Ready    <none>          4m34s   v1.27.0
1270-test-worker2         Ready    <none>          4m39s   v1.27.0
$ k get po
NAME      READY   STATUS    RESTARTS   AGE
nginx-0   0/1     Pending   0          5m56s
$ kubectl events --for pod/nginx-0
LAST SEEN   TYPE      REASON             OBJECT        MESSAGE
3m6s        Warning   FailedScheduling   Pod/nginx-0   nodeinfo not found for node name "1270-prod-worker"
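As an extra debugging step (not in the original report; the PV name is the one from the manifest above), the hostnames allowed by the PV's nodeAffinity can be compared against the nodes that actually exist:

# hostnames allowed by the PV nodeAffinity (includes the non-existent 1270-prod-worker)
$ k get pv default-nginx-data-nginx-0 -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms[*].matchExpressions[*].values}'
# nodes that actually exist
$ k get nodes -o name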

I have tested this all the way up to 1.29.2.

Anything else we need to know?

No response

Kubernetes version

v1.27.0 through to v1.29.2

Cloud provider

OS version

[root@stage1 ~]# cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

[root@stage1 ~]# uname -a
Linux stage1.domain.com 3.10.0-1160.108.1.el7.x86_64 #1 SMP Thu Jan 25 16:17:31 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@dsgnr dsgnr added the kind/bug Categorizes issue or PR as related to a bug. label Feb 23, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 23, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@vaibhav2107
Member

/sig storage

@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 23, 2024
@msarfaty

msarfaty commented Feb 23, 2024

In a similar vein, we have noticed the same issue in K8s v1.27 when the kubernetes.io/hostname label doesn't match the node name. We see nodeinfo not found for <hostname>, despite (I think) the nodeinfo util being expected to be called with a node name.
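Not part of the original comment, but a quick way to spot such a mismatch on any cluster is the label-columns flag:

# show each node's name next to its kubernetes.io/hostname label
$ kubectl get nodes -L kubernetes.io/hostname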

@PrasanaaV

/sig scheduling

@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Feb 26, 2024
@chengjoey
Contributor

chengjoey commented Apr 18, 2024

This is because the non-existent node 1270-prod-worker is looked up in the scheduler's node snapshot (cache); this has been fixed by #119779.

Before #119779, getting the node info would return an error:

nodes := allNodes
if !preRes.AllNodes() {
    nodes = make([]*framework.NodeInfo, 0, len(preRes.NodeNames))
    for n := range preRes.NodeNames {
        nInfo, err := sched.nodeInfoSnapshot.NodeInfos().Get(n)
        if err != nil {
            return nil, diagnosis, err
        }
        nodes = append(nodes, nInfo)
    }
}

After #119779, it is fixed:

nodes := allNodes
if !preRes.AllNodes() {
    nodes = make([]*framework.NodeInfo, 0, len(preRes.NodeNames))
    for _, n := range allNodes {
        if !preRes.NodeNames.Has(n.Node().Name) {
            // We consider Nodes that are filtered out by PreFilterResult as rejected via UnschedulableAndUnresolvable.
            // We have to record them in NodeToStatusMap so that they won't be considered as candidates in the preemption.
            diagnosis.NodeToStatusMap[n.Node().Name] = framework.NewStatus(framework.UnschedulableAndUnresolvable, "node is filtered out by the prefilter result")
            continue
        }
        nodes = append(nodes, n)
    }
}

Since version 1.30, it runs normally.

@sanposhiho PTAL, do we need to cherry-pick this to previous versions?

@sanposhiho
Member

sanposhiho commented Apr 20, 2024

Looks like my PR #119779 accidentally fixed this issue. 😅

380c7f2 is the trigger of this issue, introduced in v1.27: the volumebinding plugin just returns node names in PreFilterResult based on the nodeAffinity in the PV, and thus PreFilterResult could contain node names that don't actually exist.


@kubernetes/sig-scheduling-leads

One Line Summary: If the nodeAffinity in a PV contains a non-existent node name in an In expression on kubernetes.io/hostname, scheduling always fails.
And this bug has already been coincidentally fixed by #119779.

Questions:

@alculquicondor
Member

If it's a regression, we should consider a cherry-pick.

🤔
#119779 looks big, but the actual code changes are about 10 lines (the rest are comments and tests), so I think it would be ok to cherry-pick.

Can you explain why your PR fixes the problem?

@sanposhiho
Member

Can you explain why your PR fixes the problem?

Before my PR, we took the NodeInfo of the nodes in preFilterResult via sched.nodeInfoSnapshot.NodeInfos().Get(n) (ref). Consequently, we could get not found errors from it when preFilterResult contained non-existent node name(s).

And after my PR, we no longer use sched.nodeInfoSnapshot.NodeInfos().Get(n).

@Huang-Wei
Member

Huang-Wei commented Apr 25, 2024

The pitfall is that when PreFilterResult was introduced, the scheduler directly fetched the nodeNames returned by PreFilter() from its node snapshot, so technically speaking, when a plugin (in-tree or out-of-tree) returns non-existent/illegal nodes, the pod scheduling flow aborts immediately.

And the PR #109877 happened to stomp on this pitfall.

@alculquicondor
Member

@Huang-Wei do you mean that you would like to see a smaller version of #119779 for cherry-pick? Note that it is only about 10 lines of actual change. The rest is comments and tests.

Since this is already released code, I would suggest cherry-picking as-is.

Also, worth testing E2E whether the issue reproduces in v1.30.

@dsgnr do you think you can give it a try?
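A minimal sketch of that check, following the reproduction above (the kindest/node tag and cluster name below are assumptions; use whichever v1.30 image is published):

cat <<EOF > kind_conf
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  image: kindest/node:v1.30.0
- role: worker
  image: kindest/node:v1.30.0
- role: worker
  image: kindest/node:v1.30.0
EOF

$ kind create cluster --name 1300-test --config kind_conf
# label a worker so the pod's node affinity (nginx Exists) matches, as in the original repro
$ k label node 1300-test-worker nginx=nginx
# re-apply the manifest from the reproduction, with 1300-test-worker / 1300-prod-worker in the PV nodeAffinity values
$ k apply -f manifest.yaml
# the pod should reach Running if the fix from #119779 holds
$ k get po nginx-0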

@Huang-Wei
Member

@Huang-Wei do you mean that you would like to see a smaller version of #119779 for cherry-pick? Note that it is only about 10 lines of actual change. The rest is comments and tests.

yup, i'm in favor of picking the minimum fix, along with a minimum test #124539:

		for _, n := range allNodes {
			if !preRes.NodeNames.Has(n.Node().Name) {
				// We consider Nodes that are filtered out by PreFilterResult as rejected via UnschedulableAndUnresolvable.
				// We have to record them in NodeToStatusMap so that they won't be considered as candidates in the preemption.
				diagnosis.NodeToStatusMap[n.Node().Name] = framework.NewStatus(framework.UnschedulableAndUnresolvable, "node is filtered out by the prefilter result")
				continue
			}
			nodes = append(nodes, n)

b/c in addition to the "fix" part, #119779 actually introduced a slight behavior change. @sanposhiho do you think the above ^^ minimum fix + #124539 makes sense?

@Huang-Wei
Member

Also, worth testing E2E whether the issue reproduces in v1.30.

#124539 can guard it.

@chengjoey
Contributor

chengjoey commented Apr 26, 2024

@Huang-Wei do you mean that you would like to see a smaller version of #119779 for cherry-pick? Note that it is only about 10 lines of actual change. The rest is comments and tests.

yup, i'm in favor of picking the minimum fix, along with a minimum test #124539: [...]

@Huang-Wei @alculquicondor @sanposhiho Could you let me try it? I'd pick the minimum fix and add the e2e test.

@chengjoey
Contributor

A simplified version of the sample is used for testing, manifest.yaml:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: data
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  hostPath:
    path: /tmp/data
  claimRef:
    namespace: default
    name: data
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - 1280-test-worker
            - 1280-prod-worker

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
  namespace: default
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
  volumeName: data
---
apiVersion: v1
kind: Pod
metadata:
  name: pause
  labels:
    app: pause
spec:
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.6
    volumeMounts:
    - mountPath: /tmp
      name: data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: data

This also produces the FailedScheduling error; kubectl describe pod pause:
[screenshot: FailedScheduling event from kubectl describe pod pause]
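For a plain-text version of the same event (not in the original comment; this mirrors the kubectl events command used earlier in the thread):

$ kubectl events --for pod/pause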

@sanposhiho
Member

@sanposhiho do you think the above ^^ minimum fix + #124539 makes sense?

@Huang-Wei That small fix should work.
I don't have a strong opinion on whether we should have a minimum fix vs cherry-pick as it is.
Defer to you both @alculquicondor @Huang-Wei on that point.

@alculquicondor
Member

Let's move ahead with the small fix, since we all agree on it.

@sanposhiho
Member

/reopen

Cherry-pick PR(s) aren't done

@k8s-ci-robot
Contributor

@sanposhiho: Reopened this issue.

In response to this:

/reopen

Cherry-pick PR(s) aren't done

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
