Some nodes are not considered in scheduling when there is zone imbalance #91601

Closed
zetaab opened this issue May 30, 2020 · 129 comments · Fixed by #93355 or #93473
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

@zetaab
Member

zetaab commented May 30, 2020

What happened: We upgraded 15 Kubernetes clusters from 1.17.5 to 1.18.2/1.18.3 and started to see that DaemonSets no longer work properly.

The problem is that not all DaemonSet pods get provisioned. The following error message shows up in the events:

Events:
  Type     Reason            Age               From               Message
  ----     ------            ----              ----               -------
  Warning  FailedScheduling  9s (x5 over 71s)  default-scheduler  0/13 nodes are available: 12 node(s) didn't match node selector.

However, all nodes are available, the DaemonSet has no node selector, and the nodes have no taints either.

daemonset https://gist.github.com/zetaab/4a605cb3e15e349934cb7db29ec72bd8

% kubectl get nodes
NAME                                   STATUS   ROLES    AGE   VERSION
e2etest-1-kaasprod-k8s-local           Ready    node     46h   v1.18.3
e2etest-2-kaasprod-k8s-local           Ready    node     46h   v1.18.3
e2etest-3-kaasprod-k8s-local           Ready    node     44h   v1.18.3
e2etest-4-kaasprod-k8s-local           Ready    node     44h   v1.18.3
master-zone-1-1-1-kaasprod-k8s-local   Ready    master   47h   v1.18.3
master-zone-2-1-1-kaasprod-k8s-local   Ready    master   47h   v1.18.3
master-zone-3-1-1-kaasprod-k8s-local   Ready    master   47h   v1.18.3
nodes-z1-1-kaasprod-k8s-local          Ready    node     47h   v1.18.3
nodes-z1-2-kaasprod-k8s-local          Ready    node     47h   v1.18.3
nodes-z2-1-kaasprod-k8s-local          Ready    node     46h   v1.18.3
nodes-z2-2-kaasprod-k8s-local          Ready    node     46h   v1.18.3
nodes-z3-1-kaasprod-k8s-local          Ready    node     47h   v1.18.3
nodes-z3-2-kaasprod-k8s-local          Ready    node     46h   v1.18.3

% kubectl get pods -n weave -l weave-scope-component=agent -o wide
NAME                      READY   STATUS    RESTARTS   AGE     IP           NODE                                   NOMINATED NODE   READINESS GATES
weave-scope-agent-2drzw   1/1     Running   0          26h     10.1.32.23   e2etest-1-kaasprod-k8s-local           <none>           <none>
weave-scope-agent-4kpxc   1/1     Running   3          26h     10.1.32.12   nodes-z1-2-kaasprod-k8s-local          <none>           <none>
weave-scope-agent-78n7r   1/1     Running   0          26h     10.1.32.7    e2etest-4-kaasprod-k8s-local           <none>           <none>
weave-scope-agent-9m4n8   1/1     Running   0          26h     10.1.96.4    master-zone-1-1-1-kaasprod-k8s-local   <none>           <none>
weave-scope-agent-b2gnk   1/1     Running   1          26h     10.1.96.12   master-zone-3-1-1-kaasprod-k8s-local   <none>           <none>
weave-scope-agent-blwtx   1/1     Running   2          26h     10.1.32.20   nodes-z1-1-kaasprod-k8s-local          <none>           <none>
weave-scope-agent-cbhjg   1/1     Running   0          26h     10.1.64.15   e2etest-2-kaasprod-k8s-local           <none>           <none>
weave-scope-agent-csp49   1/1     Running   0          26h     10.1.96.14   e2etest-3-kaasprod-k8s-local           <none>           <none>
weave-scope-agent-g4k2x   1/1     Running   1          26h     10.1.64.10   nodes-z2-2-kaasprod-k8s-local          <none>           <none>
weave-scope-agent-kx85h   1/1     Running   2          26h     10.1.96.6    nodes-z3-1-kaasprod-k8s-local          <none>           <none>
weave-scope-agent-lllqc   0/1     Pending   0          5m56s   <none>       <none>                                 <none>           <none>
weave-scope-agent-nls2h   1/1     Running   0          26h     10.1.96.17   master-zone-2-1-1-kaasprod-k8s-local   <none>           <none>
weave-scope-agent-p8njs   1/1     Running   2          26h     10.1.96.19   nodes-z3-2-kaasprod-k8s-local          <none>           <none>

I have tried restarting the apiservers/schedulers/controller-managers, but it does not help. I have also tried restarting the single node that is stuck (nodes-z2-1-kaasprod-k8s-local), but that does not help either. Only deleting the node and recreating it helps.

% kubectl describe node nodes-z2-1-kaasprod-k8s-local
Name:               nodes-z2-1-kaasprod-k8s-local
Roles:              node
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=59cf4871-de1b-4294-9e9f-2ea7ca4b771f
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=regionOne
                    failure-domain.beta.kubernetes.io/zone=zone-2
                    kops.k8s.io/instancegroup=nodes-z2
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=nodes-z2-1-kaasprod-k8s-local
                    kubernetes.io/os=linux
                    kubernetes.io/role=node
                    node-role.kubernetes.io/node=
                    node.kubernetes.io/instance-type=59cf4871-de1b-4294-9e9f-2ea7ca4b771f
                    topology.cinder.csi.openstack.org/zone=zone-2
                    topology.kubernetes.io/region=regionOne
                    topology.kubernetes.io/zone=zone-2
Annotations:        csi.volume.kubernetes.io/nodeid: {"cinder.csi.openstack.org":"faf14d22-010f-494a-9b34-888bdad1d2df"}
                    node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4Address: 10.1.64.32/19
                    projectcalico.org/IPv4IPIPTunnelAddr: 100.98.136.0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Thu, 28 May 2020 13:28:24 +0300
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  nodes-z2-1-kaasprod-k8s-local
  AcquireTime:     <unset>
  RenewTime:       Sat, 30 May 2020 12:02:13 +0300
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Fri, 29 May 2020 09:40:51 +0300   Fri, 29 May 2020 09:40:51 +0300   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Sat, 30 May 2020 11:59:53 +0300   Fri, 29 May 2020 09:40:45 +0300   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Sat, 30 May 2020 11:59:53 +0300   Fri, 29 May 2020 09:40:45 +0300   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Sat, 30 May 2020 11:59:53 +0300   Fri, 29 May 2020 09:40:45 +0300   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Sat, 30 May 2020 11:59:53 +0300   Fri, 29 May 2020 09:40:45 +0300   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  10.1.64.32
  Hostname:    nodes-z2-1-kaasprod-k8s-local
Capacity:
  cpu:                4
  ephemeral-storage:  10287360Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             8172420Ki
  pods:               110
Allocatable:
  cpu:                4
  ephemeral-storage:  9480830961
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             8070020Ki
  pods:               110
System Info:
  Machine ID:                 c94284656ff04cf090852c1ddee7bcc2
  System UUID:                faf14d22-010f-494a-9b34-888bdad1d2df
  Boot ID:                    295dc3d9-0a90-49ee-92f3-9be45f2f8e3d
  Kernel Version:             4.19.0-8-cloud-amd64
  OS Image:                   Debian GNU/Linux 10 (buster)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://19.3.8
  Kubelet Version:            v1.18.3
  Kube-Proxy Version:         v1.18.3
PodCIDR:                      100.96.12.0/24
PodCIDRs:                     100.96.12.0/24
ProviderID:                   openstack:///faf14d22-010f-494a-9b34-888bdad1d2df
Non-terminated Pods:          (3 in total)
  Namespace                   Name                                        CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                   ----                                        ------------  ----------  ---------------  -------------  ---
  kube-system                 calico-node-77pqs                           100m (2%)     200m (5%)   100Mi (1%)       100Mi (1%)     46h
  kube-system                 kube-proxy-nodes-z2-1-kaasprod-k8s-local    100m (2%)     200m (5%)   100Mi (1%)       100Mi (1%)     46h
  volume                      csi-cinder-nodeplugin-5jbvl                 100m (2%)     400m (10%)  200Mi (2%)       200Mi (2%)     46h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                300m (7%)   800m (20%)
  memory             400Mi (5%)  400Mi (5%)
  ephemeral-storage  0 (0%)      0 (0%)
Events:
  Type    Reason                   Age    From                                    Message
  ----    ------                   ----   ----                                    -------
  Normal  Starting                 7m27s  kubelet, nodes-z2-1-kaasprod-k8s-local  Starting kubelet.
  Normal  NodeHasSufficientMemory  7m26s  kubelet, nodes-z2-1-kaasprod-k8s-local  Node nodes-z2-1-kaasprod-k8s-local status is now: NodeHasSufficientMemory
  Normal  NodeHasNoDiskPressure    7m26s  kubelet, nodes-z2-1-kaasprod-k8s-local  Node nodes-z2-1-kaasprod-k8s-local status is now: NodeHasNoDiskPressure
  Normal  NodeHasSufficientPID     7m26s  kubelet, nodes-z2-1-kaasprod-k8s-local  Node nodes-z2-1-kaasprod-k8s-local status is now: NodeHasSufficientPID
  Normal  NodeAllocatableEnforced  7m26s  kubelet, nodes-z2-1-kaasprod-k8s-local  Updated Node Allocatable limit across pods

We are seeing this randomly in all of our clusters.

What you expected to happen: I expect the DaemonSet to provision pods to all nodes.

How to reproduce it (as minimally and precisely as possible): No real idea; install Kubernetes 1.18.x, deploy a DaemonSet, and then wait for days(?).

Anything else we need to know?: When this happens we cannot provision any other DaemonSets to that node either. As you can see, the logging fluent-bit pod is also missing. I cannot see any errors in that node's kubelet logs and, as said, restarting does not help.

% kubectl get ds --all-namespaces
NAMESPACE     NAME                       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
falco         falco-daemonset            13        13        12      13           12          <none>                            337d
kube-system   audit-webhook-deployment   3         3         3       3            3           node-role.kubernetes.io/master=   174d
kube-system   calico-node                13        13        13      13           13          kubernetes.io/os=linux            36d
kube-system   kops-controller            3         3         3       3            3           node-role.kubernetes.io/master=   193d
kube-system   metricbeat                 6         6         5       6            5           <none>                            35d
kube-system   openstack-cloud-provider   3         3         3       3            3           node-role.kubernetes.io/master=   337d
logging       fluent-bit                 13        13        12      13           12          <none>                            337d
monitoring    node-exporter              13        13        12      13           12          kubernetes.io/os=linux            58d
volume        csi-cinder-nodeplugin      6         6         6       6            6           <none>                            239d
weave         weave-scope-agent          13        13        12      13           12          <none>                            193d
weave         weavescope-iowait-plugin   6         6         5       6            5           <none>                            193d

As you can see, most of the DaemonSets are missing one pod.

Environment:

  • Kubernetes version (use kubectl version): 1.18.3
  • Cloud provider or hardware configuration: openstack
  • OS (e.g: cat /etc/os-release): debian buster
  • Kernel (e.g. uname -a): Linux nodes-z2-1-kaasprod-k8s-local 4.19.0-8-cloud-amd64 #1 SMP Debian 4.19.98-1+deb10u1 (2020-04-27) x86_64 GNU/Linux
  • Install tools: kops
  • Network plugin and version (if this is a network-related bug): calico
  • Others:
@zetaab zetaab added the kind/bug Categorizes issue or PR as related to a bug. label May 30, 2020
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label May 30, 2020
@zetaab
Member Author

zetaab commented May 30, 2020

/sig scheduling

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 30, 2020
@liggitt
Member

liggitt commented May 30, 2020

Can you provide the full yaml of the node, daemonset, an example pod, and the containing namespace as retrieved from the server?

@liggitt liggitt added the triage/needs-information Indicates an issue needs more information in order to work on it. label May 30, 2020
@liggitt
Member

liggitt commented May 30, 2020

DaemonSet pods schedule with a nodeAffinity selector that only matches a single node, so the "12 out of 13 didn't match" message is expected.
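For context, the DaemonSet controller hands its pods to the default scheduler and pins each one to its target node with a node affinity on the node name. The sketch below shows that kind of affinity with simplified names and an example node taken from this report; it is not the exact controller code.

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// daemonSetNodeAffinity builds the kind of per-node affinity the DaemonSet
// controller attaches to each of its pods, so the pod can only land on the
// one node it was created for. Sketch for illustration only.
func daemonSetNodeAffinity(nodeName string) *v1.Affinity {
	return &v1.Affinity{
		NodeAffinity: &v1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &v1.NodeSelector{
				NodeSelectorTerms: []v1.NodeSelectorTerm{{
					MatchFields: []v1.NodeSelectorRequirement{{
						Key:      "metadata.name", // a field selector, not a label
						Operator: v1.NodeSelectorOpIn,
						Values:   []string{nodeName},
					}},
				}},
			},
		},
	}
}

func main() {
	// nodes-z2-1-kaasprod-k8s-local is the stuck node from the report above.
	fmt.Printf("%+v\n", daemonSetNodeAffinity("nodes-z2-1-kaasprod-k8s-local"))
}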

@liggitt
Member

liggitt commented May 30, 2020

I don't see a reason why the scheduler would be unhappy with the pod/node combo… there are no ports that could conflict in the pod spec, and the node is not unschedulable or tainted and has sufficient resources.

@zetaab
Member Author

zetaab commented May 30, 2020

Okay, I restarted all 3 schedulers (and changed the log level to 4 in case we can see something interesting there). However, that fixed the issue:

% kubectl get ds --all-namespaces
NAMESPACE     NAME                       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
falco         falco-daemonset            13        13        13      13           13          <none>                            338d
kube-system   audit-webhook-deployment   3         3         3       3            3           node-role.kubernetes.io/master=   175d
kube-system   calico-node                13        13        13      13           13          kubernetes.io/os=linux            36d
kube-system   kops-controller            3         3         3       3            3           node-role.kubernetes.io/master=   194d
kube-system   metricbeat                 6         6         6       6            6           <none>                            36d
kube-system   openstack-cloud-provider   3         3         3       3            3           node-role.kubernetes.io/master=   338d
logging       fluent-bit                 13        13        13      13           13          <none>                            338d
monitoring    node-exporter              13        13        13      13           13          kubernetes.io/os=linux            59d
volume        csi-cinder-nodeplugin      6         6         6       6            6           <none>                            239d
weave         weave-scope-agent          13        13        13      13           13          <none>                            194d
weave         weavescope-iowait-plugin   6         6         6       6            6           <none>                            194d

Now all DaemonSets are provisioned correctly. Weird; anyway, it seems something is wrong with the scheduler.

@liggitt liggitt removed the triage/needs-information Indicates an issue needs more information in order to work on it. label May 30, 2020
@liggitt
Member

liggitt commented May 30, 2020

cc @kubernetes/sig-scheduling-bugs @ahg-g

@jejer

jejer commented Jun 1, 2020

We see a similar issue on v1.18.3: one node cannot be scheduled for DaemonSet pods.
Restarting the scheduler helps.

[root@tesla-cb0434-csfp1-csfp1-control-03 ~]# kubectl get pod -A|grep Pending
kube-system   coredns-vc5ws                                                 0/1     Pending   0          2d16h
kube-system   local-volume-provisioner-mwk88                                0/1     Pending   0          2d16h
kube-system   svcwatcher-ltqb6                                              0/1     Pending   0          2d16h
ncms          bcmt-api-hfzl6                                                0/1     Pending   0          2d16h
ncms          bcmt-yum-repo-589d8bb756-5zbvh                                0/1     Pending   0          2d16h
[root@tesla-cb0434-csfp1-csfp1-control-03 ~]# kubectl get ds -A
NAMESPACE     NAME                       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                   AGE
kube-system   coredns                    3         3         2       3            2           is_control=true                 2d16h
kube-system   danmep-cleaner             0         0         0       0            0           cbcs.nokia.com/danm_node=true   2d16h
kube-system   kube-proxy                 8         8         8       8            8           <none>                          2d16h
kube-system   local-volume-provisioner   8         8         7       8            7           <none>                          2d16h
kube-system   netwatcher                 0         0         0       0            0           cbcs.nokia.com/danm_node=true   2d16h
kube-system   sriov-device-plugin        0         0         0       0            0           sriov=enabled                   2d16h
kube-system   svcwatcher                 3         3         2       3            2           is_control=true                 2d16h
ncms          bcmt-api                   3         3         0       3            0           is_control=true                 2d16h
[root@tesla-cb0434-csfp1-csfp1-control-03 ~]# kubectl get node
NAME                                  STATUS   ROLES    AGE     VERSION
tesla-cb0434-csfp1-csfp1-control-01   Ready    <none>   2d16h   v1.18.3
tesla-cb0434-csfp1-csfp1-control-02   Ready    <none>   2d16h   v1.18.3
tesla-cb0434-csfp1-csfp1-control-03   Ready    <none>   2d16h   v1.18.3
tesla-cb0434-csfp1-csfp1-edge-01      Ready    <none>   2d16h   v1.18.3
tesla-cb0434-csfp1-csfp1-edge-02      Ready    <none>   2d16h   v1.18.3
tesla-cb0434-csfp1-csfp1-worker-01    Ready    <none>   2d16h   v1.18.3
tesla-cb0434-csfp1-csfp1-worker-02    Ready    <none>   2d16h   v1.18.3
tesla-cb0434-csfp1-csfp1-worker-03    Ready    <none>   2d16h   v1.18.3

@ahg-g
Member

ahg-g commented Jun 1, 2020

Hard to debug without knowing how to reproduce. Do you by any chance have the scheduler logs for the pod that failed to schedule?

@ahg-g
Member

ahg-g commented Jun 1, 2020

Okay I restarted all 3 schedulers

I assume only one of them is named default-scheduler, correct?

changed loglevel to 4 if we can see something interesting there

Can you share what you noticed?

@jejer

jejer commented Jun 1, 2020

I set the log level to 9, but it seems there is nothing more interesting; the logs below are looping:

I0601 01:45:05.039373       1 generic_scheduler.go:290] Preemption will not help schedule pod kube-system/coredns-vc5ws on any node.
I0601 01:45:05.039437       1 factory.go:462] Unable to schedule kube-system/coredns-vc5ws: no fit: 0/8 nodes are available: 7 node(s) didn't match node selector.; waiting
I0601 01:45:05.039494       1 scheduler.go:776] Updating pod condition for kube-system/coredns-vc5ws to (PodScheduled==False, Reason=Unschedulable)

@zetaab
Member Author

zetaab commented Jun 1, 2020

Yeah, I could not see anything more than the same line:

no fit: 0/8 nodes are available: 7 node(s) didn't match node selector.; waiting

@ahg-g
Member

ahg-g commented Jun 1, 2020

The strange thing is that the log message shows the result for 7 nodes only, like the issue reported in #91340.

@ahg-g
Member

ahg-g commented Jun 1, 2020

/cc @damemi

@damemi
Contributor

damemi commented Jun 1, 2020

@ahg-g this does look like the same issue I reported there. If I had to guess, it seems like we either have a filter plugin that doesn't always report its error or some other condition that is failing silently.

@damemi
Contributor

damemi commented Jun 1, 2020

Note that in my issue, restarting the scheduler also fixed it (as mentioned in this thread too: #91601 (comment)).

Mine was also about a DaemonSet, so I think this is a duplicate. If that's the case, we can close this one and continue the discussion in #91340.

@zetaab
Member Author

zetaab commented Jun 1, 2020

Anyway, the scheduler needs a more verbose logging option; it's impossible to debug these issues if there are no logs about what it does.

@damemi
Contributor

damemi commented Jun 1, 2020

@zetaab +1, the scheduler could use significant improvements to its current logging abilities. That's an upgrade I've been meaning to tackle for a while, and I've finally opened an issue for it here: #91633

@alculquicondor
Member

/assign

I'm looking into this. A few questions to help me narrow down the case; I haven't been able to reproduce it yet.

  • What was created first: the daemonset or the node?
  • Are you using the default profile?

@alculquicondor
Member

  • Do you have extenders?

@jejer

jejer commented Jun 9, 2020

The nodes were created before the DaemonSet.
I suppose we used the default profile; which profile do you mean and how do I check?
No extenders.

    command:
    - /usr/local/bin/kube-scheduler
    - --address=127.0.0.1
    - --kubeconfig=/etc/kubernetes/kube-scheduler.kubeconfig
    - --profiling=false
    - --v=1

Another thing that may have an impact: the disk performance is not very good for etcd; etcd complains about slow operations.

@maelk
Contributor

maelk commented Jul 22, 2020

Yes, you can run it in the unit tests by adding the small piece of code I posted.

@maelk
Contributor

maelk commented Jul 22, 2020

I am now working on adding a test case for the snapshot, to make sure this is properly tested.

@maelk
Contributor

maelk commented Jul 22, 2020

Big thumbs up to @igraecao for the help in reproducing the issue and running the tests in his setup.

@ahg-g
Member

ahg-g commented Jul 22, 2020

Thanks all for debugging this notorious issue. Resetting the index before creating the list is safe, so I think we should go with that for 1.18 and 1.19 patches, and have a proper fix in the master branch.

The purpose of the next function changed with the introduction of the NodeInfoList, so we can certainly simplify it and perhaps change it to toList, a function that creates a list from the tree and simply starts from the beginning every time.

@alculquicondor
Member

I understand the issue now: The calculation of whether or not a zone is exhausted is wrong because it doesn't consider where in each zone we started this "UpdateSnapshot" process. And yeah, it would only be visible with uneven zones.

Great job spotting this @maelk!

I would think we have the same issue in older versions; however, it is hidden by the fact that we do a tree pass every time, whereas in 1.18 we keep the snapshotted result until there are changes in the tree.

Now that the round-robin strategy is implemented in generic_scheduler.go, we might be fine with simply resetting all counters before UpdateSnapshot, as your PR is doing.

g.nextStartNodeIndex = (g.nextStartNodeIndex + processedNodes) % len(allNodes)
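For reference, a minimal standalone sketch of that round-robin offset, with hypothetical names rather than the actual generic_scheduler.go code: because the scheduler may stop filtering once it has found enough feasible nodes, it remembers how far it got and starts the next pod's search from that offset.

package main

import "fmt"

// roundRobinScheduler is a sketch, not the real scheduler struct: it only
// keeps the offset at which the next pod's node search should start.
type roundRobinScheduler struct {
	nextStartNodeIndex int
}

// filterNodes scans nodes starting at the rotating offset and stops once it
// has found `want` feasible nodes, mirroring the fact that the scheduler may
// only evaluate a fraction of the cluster for each pod.
func (s *roundRobinScheduler) filterNodes(allNodes []string, want int, feasible func(string) bool) []string {
	var result []string
	processed := 0
	for i := 0; i < len(allNodes) && len(result) < want; i++ {
		node := allNodes[(s.nextStartNodeIndex+i)%len(allNodes)]
		processed++
		if feasible(node) {
			result = append(result, node)
		}
	}
	// Advance the offset by the number of nodes visited, so the next pod
	// starts its search from a different node.
	s.nextStartNodeIndex = (s.nextStartNodeIndex + processed) % len(allNodes)
	return result
}

func main() {
	s := &roundRobinScheduler{}
	nodes := []string{"n1", "n2", "n3", "n4"}
	always := func(string) bool { return true }
	fmt.Println(s.filterNodes(nodes, 2, always)) // [n1 n2]
	fmt.Println(s.filterNodes(nodes, 2, always)) // [n3 n4]
}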

Just to double check, @ahg-g: this should be fine even in a cluster where new nodes are added/removed all the time, right?

@Huang-Wei
Member

Thanks @maelk for spotting the root cause!

The purpose of the next function changed with the introduction of the NodeInfoList, so we can certainly simplify it and perhaps change it to toList, a function that creates a list from the tree and simply starts from the beginning every time.

Given that cache.nodeTree.next() is only called in building the snapshot nodeInfoList, I think it's also safe to remove the indexes (both zoneIndex and nodeIndex) from nodeTree struct. Instead, come up with a simple nodeIterator() function to iterate through its zone/node in a round-robin manner.
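A minimal sketch of what such a round-robin iteration could look like when it is rebuilt from scratch for every snapshot instead of relying on persistent indexes (hypothetical names, not the actual nodeTree code):

package main

import "fmt"

// zonedNodeList flattens the zones into a single list while alternating
// between zones, always starting from the beginning; exhausted zones are
// simply skipped. Sketch only, not the real nodeTree implementation.
func zonedNodeList(zones []string, nodesPerZone map[string][]string) []string {
	var out []string
	for i := 0; ; i++ {
		added := false
		for _, zone := range zones {
			if nodes := nodesPerZone[zone]; i < len(nodes) {
				out = append(out, nodes[i])
				added = true
			}
		}
		if !added {
			return out // every zone is exhausted
		}
	}
}

func main() {
	zones := []string{"zone-1", "zone-2"}
	nodes := map[string][]string{
		"zone-1": {"n11", "n12"},
		"zone-2": {"n21", "n22", "n23"}, // uneven zones
	}
	fmt.Println(zonedNodeList(zones, nodes)) // [n11 n21 n12 n22 n23]
}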

@Huang-Wei
Member

BTW, there is a typo in #91601 (comment); the case should be:

{
	name:           "add nodes to a new and to an exhausted zone",
	nodesToAdd:     append(allNodes[6:9], allNodes[3]),
	nodesToRemove:  nil,
	operations:     []string{"add", "add", "next", "next", "add", "add", "next", "next", "next", "next"},
	expectedOutput: []string{"node-6", "node-7", "node-3", "node-8", "node-6", "node-7"},
	// with the codebase on master and 1.18, its output is [node-6 node-7 node-3 node-8 node-6 node-3]
},

@ahg-g
Member

ahg-g commented Jul 23, 2020

Just to double check, @ahg-g: this should be fine even in a cluster where new nodes are added/removed all the time, right?

I am assuming you are talking about the logic in generic_scheduler.go; if so, yes, it doesn't matter much whether nodes were added or removed. The main thing we need to avoid is iterating over the nodes in the same order every time we schedule a pod; we just need a good approximation of iterating over the nodes across pods.

@ahg-g
Member

ahg-g commented Jul 23, 2020

Given that cache.nodeTree.next() is only called in building the snapshot nodeInfoList, I think it's also safe to remove the indexes (both zoneIndex and nodeIndex) from nodeTree struct. Instead, come up with a simple nodeIterator() function to iterate through its zone/node in a round-robin manner.

yes, we just need to iterate over all zones/nodes in the same order every time.

@maelk
Contributor

maelk commented Jul 23, 2020

I have updated the PR with a unit test for the function that updates the snapshot list, specifically for that bug. I can also take care of refactoring the next() function to iterate over the zones and nodes without the round-robin, hence removing the issue.

@ahg-g
Member

ahg-g commented Jul 23, 2020

Thanks, sounds good, but we should still iterate between zones the same way we do now, that is by design.

@maelk
Contributor

maelk commented Jul 23, 2020

I don't really get what you mean here. Is it that the order of the nodes matters and we must still go round-robin between zones, or can we list all the nodes of one zone, then the next zone, and so on? Let's say you have two zones of two nodes each: in which order do you expect them, or does it even matter at all?

@ahg-g
Member

ahg-g commented Jul 23, 2020

The order matters: we need to alternate between zones while creating the list. If you have two zones of two nodes each, z1: {n11, n12} and z2: {n21, n22}, then the list should be {n11, n21, n12, n22}.

@maelk
Contributor

maelk commented Jul 23, 2020

OK, thanks, I'll give it a thought. Can we meanwhile proceed with the quick fix? BTW, some tests are failing on it, but I am not sure how that relates to my PR.

@ahg-g
Member

ahg-g commented Jul 23, 2020

Those are flakes. Please send a patch to 1.18 as well.

@maelk
Contributor

maelk commented Jul 23, 2020

Ok, will do. Thanks

@soulxu
Contributor

soulxu commented Jul 29, 2020

{
	name:           "add nodes to a new and to an exhausted zone",
	nodesToAdd:     append(allNodes[5:9], allNodes[3]),
	nodesToRemove:  nil,
	operations:     []string{"add", "add", "next", "next", "add", "add", "add", "next", "next", "next", "next"},
	expectedOutput: []string{"node-6", "node-7", "node-3", "node-8", "node-6", "node-7"},
},

@maelk, do you mean this test ignores 'node-5'?

I found that after fixing the append in #93516, the test shows that all the nodes can be iterated:

{
			name:           "add nodes to a new and to an exhausted zone",
			nodesToAdd:     append(append(make([]*v1.Node, 0), allNodes[5:9]...), allNodes[3]),
			nodesToRemove:  nil,
			operations:     []string{"add", "add", "next", "next", "add", "add", "add", "next", "next", "next", "next"},
			expectedOutput: []string{"node-5", "node-6", "node-3", "node-7", "node-8", "node-5"},
},

Nodes node-5, 6, 7, 8, and 3 can all be iterated.

Forgive me if I misunderstand something here.

@maelk
Contributor

maelk commented Jul 29, 2020

Yes, it was on purpose, based on what was there, but I can see how this can be cryptic, so it is better to make the append behave in a clearer way. Thanks for the patch.

@judgeaxl

How far back do you believe this bug was present? 1.17? 1.16? I've just seen the exact same problem in 1.17 on AWS and restarting the unscheduled node fixed the problem.

@alculquicondor
Member

@judgeaxl could you provide more details? Log lines, cache dumps, etc., so we can determine whether the issue is the same.

As I noted in #91601 (comment), I believe this bug was present in older versions, but my thinking is that it's transient.

@maelk would you be able to investigate?

@alculquicondor
Member

Please also share the distribution of nodes in the zones.

@maelk
Contributor

maelk commented Sep 14, 2020

@alculquicondor unfortunately I can't at this point. Sorry.

@judgeaxl

@alculquicondor sorry, I already rebuilt the cluster for other reasons, but it may have been a network configuration problem related to multi-AZ deployments and to which subnet the faulty node got launched in, so I wouldn't worry about it for now in the context of this issue. If I notice it again I'll report back with better details. Thanks!

@alculquicondor
Member

/retitle Some nodes are not considered in scheduling when there is zone imbalance

@k8s-ci-robot k8s-ci-robot changed the title Daemonset does not provision to all nodes, 0 nodes available Some nodes are not considered in scheduling when there is zone imbalance Sep 15, 2020