Some nodes are not considered in scheduling when there is zone imbalance #91601

Closed
zetaab opened this issue May 30, 2020 · 129 comments · Fixed by #93355 or #93473
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

@zetaab
Member

zetaab commented May 30, 2020

What happened: We upgraded 15 Kubernetes clusters from 1.17.5 to 1.18.2/1.18.3 and started to see that DaemonSets no longer work properly.

The problem is that not all DaemonSet pods get provisioned. The following error message shows up in the events:

Events:
  Type     Reason            Age               From               Message
  ----     ------            ----              ----               -------
  Warning  FailedScheduling  9s (x5 over 71s)  default-scheduler  0/13 nodes are available: 12 node(s) didn't match node selector.

However, all nodes are available, the DaemonSet has no node selector, and the nodes have no taints either.

daemonset https://gist.github.com/zetaab/4a605cb3e15e349934cb7db29ec72bd8

% kubectl get nodes
NAME                                   STATUS   ROLES    AGE   VERSION
e2etest-1-kaasprod-k8s-local           Ready    node     46h   v1.18.3
e2etest-2-kaasprod-k8s-local           Ready    node     46h   v1.18.3
e2etest-3-kaasprod-k8s-local           Ready    node     44h   v1.18.3
e2etest-4-kaasprod-k8s-local           Ready    node     44h   v1.18.3
master-zone-1-1-1-kaasprod-k8s-local   Ready    master   47h   v1.18.3
master-zone-2-1-1-kaasprod-k8s-local   Ready    master   47h   v1.18.3
master-zone-3-1-1-kaasprod-k8s-local   Ready    master   47h   v1.18.3
nodes-z1-1-kaasprod-k8s-local          Ready    node     47h   v1.18.3
nodes-z1-2-kaasprod-k8s-local          Ready    node     47h   v1.18.3
nodes-z2-1-kaasprod-k8s-local          Ready    node     46h   v1.18.3
nodes-z2-2-kaasprod-k8s-local          Ready    node     46h   v1.18.3
nodes-z3-1-kaasprod-k8s-local          Ready    node     47h   v1.18.3
nodes-z3-2-kaasprod-k8s-local          Ready    node     46h   v1.18.3

% kubectl get pods -n weave -l weave-scope-component=agent -o wide
NAME                      READY   STATUS    RESTARTS   AGE     IP           NODE                                   NOMINATED NODE   READINESS GATES
weave-scope-agent-2drzw   1/1     Running   0          26h     10.1.32.23   e2etest-1-kaasprod-k8s-local           <none>           <none>
weave-scope-agent-4kpxc   1/1     Running   3          26h     10.1.32.12   nodes-z1-2-kaasprod-k8s-local          <none>           <none>
weave-scope-agent-78n7r   1/1     Running   0          26h     10.1.32.7    e2etest-4-kaasprod-k8s-local           <none>           <none>
weave-scope-agent-9m4n8   1/1     Running   0          26h     10.1.96.4    master-zone-1-1-1-kaasprod-k8s-local   <none>           <none>
weave-scope-agent-b2gnk   1/1     Running   1          26h     10.1.96.12   master-zone-3-1-1-kaasprod-k8s-local   <none>           <none>
weave-scope-agent-blwtx   1/1     Running   2          26h     10.1.32.20   nodes-z1-1-kaasprod-k8s-local          <none>           <none>
weave-scope-agent-cbhjg   1/1     Running   0          26h     10.1.64.15   e2etest-2-kaasprod-k8s-local           <none>           <none>
weave-scope-agent-csp49   1/1     Running   0          26h     10.1.96.14   e2etest-3-kaasprod-k8s-local           <none>           <none>
weave-scope-agent-g4k2x   1/1     Running   1          26h     10.1.64.10   nodes-z2-2-kaasprod-k8s-local          <none>           <none>
weave-scope-agent-kx85h   1/1     Running   2          26h     10.1.96.6    nodes-z3-1-kaasprod-k8s-local          <none>           <none>
weave-scope-agent-lllqc   0/1     Pending   0          5m56s   <none>       <none>                                 <none>           <none>
weave-scope-agent-nls2h   1/1     Running   0          26h     10.1.96.17   master-zone-2-1-1-kaasprod-k8s-local   <none>           <none>
weave-scope-agent-p8njs   1/1     Running   2          26h     10.1.96.19   nodes-z3-2-kaasprod-k8s-local          <none>           <none>

I have tried restarting the apiservers/schedulers/controller-managers, but it does not help. I have also tried restarting the single node that is stuck (nodes-z2-1-kaasprod-k8s-local), but that does not help either. Only deleting the node and recreating it helps.

% kubectl describe node nodes-z2-1-kaasprod-k8s-local
Name:               nodes-z2-1-kaasprod-k8s-local
Roles:              node
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=59cf4871-de1b-4294-9e9f-2ea7ca4b771f
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=regionOne
                    failure-domain.beta.kubernetes.io/zone=zone-2
                    kops.k8s.io/instancegroup=nodes-z2
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=nodes-z2-1-kaasprod-k8s-local
                    kubernetes.io/os=linux
                    kubernetes.io/role=node
                    node-role.kubernetes.io/node=
                    node.kubernetes.io/instance-type=59cf4871-de1b-4294-9e9f-2ea7ca4b771f
                    topology.cinder.csi.openstack.org/zone=zone-2
                    topology.kubernetes.io/region=regionOne
                    topology.kubernetes.io/zone=zone-2
Annotations:        csi.volume.kubernetes.io/nodeid: {"cinder.csi.openstack.org":"faf14d22-010f-494a-9b34-888bdad1d2df"}
                    node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4Address: 10.1.64.32/19
                    projectcalico.org/IPv4IPIPTunnelAddr: 100.98.136.0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Thu, 28 May 2020 13:28:24 +0300
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  nodes-z2-1-kaasprod-k8s-local
  AcquireTime:     <unset>
  RenewTime:       Sat, 30 May 2020 12:02:13 +0300
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Fri, 29 May 2020 09:40:51 +0300   Fri, 29 May 2020 09:40:51 +0300   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Sat, 30 May 2020 11:59:53 +0300   Fri, 29 May 2020 09:40:45 +0300   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Sat, 30 May 2020 11:59:53 +0300   Fri, 29 May 2020 09:40:45 +0300   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Sat, 30 May 2020 11:59:53 +0300   Fri, 29 May 2020 09:40:45 +0300   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Sat, 30 May 2020 11:59:53 +0300   Fri, 29 May 2020 09:40:45 +0300   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  10.1.64.32
  Hostname:    nodes-z2-1-kaasprod-k8s-local
Capacity:
  cpu:                4
  ephemeral-storage:  10287360Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             8172420Ki
  pods:               110
Allocatable:
  cpu:                4
  ephemeral-storage:  9480830961
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             8070020Ki
  pods:               110
System Info:
  Machine ID:                 c94284656ff04cf090852c1ddee7bcc2
  System UUID:                faf14d22-010f-494a-9b34-888bdad1d2df
  Boot ID:                    295dc3d9-0a90-49ee-92f3-9be45f2f8e3d
  Kernel Version:             4.19.0-8-cloud-amd64
  OS Image:                   Debian GNU/Linux 10 (buster)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://19.3.8
  Kubelet Version:            v1.18.3
  Kube-Proxy Version:         v1.18.3
PodCIDR:                      100.96.12.0/24
PodCIDRs:                     100.96.12.0/24
ProviderID:                   openstack:///faf14d22-010f-494a-9b34-888bdad1d2df
Non-terminated Pods:          (3 in total)
  Namespace                   Name                                        CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                   ----                                        ------------  ----------  ---------------  -------------  ---
  kube-system                 calico-node-77pqs                           100m (2%)     200m (5%)   100Mi (1%)       100Mi (1%)     46h
  kube-system                 kube-proxy-nodes-z2-1-kaasprod-k8s-local    100m (2%)     200m (5%)   100Mi (1%)       100Mi (1%)     46h
  volume                      csi-cinder-nodeplugin-5jbvl                 100m (2%)     400m (10%)  200Mi (2%)       200Mi (2%)     46h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                300m (7%)   800m (20%)
  memory             400Mi (5%)  400Mi (5%)
  ephemeral-storage  0 (0%)      0 (0%)
Events:
  Type    Reason                   Age    From                                    Message
  ----    ------                   ----   ----                                    -------
  Normal  Starting                 7m27s  kubelet, nodes-z2-1-kaasprod-k8s-local  Starting kubelet.
  Normal  NodeHasSufficientMemory  7m26s  kubelet, nodes-z2-1-kaasprod-k8s-local  Node nodes-z2-1-kaasprod-k8s-local status is now: NodeHasSufficientMemory
  Normal  NodeHasNoDiskPressure    7m26s  kubelet, nodes-z2-1-kaasprod-k8s-local  Node nodes-z2-1-kaasprod-k8s-local status is now: NodeHasNoDiskPressure
  Normal  NodeHasSufficientPID     7m26s  kubelet, nodes-z2-1-kaasprod-k8s-local  Node nodes-z2-1-kaasprod-k8s-local status is now: NodeHasSufficientPID
  Normal  NodeAllocatableEnforced  7m26s  kubelet, nodes-z2-1-kaasprod-k8s-local  Updated Node Allocatable limit across pods

We are seeing this randomly in all of our clusters.

What you expected to happen: I expect the DaemonSet to provision pods to all nodes.

How to reproduce it (as minimally and precisely as possible): No real idea; install Kubernetes 1.18.x, deploy a DaemonSet, and then wait for days(?).

Anything else we need to know?: When this happens we cannot provision any other DaemonSets to that node either. As you can see, the logging fluent-bit pod is also missing. I cannot see any errors in that node's kubelet logs and, as said, restarting does not help.

% kubectl get ds --all-namespaces
NAMESPACE     NAME                       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
falco         falco-daemonset            13        13        12      13           12          <none>                            337d
kube-system   audit-webhook-deployment   3         3         3       3            3           node-role.kubernetes.io/master=   174d
kube-system   calico-node                13        13        13      13           13          kubernetes.io/os=linux            36d
kube-system   kops-controller            3         3         3       3            3           node-role.kubernetes.io/master=   193d
kube-system   metricbeat                 6         6         5       6            5           <none>                            35d
kube-system   openstack-cloud-provider   3         3         3       3            3           node-role.kubernetes.io/master=   337d
logging       fluent-bit                 13        13        12      13           12          <none>                            337d
monitoring    node-exporter              13        13        12      13           12          kubernetes.io/os=linux            58d
volume        csi-cinder-nodeplugin      6         6         6       6            6           <none>                            239d
weave         weave-scope-agent          13        13        12      13           12          <none>                            193d
weave         weavescope-iowait-plugin   6         6         5       6            5           <none>                            193d

As you can see, most of the DaemonSets are missing one pod.

Environment:

  • Kubernetes version (use kubectl version): 1.18.3
  • Cloud provider or hardware configuration: openstack
  • OS (e.g: cat /etc/os-release): debian buster
  • Kernel (e.g. uname -a): Linux nodes-z2-1-kaasprod-k8s-local 4.19.0-8-cloud-amd64 #1 SMP Debian 4.19.98-1+deb10u1 (2020-04-27) x86_64 GNU/Linux
  • Install tools: kops
  • Network plugin and version (if this is a network-related bug): calico
  • Others:
@zetaab zetaab added the kind/bug Categorizes issue or PR as related to a bug. label May 30, 2020
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label May 30, 2020
@zetaab
Member Author

zetaab commented May 30, 2020

/sig scheduling

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 30, 2020
@liggitt
Member

liggitt commented May 30, 2020

Can you provide the full yaml of the node, daemonset, an example pod, and the containing namespace as retrieved from the server?

@liggitt liggitt added the triage/needs-information Indicates an issue needs more information in order to work on it. label May 30, 2020
@liggitt
Member

liggitt commented May 30, 2020

DaemonSet pods schedule with a nodeAffinity selector that only matches a single node, so the "12 out of 13 didn't match" message is expected.
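For context, the DaemonSet controller hands its pods to the default scheduler and pins each one to its target node with a node affinity on the node name. The sketch below shows that kind of affinity with simplified names and an example node taken from this report; it is not the exact controller code.

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// daemonSetNodeAffinity builds the kind of per-node affinity the DaemonSet
// controller attaches to each of its pods, so the pod can only land on the
// one node it was created for. Sketch for illustration only.
func daemonSetNodeAffinity(nodeName string) *v1.Affinity {
	return &v1.Affinity{
		NodeAffinity: &v1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &v1.NodeSelector{
				NodeSelectorTerms: []v1.NodeSelectorTerm{{
					MatchFields: []v1.NodeSelectorRequirement{{
						Key:      "metadata.name", // a field selector, not a label
						Operator: v1.NodeSelectorOpIn,
						Values:   []string{nodeName},
					}},
				}},
			},
		},
	}
}

func main() {
	// nodes-z2-1-kaasprod-k8s-local is the stuck node from the report above.
	fmt.Printf("%+v\n", daemonSetNodeAffinity("nodes-z2-1-kaasprod-k8s-local"))
}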

@liggitt
Member

liggitt commented May 30, 2020

I don't see a reason why the scheduler would be unhappy with the pod/node combo… there are no ports that could conflict in the pod spec, and the node is not unschedulable or tainted and has sufficient resources.

@zetaab
Member Author

zetaab commented May 30, 2020

Okay, I restarted all 3 schedulers (and changed the log level to 4 in case we can see something interesting there). However, that fixed the issue:

% kubectl get ds --all-namespaces
NAMESPACE     NAME                       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
falco         falco-daemonset            13        13        13      13           13          <none>                            338d
kube-system   audit-webhook-deployment   3         3         3       3            3           node-role.kubernetes.io/master=   175d
kube-system   calico-node                13        13        13      13           13          kubernetes.io/os=linux            36d
kube-system   kops-controller            3         3         3       3            3           node-role.kubernetes.io/master=   194d
kube-system   metricbeat                 6         6         6       6            6           <none>                            36d
kube-system   openstack-cloud-provider   3         3         3       3            3           node-role.kubernetes.io/master=   338d
logging       fluent-bit                 13        13        13      13           13          <none>                            338d
monitoring    node-exporter              13        13        13      13           13          kubernetes.io/os=linux            59d
volume        csi-cinder-nodeplugin      6         6         6       6            6           <none>                            239d
weave         weave-scope-agent          13        13        13      13           13          <none>                            194d
weave         weavescope-iowait-plugin   6         6         6       6            6           <none>                            194d

Now all DaemonSets are provisioned correctly. Weird; anyway, it seems something is wrong with the scheduler.

@liggitt liggitt removed the triage/needs-information Indicates an issue needs more information in order to work on it. label May 30, 2020
@liggitt
Member

liggitt commented May 30, 2020

cc @kubernetes/sig-scheduling-bugs @ahg-g

@jejer

jejer commented Jun 1, 2020

We see a similar issue on v1.18.3: one node cannot be scheduled for DaemonSet pods.
Restarting the scheduler helps.

[root@tesla-cb0434-csfp1-csfp1-control-03 ~]# kubectl get pod -A|grep Pending
kube-system   coredns-vc5ws                                                 0/1     Pending   0          2d16h
kube-system   local-volume-provisioner-mwk88                                0/1     Pending   0          2d16h
kube-system   svcwatcher-ltqb6                                              0/1     Pending   0          2d16h
ncms          bcmt-api-hfzl6                                                0/1     Pending   0          2d16h
ncms          bcmt-yum-repo-589d8bb756-5zbvh                                0/1     Pending   0          2d16h
[root@tesla-cb0434-csfp1-csfp1-control-03 ~]# kubectl get ds -A
NAMESPACE     NAME                       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                   AGE
kube-system   coredns                    3         3         2       3            2           is_control=true                 2d16h
kube-system   danmep-cleaner             0         0         0       0            0           cbcs.nokia.com/danm_node=true   2d16h
kube-system   kube-proxy                 8         8         8       8            8           <none>                          2d16h
kube-system   local-volume-provisioner   8         8         7       8            7           <none>                          2d16h
kube-system   netwatcher                 0         0         0       0            0           cbcs.nokia.com/danm_node=true   2d16h
kube-system   sriov-device-plugin        0         0         0       0            0           sriov=enabled                   2d16h
kube-system   svcwatcher                 3         3         2       3            2           is_control=true                 2d16h
ncms          bcmt-api                   3         3         0       3            0           is_control=true                 2d16h
[root@tesla-cb0434-csfp1-csfp1-control-03 ~]# kubectl get node
NAME                                  STATUS   ROLES    AGE     VERSION
tesla-cb0434-csfp1-csfp1-control-01   Ready    <none>   2d16h   v1.18.3
tesla-cb0434-csfp1-csfp1-control-02   Ready    <none>   2d16h   v1.18.3
tesla-cb0434-csfp1-csfp1-control-03   Ready    <none>   2d16h   v1.18.3
tesla-cb0434-csfp1-csfp1-edge-01      Ready    <none>   2d16h   v1.18.3
tesla-cb0434-csfp1-csfp1-edge-02      Ready    <none>   2d16h   v1.18.3
tesla-cb0434-csfp1-csfp1-worker-01    Ready    <none>   2d16h   v1.18.3
tesla-cb0434-csfp1-csfp1-worker-02    Ready    <none>   2d16h   v1.18.3
tesla-cb0434-csfp1-csfp1-worker-03    Ready    <none>   2d16h   v1.18.3

@ahg-g
Member

ahg-g commented Jun 1, 2020

Hard to debug without knowing how to reproduce. Do you by any chance have the scheduler logs for the pod that failed to schedule?

@ahg-g
Member

ahg-g commented Jun 1, 2020

Okay I restarted all 3 schedulers

I assume only one of them is named default-scheduler, correct?

changed loglevel to 4 if we can see something interesting there

Can you share what you noticed?

@jejer

jejer commented Jun 1, 2020

I set the log level to 9, but it seems there is nothing more interesting; the logs below are looping:

I0601 01:45:05.039373       1 generic_scheduler.go:290] Preemption will not help schedule pod kube-system/coredns-vc5ws on any node.
I0601 01:45:05.039437       1 factory.go:462] Unable to schedule kube-system/coredns-vc5ws: no fit: 0/8 nodes are available: 7 node(s) didn't match node selector.; waiting
I0601 01:45:05.039494       1 scheduler.go:776] Updating pod condition for kube-system/coredns-vc5ws to (PodScheduled==False, Reason=Unschedulable)

@zetaab
Member Author

zetaab commented Jun 1, 2020

Yeah, I could not see anything more than the same line:

no fit: 0/8 nodes are available: 7 node(s) didn't match node selector.; waiting

@ahg-g
Member

ahg-g commented Jun 1, 2020

The strange thing is that the log message shows the result for 7 nodes only, like the issue reported in #91340.

@ahg-g
Member

ahg-g commented Jun 1, 2020

/cc @damemi

@damemi
Contributor

damemi commented Jun 1, 2020

@ahg-g this does look like the same issue I reported there. If I had to guess, it seems like we either have a filter plugin that doesn't always report its error or some other condition that is failing silently.

@damemi
Contributor

damemi commented Jun 1, 2020

Note that in my issue, restarting the scheduler also fixed it (as mentioned in this thread too: #91601 (comment)).

Mine was also about a DaemonSet, so I think this is a duplicate. If that's the case, we can close this one and continue the discussion in #91340.

@zetaab
Member Author

zetaab commented Jun 1, 2020

Anyway, the scheduler needs a more verbose logging option; it's impossible to debug these issues if there are no logs about what it does.

@damemi
Contributor

damemi commented Jun 1, 2020

@zetaab +1, the scheduler could use significant improvements to its current logging abilities. That's an upgrade I've been meaning to tackle for a while, and I've finally opened an issue for it here: #91633

@alculquicondor
Member

/assign

I'm looking into this. A few questions to help me narrow down the case; I haven't been able to reproduce it yet.

  • What was created first: the daemonset or the node?
  • Are you using the default profile?

@alculquicondor
Member

  • Do you have extenders?

@jejer

jejer commented Jun 9, 2020

The nodes were created before the DaemonSet.
I suppose we used the default profile; which profile do you mean and how do I check?
No extenders.

    command:
    - /usr/local/bin/kube-scheduler
    - --address=127.0.0.1
    - --kubeconfig=/etc/kubernetes/kube-scheduler.kubeconfig
    - --profiling=false
    - --v=1

Another thing that may have an impact: the disk performance is not very good for etcd; etcd complains about slow operations.

@maelk
Contributor

maelk commented Jul 22, 2020

Yes, you can run it in the unit tests by adding the small piece of code I posted.

@maelk
Contributor

maelk commented Jul 22, 2020

I am now working on adding a test case for the snapshot, to make sure this is properly tested.

@maelk
Contributor

maelk commented Jul 22, 2020

Big thumbs up to @igraecao for the help in reproducing the issue and running the tests in his setup.

@ahg-g
Member

ahg-g commented Jul 22, 2020

Thanks all for debugging this notorious issue. Resetting the index before creating the list is safe, so I think we should go with that for 1.18 and 1.19 patches, and have a proper fix in the master branch.

The purpose of the next function changed with the introduction of the NodeInfoList, so we can certainly simplify it and perhaps change it to toList, a function that creates a list from the tree and simply starts from the beginning every time.

@alculquicondor
Member

I understand the issue now: The calculation of whether or not a zone is exhausted is wrong because it doesn't consider where in each zone we started this "UpdateSnapshot" process. And yeah, it would only be visible with uneven zones.

Great job spotting this @maelk!

I would think we have the same issue in older versions; however, it is hidden by the fact that we do a tree pass every time, whereas in 1.18 we keep the snapshotted result until there are changes in the tree.

Now that the round-robin strategy is implemented in generic_scheduler.go, we might be fine with simply resetting all counters before UpdateSnapshot, as your PR is doing.

g.nextStartNodeIndex = (g.nextStartNodeIndex + processedNodes) % len(allNodes)
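For reference, a minimal standalone sketch of that round-robin offset, with hypothetical names rather than the actual generic_scheduler.go code: because the scheduler may stop filtering once it has found enough feasible nodes, it remembers how far it got and starts the next pod's search from that offset.

package main

import "fmt"

// roundRobinScheduler is a sketch, not the real scheduler struct: it only
// keeps the offset at which the next pod's node search should start.
type roundRobinScheduler struct {
	nextStartNodeIndex int
}

// filterNodes scans nodes starting at the rotating offset and stops once it
// has found `want` feasible nodes, mirroring the fact that the scheduler may
// only evaluate a fraction of the cluster for each pod.
func (s *roundRobinScheduler) filterNodes(allNodes []string, want int, feasible func(string) bool) []string {
	var result []string
	processed := 0
	for i := 0; i < len(allNodes) && len(result) < want; i++ {
		node := allNodes[(s.nextStartNodeIndex+i)%len(allNodes)]
		processed++
		if feasible(node) {
			result = append(result, node)
		}
	}
	// Advance the offset by the number of nodes visited, so the next pod
	// starts its search from a different node.
	s.nextStartNodeIndex = (s.nextStartNodeIndex + processed) % len(allNodes)
	return result
}

func main() {
	s := &roundRobinScheduler{}
	nodes := []string{"n1", "n2", "n3", "n4"}
	always := func(string) bool { return true }
	fmt.Println(s.filterNodes(nodes, 2, always)) // [n1 n2]
	fmt.Println(s.filterNodes(nodes, 2, always)) // [n3 n4]
}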

Just to double check, @ahg-g: this should be fine even in a cluster where new nodes are added/removed all the time, right?

@Huang-Wei
Member

Thanks @maelk for spotting the root cause!

The purpose of the next function changed with the introduction of the NodeInfoList, so we can certainly simplify it and perhaps change it to toList, a function that creates a list from the tree and simply starts from the beginning every time.

Given that cache.nodeTree.next() is only called in building the snapshot nodeInfoList, I think it's also safe to remove the indexes (both zoneIndex and nodeIndex) from nodeTree struct. Instead, come up with a simple nodeIterator() function to iterate through its zone/node in a round-robin manner.
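A minimal sketch of what such a round-robin iteration could look like when it is rebuilt from scratch for every snapshot instead of relying on persistent indexes (hypothetical names, not the actual nodeTree code):

package main

import "fmt"

// zonedNodeList flattens the zones into a single list while alternating
// between zones, always starting from the beginning; exhausted zones are
// simply skipped. Sketch only, not the real nodeTree implementation.
func zonedNodeList(zones []string, nodesPerZone map[string][]string) []string {
	var out []string
	for i := 0; ; i++ {
		added := false
		for _, zone := range zones {
			if nodes := nodesPerZone[zone]; i < len(nodes) {
				out = append(out, nodes[i])
				added = true
			}
		}
		if !added {
			return out // every zone is exhausted
		}
	}
}

func main() {
	zones := []string{"zone-1", "zone-2"}
	nodes := map[string][]string{
		"zone-1": {"n11", "n12"},
		"zone-2": {"n21", "n22", "n23"}, // uneven zones
	}
	fmt.Println(zonedNodeList(zones, nodes)) // [n11 n21 n12 n22 n23]
}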

@Huang-Wei
Member

BTW, there is a typo in #91601 (comment); the case should be:

{
	name:           "add nodes to a new and to an exhausted zone",
	nodesToAdd:     append(allNodes[6:9], allNodes[3]),
	nodesToRemove:  nil,
	operations:     []string{"add", "add", "next", "next", "add", "add", "next", "next", "next", "next"},
	expectedOutput: []string{"node-6", "node-7", "node-3", "node-8", "node-6", "node-7"},
	// with the codebase on master and 1.18, its output is [node-6 node-7 node-3 node-8 node-6 node-3]
},

@ahg-g
Member

ahg-g commented Jul 23, 2020

Just to double check, @ahg-g: this should be fine even in a cluster where new nodes are added/removed all the time, right?

I am assuming you are talking about the logic in generic_scheduler.go; if so, yes, it doesn't matter much whether nodes were added or removed. The main thing we need to avoid is iterating over the nodes in the same order every time we schedule a pod; we just need a good approximation of iterating over the nodes across pods.

@ahg-g
Member

ahg-g commented Jul 23, 2020

Given that cache.nodeTree.next() is only called in building the snapshot nodeInfoList, I think it's also safe to remove the indexes (both zoneIndex and nodeIndex) from nodeTree struct. Instead, come up with a simple nodeIterator() function to iterate through its zone/node in a round-robin manner.

yes, we just need to iterate over all zones/nodes in the same order every time.

@maelk
Contributor

maelk commented Jul 23, 2020

I have updated the PR with a unit test for the function that updates the snapshot list, specifically for that bug. I can also take care of refactoring the next() function to iterate over the zones and nodes without the round-robin, hence removing the issue.

@ahg-g
Member

ahg-g commented Jul 23, 2020

Thanks, sounds good, but we should still iterate between zones the same way we do now, that is by design.

@maelk
Contributor

maelk commented Jul 23, 2020

I don't really get what you mean here. Is it that the order of the nodes matters and we must still go round-robin between zones, or can we list all the nodes of one zone, then the next zone, and so on? Let's say you have two zones of two nodes each: in which order do you expect them, or does it even matter at all?

@ahg-g
Member

ahg-g commented Jul 23, 2020

The order matters: we need to alternate between zones while creating the list. If you have two zones of two nodes each, z1: {n11, n12} and z2: {n21, n22}, then the list should be {n11, n21, n12, n22}.

@maelk
Contributor

maelk commented Jul 23, 2020

OK, thanks, I'll give it a thought. Can we meanwhile proceed with the quick fix? BTW, some tests are failing on it, but I am not sure how that relates to my PR.

@ahg-g
Member

ahg-g commented Jul 23, 2020

Those are flakes. Please send a patch to 1.18 as well.

@maelk
Contributor

maelk commented Jul 23, 2020

Ok, will do. Thanks

@soulxu
Contributor

soulxu commented Jul 29, 2020

{
	name:           "add nodes to a new and to an exhausted zone",
	nodesToAdd:     append(allNodes[5:9], allNodes[3]),
	nodesToRemove:  nil,
	operations:     []string{"add", "add", "next", "next", "add", "add", "add", "next", "next", "next", "next"},
	expectedOutput: []string{"node-6", "node-7", "node-3", "node-8", "node-6", "node-7"},
},

@maelk, do you mean this test ignores 'node-5'?

I found that after fixing the append in #93516, the test shows that all the nodes can be iterated:

{
			name:           "add nodes to a new and to an exhausted zone",
			nodesToAdd:     append(append(make([]*v1.Node, 0), allNodes[5:9]...), allNodes[3]),
			nodesToRemove:  nil,
			operations:     []string{"add", "add", "next", "next", "add", "add", "add", "next", "next", "next", "next"},
			expectedOutput: []string{"node-5", "node-6", "node-3", "node-7", "node-8", "node-5"},
},

Nodes node-5, 6, 7, 8, and 3 can all be iterated.

Forgive me if I misunderstand something here.

@maelk
Contributor

maelk commented Jul 29, 2020

Yes, it was on purpose, based on what was there, but I can see how this can be cryptic, so it is better to make the append behave in a clearer way. Thanks for the patch.

@judgeaxl

How far back do you believe this bug was present? 1.17? 1.16? I've just seen the exact same problem in 1.17 on AWS and restarting the unscheduled node fixed the problem.

@alculquicondor
Member

@judgeaxl could you provide more details? Log lines, cache dumps, etc., so we can determine whether the issue is the same.

As I noted in #91601 (comment), I believe this bug was present in older versions, but my thinking is that it's transient.

@maelk would you be able to investigate?

@alculquicondor
Member

Please also share the distribution of nodes in the zones.

@maelk
Contributor

maelk commented Sep 14, 2020

@alculquicondor unfortunately I can't at this point. Sorry.

@judgeaxl

@alculquicondor sorry, I already rebuilt the cluster for other reasons, but it may have been a network configuration problem related to multi-AZ deployments and to which subnet the faulty node got launched in, so I wouldn't worry about it for now in the context of this issue. If I notice it again I'll report back with better details. Thanks!

@alculquicondor
Member

/retitle Some nodes are not considered in scheduling when there is zone imbalance

@k8s-ci-robot k8s-ci-robot changed the title Daemonset does not provision to all nodes, 0 nodes available Some nodes are not considered in scheduling when there is zone imbalance Sep 15, 2020