Some nodes are not considered in scheduling when there is zone imbalance #91601
/sig scheduling |
Can you provide the full yaml of the node, daemonset, an example pod, and the containing namespace as retrieved from the server? |
node: daemonset: example pod (working): example pod (not scheduling): namespace: |
DaemonSet pods schedule with a nodeAffinity selector that only matches a single node, so the "12 out of 13 didn't match" message is expected. |
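For context, this is roughly the shape of the per-node affinity the DaemonSet controller attaches to each of its pods — a minimal sketch using the k8s.io/api/core/v1 types; `daemonSetNodeAffinity` is an illustrative helper name, not the controller's actual function:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// daemonSetNodeAffinity builds a per-node affinity like the one the
// DaemonSet controller adds to each pod: a matchFields term on
// metadata.name that can only ever match one node, which is why the
// scheduler reports "12 out of 13 nodes didn't match" for a healthy
// DaemonSet pod in a 13-node cluster.
func daemonSetNodeAffinity(nodeName string) *v1.Affinity {
	return &v1.Affinity{
		NodeAffinity: &v1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &v1.NodeSelector{
				NodeSelectorTerms: []v1.NodeSelectorTerm{{
					MatchFields: []v1.NodeSelectorRequirement{{
						Key:      "metadata.name",
						Operator: v1.NodeSelectorOpIn,
						Values:   []string{nodeName},
					}},
				}},
			},
		},
	}
}

func main() {
	fmt.Printf("%+v\n", daemonSetNodeAffinity("nodes-z2-1-kaasprod-k8s-local"))
}
```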
I don't see a reason why the scheduler would be unhappy with the pod/node combo… there are no ports that could conflict in the pod spec, the node is not unschedulable or tainted, and it has sufficient resources |
Okay, I restarted all 3 schedulers (and changed the log level to 4 in case we can see something interesting there). However, that fixed the issue: now all daemonsets are provisioned correctly. Weird; anyway, something seems wrong with the scheduler. |
cc @kubernetes/sig-scheduling-bugs @ahg-g |
We are seeing the same issue on v1.18.3: daemonset pods cannot be scheduled onto one of the nodes. |
Hard to debug without knowing how to reproduce. Do you by any chance have the scheduler logs for the pod that failed to schedule? |
I assume only one of them is named
Can you share what you noticed? |
I set the log level to 9, but it seems there is nothing more interesting; the logs below just loop. |
Yeah, I could not see anything more than the same line. |
The strange thing is that the log message shows the result for only 7 nodes, like the issue reported in #91340. |
/cc @damemi |
@ahg-g this does look like the same issue I reported there. If I had to guess, it seems like we either have a filter plugin that doesn't always report its error, or some other condition that's failing silently. |
Note that in my issue, restarting the scheduler also fixed it (as mentioned in this thread too: #91601 (comment)). Mine was also about a daemonset, so I think this is a duplicate. If that's the case, we can close this and continue the discussion in #91340. |
Anyway, the scheduler needs a more verbose logging option; it's impossible to debug these issues if there are no logs about what it does. |
/assign I'm looking into this. A few questions to help me narrow down the case; I haven't been able to reproduce it yet. |
The nodes were created before the daemonset. Another thing that may have an impact: the disk performance is not very good for etcd; etcd complains about slow operations. |
Yes, you can run it in the unit tests by adding the small piece of code I posted. |
I am now working on adding a test case for the snapshot, to make sure this is properly tested. |
Big thumbs up to @igraecao for helping reproduce the issue and running the tests in his setup. |
Thanks all for debugging this notorious issue. Resetting the index before creating the list is safe, so I think we should go with that for the 1.18 and 1.19 patches, and have a proper fix in the master branch. The purpose of the |
I understand the issue now: the calculation of whether or not a zone is exhausted is wrong, because it doesn't consider where in each zone we started this "UpdateSnapshot" process. And yeah, it would only be visible with uneven zones. Great job spotting this @maelk! I would think we have the same issue in older versions, but there it is hidden by the fact that we do a tree pass every time, whereas in 1.18 we keep the snapshotted result until there are changes in the tree. Now that the round-robin strategy is implemented in generic_scheduler.go, we might be fine with simply resetting all counters before UpdateSnapshot, as your PR is doing.
Just to double-check @ahg-g: this should be fine even in a cluster where new nodes are added/removed all the time, right? |
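To make the failure mode concrete, here is a self-contained toy model of the node tree — not the actual scheduler code; the names and structure are simplified for illustration. With uneven zones and per-zone cursors left over from a previous pass, the exhaustion check fires too early and nodes silently drop out of the snapshot; resetting the cursors first restores the full list:

```go
package main

import "fmt"

// nodeArray mirrors the per-zone structure: a node list plus a cursor.
type nodeArray struct {
	nodes     []string
	lastIndex int
}

// next returns the next node in the zone, or done=true once the cursor
// runs past the end ("zone exhausted").
func (na *nodeArray) next() (name string, done bool) {
	if na.lastIndex >= len(na.nodes) {
		return "", true
	}
	name = na.nodes[na.lastIndex]
	na.lastIndex++
	return name, false
}

type nodeTree struct {
	zones    []string
	tree     map[string]*nodeArray
	numNodes int
}

// list walks the zones round-robin until numNodes names are emitted or
// every zone reports exhausted. If the cursors are not reset first, a
// cursor left mid-zone by an earlier pass makes that zone look exhausted
// early, so some nodes never appear in the snapshot.
func (nt *nodeTree) list(resetFirst bool) []string {
	if resetFirst {
		for _, z := range nt.zones {
			nt.tree[z].lastIndex = 0
		}
	}
	var out []string
	exhausted := map[string]bool{}
	for len(out) < nt.numNodes && len(exhausted) < len(nt.zones) {
		for _, z := range nt.zones {
			if exhausted[z] {
				continue
			}
			name, done := nt.tree[z].next()
			if done {
				exhausted[z] = true
				continue
			}
			out = append(out, name)
		}
	}
	return out
}

func main() {
	nt := &nodeTree{
		zones: []string{"z1", "z2"},
		tree: map[string]*nodeArray{
			"z1": {nodes: []string{"n1", "n2", "n3"}},
			"z2": {nodes: []string{"n4"}},
		},
		numNodes: 4,
	}
	nt.tree["z1"].lastIndex = 2 // cursor left over from a previous pass
	fmt.Println(nt.list(false)) // [n3 n4]: n1 and n2 never enter the snapshot
	nt.tree["z1"].lastIndex = 2
	fmt.Println(nt.list(true)) // [n1 n4 n2 n3]: reset first, all nodes listed
}
```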
Thanks @maelk for spotting the root cause!
Given that |
BTW: there is a typo in #91601 (comment), the case should be: {
name: "add nodes to a new and to an exhausted zone",
nodesToAdd: append(allNodes[6:9], allNodes[3]),
nodesToRemove: nil,
operations: []string{"add", "add", "next", "next", "add", "add", "next", "next", "next", "next"},
expectedOutput: []string{"node-6", "node-7", "node-3", "node-8", "node-6", "node-7"},
// with the codebase on master and 1.18, the output is [node-6 node-7 node-3 node-8 node-6 node-3]
}, |
I am assuming you are talking about the logic in generic_scheduler.go. If so, yes: it doesn't matter much whether nodes were added or removed. The main thing we need to avoid is iterating over the nodes in the same order every time we schedule a pod; we just need a good approximation of spreading the iteration over the nodes across pods. |
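A sketch of that rotation idea — the names are illustrative, not the actual generic_scheduler.go code, which keeps a similar rotating start index across scheduling cycles:

```go
package main

import "fmt"

// rotatingNodeLister hands out the same node list starting at a different
// offset on each scheduling cycle, so successive pods do not all begin
// filtering at the same node; the absolute order of the snapshot list no
// longer matters, only that the iteration is spread across pods.
type rotatingNodeLister struct {
	next int
}

func (r *rotatingNodeLister) nodesForCycle(all []string) []string {
	n := len(all)
	out := make([]string, 0, n)
	for i := 0; i < n; i++ {
		out = append(out, all[(r.next+i)%n])
	}
	r.next = (r.next + 1) % n // advance so the next pod starts elsewhere
	return out
}

func main() {
	lister := &rotatingNodeLister{}
	nodes := []string{"n1", "n2", "n3"}
	fmt.Println(lister.nodesForCycle(nodes)) // [n1 n2 n3]
	fmt.Println(lister.nodesForCycle(nodes)) // [n2 n3 n1]
}
```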
yes, we just need to iterate over all zones/nodes in the same order every time. |
I have updated the PR with a unit test for the function updating the snapshot list, specifically for that bug. I can also take care of refactoring the next() function to iterate over the zones and nodes without round-robin, hence removing the issue. |
Thanks, sounds good, but we should still iterate between zones the same way we do now, that is by design. |
I don't really get what you mean here. Is it that the order of the nodes matters and we must still go round-robin between zones, or can we list all nodes of one zone, then all nodes of the next? Let's say you have two zones of two nodes each: in which order do you expect them, or does it even matter at all? |
The order matters; we need to alternate between zones while creating the list. If you have two zones of two nodes each |
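For illustration, a tiny sketch of the zone-alternating order being described — a hypothetical helper, not scheduler code:

```go
package main

import "fmt"

// interleave emits nodes zone-by-zone in round-robin order, the ordering
// the scheduler preserves by design: with two zones of two nodes each the
// result is [z1-n1 z2-n1 z1-n2 z2-n2], not one whole zone followed by the
// other.
func interleave(zones [][]string) []string {
	var out []string
	for i := 0; ; i++ {
		emitted := false
		for _, z := range zones {
			if i < len(z) {
				out = append(out, z[i])
				emitted = true
			}
		}
		if !emitted {
			return out
		}
	}
}

func main() {
	fmt.Println(interleave([][]string{
		{"z1-n1", "z1-n2"},
		{"z2-n1", "z2-n2"},
	}))
	// Output: [z1-n1 z2-n1 z1-n2 z2-n2]
}
```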
OK, thanks, I'll give it some thought. Can we proceed with the quick fix in the meantime? BTW, some tests are failing on it, but I am not sure how that relates to my PR. |
Those are flakes. Please send a patch to 1.18 as well. |
Ok, will do. Thanks |
@maelk, do you mean this test ignores 'node-5'? I found that after fixing the append in #93516, the test shows that all the nodes can be iterated: node-5, 6, 7, 8, and 3. Forgive me if I misunderstand something here. |
Yes, it was on purpose, based on what was there, but I can see how this can be cryptic, so better to make the append behave in a clearer way. Thanks for the patch. |
How far back do you believe this bug was present? 1.17? 1.16? I've just seen the exact same problem in 1.17 on AWS and restarting the unscheduled node fixed the problem. |
@judgeaxl could you provide more details? Log lines, cache dumps, etc., so we can determine whether the issue is the same. As I noted in #91601 (comment), I believe this bug was present in older versions, but my thinking is that it's transient. @maelk would you be able to investigate? |
Please also share the distribution of nodes in the zones. |
@alculquicondor unfortunately I can't at this point. Sorry. |
@alculquicondor sorry, I already rebuilt the cluster for other reasons, but it may have been a network configuration problem related to multi-AZ deployments and the subnet the faulty node was launched in, so I wouldn't worry about it for now in the context of this issue. If I notice it again, I'll report back with better details. Thanks! |
/retitle Some nodes are not considered in scheduling when there is zone imbalance |
What happened: We upgraded 15 Kubernetes clusters from 1.17.5 to 1.18.2/1.18.3 and started to see that daemonsets no longer work properly.
The problem is that not all daemonset pods get provisioned. The following error message is returned in the events:
However, all nodes are available, and the daemonset does not have a node selector. The nodes do not have taints either.
daemonset https://gist.github.com/zetaab/4a605cb3e15e349934cb7db29ec72bd8
I have tried restarting the apiservers/schedulers/controller-managers, but it does not help. I have also tried restarting the single node that is stuck (nodes-z2-1-kaasprod-k8s-local), but that does not help either. Only deleting the node and recreating it helps.
We are seeing this randomly in all of our clusters.
What you expected to happen: I expect the daemonset to be provisioned to all nodes.
How to reproduce it (as minimally and precisely as possible): No idea really; install Kubernetes 1.18.x, deploy a daemonset, and then wait for days(?).
Anything else we need to know?: When this happens, we cannot provision any other daemonsets to that node either. As you can see, the fluent-bit logging pod is also missing. I cannot see any errors in that node's kubelet logs and, as said, restarting does not help.
As you can see, most of the daemonsets are missing one pod.
Environment:
- Kubernetes version (use kubectl version): 1.18.3
- OS (e.g. cat /etc/os-release): Debian Buster
- Kernel (e.g. uname -a): Linux nodes-z2-1-kaasprod-k8s-local 4.19.0-8-cloud-amd64 #1 SMP Debian 4.19.98-1+deb10u1 (2020-04-27) x86_64 GNU/Linux