Scheduler will run into race conditions on large scale clusters #106361
@ahg-g: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the `triage/accepted` label.
IIRC, the 30s timeout only starts running after we receive a 200 response for the binding. But yes, if it takes longer than 30s to receive the update, we would invalidate the cache. I don't know if there is a particular reason to invalidate the cache at all; I suppose it was meant to guard against missing Pod deletion events. We need to improve that regardless. I would be in favor of dropping the timeout altogether.
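For readers landing here, below is a minimal Go sketch of how a TTL-based "assume" cache behaves. The names (`assumeCache`, `Assume`, `Cleanup`) are illustrative stand-ins, not the scheduler's actual types; the point is only to show where the expiry opens the race.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// assumeCache is an illustrative stand-in for the scheduler's internal
// cache of "assumed" pods: pods the scheduler has placed on a node in
// memory while the bind request is still in flight.
type assumeCache struct {
	mu       sync.Mutex
	node     map[string]string    // pod key -> assumed node
	deadline map[string]time.Time // pod key -> expiry of the assumption
}

// Assume records an in-memory pod-to-node assignment with a TTL.
func (c *assumeCache) Assume(pod, node string, ttl time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.node[pod] = node
	c.deadline[pod] = time.Now().Add(ttl)
}

// Cleanup drops assumptions whose TTL elapsed before the bound pod was
// observed via the watch. This is the step that opens the race: the
// bind may have succeeded even though the update has not arrived yet.
func (c *assumeCache) Cleanup(now time.Time) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for pod, d := range c.deadline {
		if now.After(d) {
			delete(c.node, pod)
			delete(c.deadline, pod)
			fmt.Printf("expired assumption for %s\n", pod)
		}
	}
}

func main() {
	c := &assumeCache{node: map[string]string{}, deadline: map[string]time.Time{}}
	c.Assume("default/pod1", "node-a", 30*time.Second)
	// Simulate the cleanup loop firing after the TTL with no watch update yet.
	c.Cleanup(time.Now().Add(31 * time.Second))
}
```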
ok, so the deadline is effectively for receiving the update. Perhaps making it longer, like 15min, is a good first step before completely removing it, just so we can easily revert in case something else goes wrong.
I sent #106412 to increase the timeout to 15min. We should backport this.
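For reference, the change in #106412 amounts to bumping a single TTL constant. A paraphrase of the shape of the change (the constant name below is from memory and may not match the source exactly):

```go
// pkg/scheduler (paraphrased, not the exact diff):
const (
	// How long an assumed pod stays in the scheduler cache before the
	// in-memory pod-to-node assignment is dropped.
	// Previously: durationToExpireAssumedPod = 30 * time.Second
	durationToExpireAssumedPod = 15 * time.Minute
)
```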
We should double check the history, but I think some of the past reasons for this timeout no longer hold.
So I agree with Aldo that we should double check whether this timeout is needed at all currently.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its standard inactivity rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its standard inactivity rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its standard inactivity rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
We are probably good to remove this timeout altogether at this point. /remove-lifecycle stale
/good-first-issue
I see this issue was labeled as "good first issue" and I would like to start working on it! @alculquicondor
one more release before we can remove this code.
do you have a plan to cherry-pick to previous versions, e.g. 1.23, 1.24?
did you mean 1.13 and 1.14?
oh, sorry, my bad, I meant 1.23 and 1.24
It is fixed in 1.23 by #106412, so it would be fixed in 1.24 too.
Hi, is this issue still available?
@anson627 yes, but since the likelihood that the issue happens is very low, I don't think it's worth the risk of cherry-picking at this point.
According to #110925 (comment), we can remove the TTL code entirely in 1.27. I'll give @kapiljain1989 the chance to confirm whether they can still do this. In the meantime:
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to its standard inactivity rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to its standard inactivity rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to its standard inactivity rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
I want to help fix this.
Thanks. What is left is to remove the timeout logic altogether.
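For anyone picking this up, the target end state is roughly: an assumed pod has no deadline at all and stays in the cache until the watch delivers the bound pod or a deletion event removes it. A hedged sketch of that shape, extending the illustrative `assumeCache` from earlier in the thread (again, not the scheduler's real API; assumes the same `time` import):

```go
// AssumeForever records the assignment with no deadline; Cleanup will
// never expire it, so only a watch event can change its state.
func (c *assumeCache) AssumeForever(pod, node string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.node[pod] = node // note: no entry is written to c.deadline
}

// Confirm is called when the informer delivers the pod with
// .spec.nodeName set; the assumption is now backed by apiserver state.
func (c *assumeCache) Confirm(pod string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.deadline, pod)
}

// Forget is called on a pod deletion event; with the TTL gone, this is
// the only way an unconfirmed assumption is ever dropped.
func (c *assumeCache) Forget(pod string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.node, pod)
	delete(c.deadline, pod)
}
```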
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues according to its standard inactivity rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
What happened?
The scheduler has a 30s timeout for the bind operation to succeed; if we don't get a response within 30s, the in-memory assignment of pod to node in the scheduler cache expires.
A race condition happens in the following case:
1. pod1 is assigned to a node, the scheduler cache is updated with the assignment, and the bind operation is issued to the apiserver.
2. The apiserver is under heavy pressure, so the bind takes more than 30s and the scheduler expires the cached pod-to-node assignment.
3. The bind eventually succeeds, but because the apiserver is under heavy pressure, the pod update carrying the node name takes a long time to propagate to the scheduler.
4. Because the pod update took a long time to propagate and the cache entry had expired, the scheduler is not aware that the assignment actually happened, so it has no problem assigning to the same node a second pod that would otherwise not fit if the scheduler knew the first pod was eventually assigned there.
On the scheduler side, what we need to do is make the 30s timeout longer for large clusters, and ideally adaptive to cluster state.
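Purely to illustrate "adaptive to cluster state" (this heuristic is hypothetical, not something the scheduler implements): the expiry could scale with cluster size, so that large clusters, where watch propagation is slowest, get a longer grace period before the in-memory assignment is dropped.

```go
// expiryForClusterSize is a hypothetical heuristic: larger clusters get
// more time for the bound-pod update to propagate before the assumed
// pod expires. Assumes the time package is imported.
func expiryForClusterSize(nodeCount int) time.Duration {
	switch {
	case nodeCount > 5000:
		return 60 * time.Minute
	case nodeCount > 1000:
		return 30 * time.Minute
	default:
		return 15 * time.Minute
	}
}
```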
/sig scheduling
What did you expect to happen?
No race conditions.
How can we reproduce it (as minimally and precisely as possible)?
Create a large scale cluster.
Anything else we need to know?
No response
Kubernetes version
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)