Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scheduler will run into race conditions on large scale clusters #106361

Closed
ahg-g opened this issue Nov 11, 2021 · 41 comments · Fixed by #110925
Closed

Scheduler will run into race conditions on large scale clusters #106361

ahg-g opened this issue Nov 11, 2021 · 41 comments · Fixed by #110925
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

@ahg-g
Copy link
Member

ahg-g commented Nov 11, 2021

What happened?

The scheduler has a 30s timeout for the bind operation to succeed; if we don't get a response within 30s, the in-memory assignment of pod to node in the scheduler cache expires.

A race condition will happen in the follow case:

  1. pod1 is assigned to a node, scheduler cache is updated with the assignment, bind operation issued to apiserver.

  2. if the apiserver is under huge pressure, bind takes more than 30s, scheduler expires the cached pod-to-node assignment.

  3. bind eventually succeeds, but because the apiserver is under huge pressure, the pod update with the node name takes a long time to propagate to the scheduler.

  4. because the pod update took a long time to propagate and the cache entry expired, the scheduler is not aware that the assignment actually happened, and so it had no problem assigning a second pod to the same node that would otherwise not fit if the scheduler was aware that the first pod was eventually assigned to the node.

On the scheduler side, what we need to do is make the 30s longer for large clusters, and ideally adaptable to cluster state.

/sig scheduling

What did you expect to happen?

No race conditions.

How can we reproduce it (as minimally and precisely as possible)?

Create a large scale cluster.

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
# paste output here

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@ahg-g ahg-g added the kind/bug Categorizes issue or PR as related to a bug. label Nov 11, 2021
@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Nov 11, 2021
@k8s-ci-robot
Copy link
Contributor

@ahg-g: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Nov 11, 2021
@alculquicondor
Copy link
Member

IIRC, the 30s timeout only starts running after we receive a 200 response for the binding. But yes, if it takes longer than 30s to receive the update, we would invalidate the cache.

I don't know if there is particular reason to invalidate the cache at all. I suppose it was meant to guard against missing Pod deletion events. We need to improve that regardless.

I would be in favor of dropping the timeout altogether.

@ahg-g
Copy link
Member Author

ahg-g commented Nov 11, 2021

ok, so the deadline is effectively for receiving the update.

Perhaps making it longer, like 15min, is good first step before completely removing it just so we can easily revert in case something else goes wrong.

@ahg-g
Copy link
Member Author

ahg-g commented Nov 14, 2021

I sent #106412 to increase the timeout to 15min. We should backport this.

@wojtek-t
Copy link
Member

I don't know if there is particular reason to invalidate the cache at all. I suppose it was meant to guard against missing Pod deletion events. We need to improve that regardless.

We should double check the history, but I think that in the past we were:

  • asynchronously sending bind request
  • updating the cache before sending the request
    and then the timeout was actually protecting us from the binding request actually not succeeding.

So I agree with Aldo - that we should double check if this timeout is at all needed currently.

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 28, 2022
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 30, 2022
@kerthcet
Copy link
Member

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 31, 2022
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 29, 2022
@alculquicondor
Copy link
Member

We are probably good to remove this timeout altogether at this point.

/remove-lifecycle stale

@alculquicondor
Copy link
Member

/good-first-issue

@k8s-ci-robot k8s-ci-robot added good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 29, 2022
@fabianberisha
Copy link

I see this issue was labeled as "good first issue" and I would like to start working on it! @alculquicondor

@alculquicondor
Copy link
Member

one more release before we can remove this code.

@anson627
Copy link

anson627 commented Oct 31, 2022

do you have plan to cherry pick for previous versions, e.g 1.23, 1.24?

@alculquicondor
Copy link
Member

did you mean 1.13 and 1.14?

@anson627
Copy link

oh, sorry my bad, I meant for 1.23 and 1.24

@alculquicondor
Copy link
Member

It is fixed in 1.23 #106412, so it would be fixed in 1.24 too.

@anson627
Copy link

anson627 commented Oct 31, 2022

my understanding is the previous fix to increase the duration only reduce the chance of issue, but not completely avoid it, that's why the other 2 fixes #110925 and #110954 are needed, right?

@emiljanogj
Copy link

emiljanogj commented Dec 8, 2022

Hi, is this issue still available?

@alculquicondor
Copy link
Member

@anson627 yes, but since the likelihood that an issue happens is very low, I don't think it's worth the risk of cherry-picking at this point.

@alculquicondor
Copy link
Member

According to #110925 (comment), we can remove the TTL code entirely in 1.27

I'll give the chance to @kapiljain1989 to confirm if they can still do this.

In the meantime:
/unassign @fabianberisha
/unassign @zabrox
/remove-help

@k8s-ci-robot k8s-ci-robot removed help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. labels Dec 8, 2022
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 8, 2023
@alculquicondor
Copy link
Member

/remove-lifecycle stale
/unassign @kapiljain1989
as we haven't heard from them.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 8, 2023
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 6, 2023
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 6, 2023
@wackxu
Copy link
Contributor

wackxu commented Jul 20, 2023

I want to help fix this
/assign

@alculquicondor
Copy link
Member

Thanks. What is left is to remove the timeout logic altogether.

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot k8s-ci-robot closed this as not planned Won't fix, can't repro, duplicate, stale Jan 20, 2024
@k8s-ci-robot
Copy link
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.
Projects
Archived in project