Don't re-patch pods that are already controlled by current worker #26778

Merged: 1 commit merged into apache:main on Oct 18, 2022

Conversation

@hterik (Contributor) commented on Sep 29, 2022

After the scheduler has launched many pods, it keeps trying to re-adopt them by patching every pod. Each patch operation involves a remote API call, which can be very slow, and in the meantime the scheduler cannot do anything else.

By ignoring the pods that already have the expected label, the list query result will be shorter and the number of patch queries much smaller.

We hit an unlucky moment in our environment where each patch operation started taking 100 ms; with 200 pods in flight, that accumulates to 20 seconds of blocked scheduler.
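
To make the idea concrete, here is a minimal sketch using the Kubernetes Python client; it is not the actual diff in this PR. The `airflow-worker` label is the one the executor conventionally stamps on worker pods, and the namespace and `scheduler_job_id` values are placeholders.

```python
# Illustrative sketch only: exclude pods whose "airflow-worker" label already
# points at the current scheduler job id, so they are neither listed nor
# re-patched during adoption.
from kubernetes import client, config

config.load_incluster_config()  # or config.load_kube_config() outside a cluster
kube = client.CoreV1Api()

scheduler_job_id = "42"   # placeholder: id of the currently running scheduler
namespace = "airflow"     # placeholder: namespace the worker pods run in

# Server-side filter: only pods NOT already labelled with this worker's id come
# back, so the list result is shorter and no patch is issued for pods that
# already carry the expected label.
pods = kube.list_namespaced_pod(
    namespace=namespace,
    label_selector=f"airflow-worker!={scheduler_job_id}",
)

for pod in pods.items:
    # Each patch is a remote API call (the slow part described above), so it is
    # only made for pods that still need to be adopted.
    labels = dict(pod.metadata.labels or {})
    labels["airflow-worker"] = scheduler_job_id
    kube.patch_namespaced_pod(
        name=pod.metadata.name,
        namespace=namespace,
        body={"metadata": {"labels": labels}},
    )
```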

@boring-cyborg bot added the provider:cncf-kubernetes (Kubernetes provider related issues) and area:Scheduler (Scheduler or dag parsing issues) labels on Sep 29, 2022
@eladkal added this to the Airflow 2.4.2 milestone on Oct 16, 2022
@ephraimbuddy merged commit 27ec562 into apache:main on Oct 18, 2022
ephraimbuddy pushed a commit that referenced this pull request on Oct 18, 2022: …6778) (cherry picked from commit 27ec562)
@ephraimbuddy added the type:bug-fix (Changelog: Bug Fixes) label on Oct 18, 2022
ephraimbuddy pushed another commit that referenced this pull request on Oct 18, 2022: …6778) (cherry picked from commit 27ec562)
@droppoint mentioned this pull request on Nov 21, 2023
Labels: area:Scheduler (Scheduler or dag parsing issues), provider:cncf-kubernetes (Kubernetes provider related issues), type:bug-fix (Changelog: Bug Fixes)

4 participants