Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stealing balance not accounting for recent decisions #7002

Closed
Tracked by #6993
fjetter opened this issue Sep 5, 2022 · 3 comments · Fixed by #7036
Closed
Tracked by #6993

Stealing balance not accounting for recent decisions #7002

fjetter opened this issue Sep 5, 2022 · 3 comments · Fixed by #7036
Labels
bug Something is broken p1 Affects a large population and inhibits work regression scheduler scheduling stability Issue or feature related to cluster stability (e.g. deadlock) stealing

Comments

@fjetter
Copy link
Member

fjetter commented Sep 5, 2022

The Stealing.balance implementation is currently using a greedy for-loop with multiple early termination conditions. These break statements are de-facto dead code causing too greedy and potentially destructive work stealing.

Stealing is highly sensitive to the definition of idle and saturated workers which is defined in Scheduler.check_idle_saturated.

Explanation of saturated

See also https://github.com/dask/distributed/pull/6614/files/36a60a5e358ea2a5d16597651126ac5892203b01#r952608704

I think the new definition of idle to incorporate whether there are free slots makes a lot of sense.

I didn't change the saturated definition because I don't honestly understand what it's trying to measure:

pending: float = occ * (p - nc) / (p * nc)
if 0.4 < pending > 1.9 * avg:
saturated.add(ws)

off-topic but I spent some time on this before and it's worth sharing

you can rewrite this statement to

pending: float = occ * (p - nc) / (p * nc)
pending = occ / nc - occ / p
pending = occupancy / n_threads - occupancy / processing
pending = avg_occ_per_thread - avg_occ_currently_processing

i.e. pending is "average occupancy per thread that is in state ready / not, yet executing"
or put into context of stealing, which is the only place that saturated matters, "occupancy per thread that could be stolen"
I guess what's written there is more stable in terms of floating point arithmetic.

So, the definition of saturated is spelled out something like

A worker is classified as saturated if it's ready occupancy is at least 0.4s per thread and at least 1.9 times the average cluster wide occupancy.

i.e. "there is something queued up / in ready and it is about two times more than average"

Simplified, this is how the algorithm is intended to work:
Stealing.balance iterates over saturated workers in descending occupancy and victimizes them for stealing. An idle worker will act as the thief and work will be rebalanced. This continues until there are no longer saturated workers or there are no longer idle workers.

However, this pseudo algorithm is no longer implemented. The current code only shows hints at how this was formerly achieved and it likely strongly depended on dicts, sets, etc. being passed by reference such that any changes to the fundamental collections are immediately propagated to the balance loop which then allows for early termination.

Stealing.maybe_move_task is calling Scheduler. check_idle_saturated with an additional keyword occ which is only used here. This keyword allows to override the actual occupancy of the worker which would allow us to reclassify thieves and victims based on their combined_occupancy, i.e. the Scheduler.idle and Scheduler.saturated collections are updated under the assumption that in-flight tasks are stolen successfully.

s.check_idle_saturated(victim, occ=occ_victim)
s.check_idle_saturated(thief, occ=occ_thief)

Assuming that idle in the balance for-loop was referencing Scheduler.idle and not be a derivative, this would allow us to break out of the loop at various places, e.g. here, here or here. However, the loop variable idle is always defined as a list since #5665 and even before is sometimes a list due to sorting ever since #754
This effectively renders any if not idle; break statements dead code. This can be confirmed by codecoverage measurements

image

Going back to #750 it appears that a similar mechanism existed for early termination if saturated became empty. The implementation changed drastically and I do not see any leftover termination mechanism that ensures a potential victim is still saturated if one accounts for combined_occupancy.

I suspect this lack of early termination to be a major contributor to at least the following tickets

@fjetter fjetter added stability Issue or feature related to cluster stability (e.g. deadlock) p1 Affects a large population and inhibits work regression scheduling stealing scheduler labels Sep 5, 2022
@fjetter fjetter added the bug Something is broken label Sep 5, 2022
@fjetter
Copy link
Member Author

fjetter commented Sep 6, 2022

Restoring the link to the actual scheduler dictionary again is not that difficult

diff --git a/distributed/stealing.py b/distributed/stealing.py
index b9858f465..4d2abc044 100644
--- a/distributed/stealing.py
+++ b/distributed/stealing.py
@@ -406,13 +413,13 @@ class WorkStealing(SchedulerPlugin):
                         occ_thief,
                     )
                 )
-                s.check_idle_saturated(victim, occ=occ_victim)
-                s.check_idle_saturated(thief, occ=occ_thief)
+                s.check_idle_saturated(thief, occ=combined_occupancy(thief))
+                s.check_idle_saturated(victim, occ=combined_occupancy(victim))

         with log_errors():
             i = 0
             # Paused and closing workers must never become thieves
-            idle = [ws for ws in s.idle.values() if ws.status == Status.running]
+            idle = s.idle.values()
             if not idle or len(idle) == len(s.workers):
                 return

@@ -429,8 +436,6 @@ class WorkStealing(SchedulerPlugin):
                 ]
             elif len(saturated) < 20:
                 saturated = sorted(saturated, key=combined_occupancy, reverse=True)
-            if len(idle) < 20:
-                idle = sorted(idle, key=combined_occupancy)

             for level, cost_multiplier in enumerate(self.co

However, even if done correctly, check_idle_saturated still won't declassify a worker as idle since it takes into account processing tasks which will only be populated after the stealing handshake is completed and the victim acknowledged the steal. Therefore, an empty worker will still very greedily steel tasks beyond reason.

@wence-
Copy link
Contributor

wence- commented Sep 8, 2022

off-topic but I spent some time on this before and it's worth sharing

you can rewrite this statement to

pending: float = occ * (p - nc) / (p * nc)
pending = occ / nc - occ / p
pending = occupancy / n_threads - occupancy / processing
pending = avg_occ_per_thread - avg_occ_currently_processing

i.e. pending is "average occupancy per thread that is in state ready / not, yet executing" or put into context of stealing, which is the only place that saturated matters, "occupancy per thread that could be stolen" I guess what's written there is more stable in terms of floating point arithmetic.

Assuming $\text{occ} \in [0, 1]$, $\text{nc} \in [1, 20]$, $\text{p} \in [0, 1000]$, herbie agrees with your rewrite (giving you an extra 0.4 bits of accuracy compared to the current implementation).

@fjetter
Copy link
Member Author

fjetter commented Sep 8, 2022

Assuming , , , herbie agrees with your rewrite (giving you an extra 0.4 bits of accuracy compared to the current implementation).

interesting. Typically occ is much larger though, it's rather [0, 1000) . If I put this in ,the precision does not improve that much but it is assuring to know that there is no "floating point arith" reason for this to be that complex. Either way, 0.4 bits is not a big deal for this kind of quantity

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something is broken p1 Affects a large population and inhibits work regression scheduler scheduling stability Issue or feature related to cluster stability (e.g. deadlock) stealing
Projects
None yet
2 participants