Allow stealing of fast tasks in some situations #6115

Open · mrocklin wants to merge 1 commit into main

Conversation

mrocklin
Member

Previously we avoided the stealing of fast tasks in some situations. It
was generally considered a bad idea to spend non-trivial time in order
to move around a 1ms task. However, sometimes it's not this task that
matters, but the tasks that it unblocks. Being strict about not
stealing may not always be ideal.

This commit relaxes stealing of fast tasks in three ways:

  1. It sets a minimum task duration of 5ms,
    this being roughly the per-task overhead in a real system

  2. It changes the network latency variable in stealing.py from 100ms to 10ms

  3. It no longer rules out very fast tasks from stealing entirely, but
    instead lets them compete based on their compute/transfer ratio
    (which should ordinarily be terrible)

    In cases where transfer times are trivial, stealing such tasks becomes
    viable again (a simplified sketch of the relaxed heuristic follows below).

Tests don't pass yet; this is up for comments.

Fixes #4471
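
For readers skimming the diff, below is a minimal sketch of the relaxed heuristic described above. It is not the literal patch in this PR: the standalone signature, the `ts.compute_time` / `ts.nbytes_deps` attribute names, and the 100x cutoff are simplified stand-ins for the real structures in stealing.py.

```python
from types import SimpleNamespace

# Simplified sketch only; names and thresholds are stand-ins, not the exact
# code in distributed/stealing.py.

LATENCY = 0.01  # lowered from 0.1 (100ms) to 0.01 (10ms) in this PR


def steal_time_ratio(ts, bandwidth):
    """Return a cost multiplier for stealing ``ts``, or None if hopeless.

    Instead of refusing outright to steal tasks under 5ms, clamp the compute
    time to a floor and let fast tasks compete on their transfer/compute
    ratio, which is ordinarily terrible but becomes reasonable when there is
    little or no data to move.
    """
    compute_time = max(ts.compute_time, 0.010)      # floor, rather than "just give up"
    transfer_time = ts.nbytes_deps / bandwidth + LATENCY
    cost_multiplier = transfer_time / compute_time
    if cost_multiplier > 100:                       # moving data costs far more than computing
        return None
    return cost_multiplier


# Example: a 1ms task with 1kB of dependencies over a 100 MB/s link now
# competes (ratio ~1) instead of being excluded outright.
ts = SimpleNamespace(compute_time=0.001, nbytes_deps=1_000)
print(steal_time_ratio(ts, bandwidth=100e6))
```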

@jakirkham
Member

cc @fjetter @gjoseph92

@github-actions
Contributor

Unit Test Results

16 files ±0 · 16 suites ±0 · 7h 16m 59s ⏱️ (-22m 12s)
2 730 tests ±0: 2 647 passed ✔️ (-1), 81 skipped 💤 (±0), 2 failed (+1)
21 725 runs ±0: 20 634 passed ✔️ (-11), 1 075 skipped 💤 (-4), 16 failed (+15)

For more details on these failures, see this check.

Results for commit c9613fe (comparison against base commit bd3f47e).

@@ -23,7 +23,7 @@
 # submission which may include code serialization. Therefore, be very
 # conservative in the latency estimation to suppress too aggressive stealing
 # of small tasks
-LATENCY = 0.1
+LATENCY = 0.01
Member

I don't think this is realistic, see #5390

@@ -199,8 +199,7 @@ def steal_time_ratio(self, ts):

         ws = ts.processing_on
         compute_time = ws.processing[ts]
-        if compute_time < 0.005:  # 5ms, just give up
-            return None, None
+        compute_time = max(compute_time, 0.010)
Member

I've been down this road before, see #4920

I also noticed that the way we measure bandwidths is very skewed, which impacts this significantly; see #4962 (comment). Not sure if this is still relevant, but it was big back then.

I still think #4920 is worth getting in but I had to drop this back then because deadlocks popped up.

Collaborator

@gjoseph92 left a comment

Stealing small tasks (like the splits of a map_overlap or a shuffle) might be a good idea in some special cases (like scale-up), but it's a bad idea in others. I'm wary of just changing some magic numbers here to make this case work.

Isn't this change a bit like a different way of implementing #4925? There was a good reason not to merge that #4925 (comment); I imagine it might apply to this too.

I think the conversation in #4471 (comment) was going in the right direction in identifying flaws in the stealing and decide_worker heuristics during scale-up.

To me, the reason to steal a map_overlap split onto a new worker doesn't have to do with the task's own duration; it has to do with the duration of the downstream work it opens up, and with the imbalance in occupancy between workers. So it would feel odd if those factors didn't come into play in how we change the decision.
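
Purely for illustration, here is a hypothetical scoring function along those lines, where the benefit of stealing depends on the downstream work a task unblocks and on the occupancy gap between workers rather than on the task's own duration. Every name and number in this sketch is invented; it is not distributed's actual heuristic.

```python
def steal_value(
    unblocked_durations: list[float],
    thief_occupancy: float,
    victim_occupancy: float,
    transfer_cost: float,
) -> float:
    """Estimate the benefit (in seconds) of moving one task to another worker.

    unblocked_durations: expected durations of the downstream tasks this task
        unblocks once it completes.
    thief_occupancy / victim_occupancy: estimated occupancy of the candidate
        thief and of the worker currently holding the task.
    transfer_cost: estimated time to move the task's dependencies.
    """
    unblocked_work = sum(unblocked_durations)
    imbalance = max(0.0, victim_occupancy - thief_occupancy)
    # Worth stealing only if the work opened up plus the rebalancing benefit
    # outweighs the cost of moving the data.
    return unblocked_work + imbalance - transfer_cost


# Example: a 1ms map_overlap split that unblocks ~2s of downstream work is
# worth moving to an idle worker even though the split itself is tiny.
print(steal_value([1.0, 1.0], thief_occupancy=0.0,
                  victim_occupancy=30.0, transfer_cost=0.05) > 0)  # True
```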

@@ -3441,7 +3441,7 @@ def get_task_duration(self, ts: TaskState) -> double:
         """
         duration: double = ts._prefix._duration_average
         if duration >= 0:
-            return duration
+            return max(0.005, duration)
Collaborator

This seems like a large change to occupancy calculations that would affect things beyond stealing.

Member Author

That's true. It's not unreasonable though. getitem tasks have recorded durations of around 62µs, which I don't think reflects the general per-task overhead in a real system. It probably makes sense for us to set a minimum on our compute time estimates more broadly.
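
As a rough illustration of what such a floor does to occupancy estimates (the task count below is an assumption, not a measurement):

```python
# Back-of-the-envelope only: 10,000 getitem-like tasks with a recorded
# average duration of ~62us.
n_tasks = 10_000
measured = 62e-6

raw_occupancy = n_tasks * measured                    # ~0.62s of estimated work
floored_occupancy = n_tasks * max(0.005, measured)    # ~50s of estimated work
print(raw_occupancy, floored_occupancy)
```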

@@ -23,7 +23,7 @@
 # submission which may include code serialization. Therefore, be very
 # conservative in the latency estimation to suppress too aggressive stealing
 # of small tasks
-LATENCY = 0.1
+LATENCY = 0.01
Collaborator

This seems unrealistically fast to me (and goes against what the comment above is saying). We've talked about doing the opposite in other places #5324. If we need to artificially suppress the latency estimate, maybe that points to something else being wrong with the stealing heuristics.
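
As a worked example of how much the latency constant alone matters for fast tasks (numbers assumed; data transfer is ignored for simplicity):

```python
# With negligible data to move, the estimated transfer time is essentially
# just LATENCY, so this constant decides whether a fast task can ever look
# worth stealing.
compute_time = 0.005              # a 5ms task, the floor this PR introduces
for latency in (0.1, 0.01):       # current and proposed LATENCY values
    cost_multiplier = latency / compute_time
    print(latency, cost_multiplier)   # 0.1 -> 20.0, 0.01 -> 2.0
```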

@mrocklin
Member Author

> Stealing small tasks (like the splits of a map_overlap or a shuffle) might be a good idea in some special cases (like scale-up), but it's a bad idea in others. I'm wary of just changing some magic numbers here to make this case work.

Yeah, sorry, I probably need to learn to use the draft PR feature more often. Often I'll throw up PRs just to get thoughts on them. This is one. Please don't interpret this as "I plan to merge this right now". Indeed, based on @fjetter's comments I'm more inclined to go play with #4920.

Stealing fast tasks is often a bad idea, yes, but we make it artificially bad. We might want to relax some of those constraints.

@mrocklin
Member Author

> Isn't this change a bit like a different way of implementing #4925? There was a good reason not to merge that #4925 (comment); I imagine it might apply to this too.

Thanks for pointing me back there. I'll take a look.

@fjetter
Member

fjetter commented Apr 14, 2022

I want to emphasise that whatever we do is highly sensitive to the bandwidth measurements.
See #5324 and particularly #5324 (comment), which is still there:

if total_bytes > 1_000_000:

See also #4962 (comment).

TLDR

  • I believe we should reduce the default bandwidth and keep default latencies high. Overestimating costs (i.e. underestimating bandwidths, overestimating latencies) is safe, and we can learn and improve while the cluster is running without making any bad decisions.
  • The way the scheduler measures bandwidths is slow, biased, and depends on whether or not workers even participate in worker<->worker communication. E.g. if half of the workers were just idling, they would report a very misleading bandwidth to the scheduler, since they are weighted equally with workers under heavy load.
  • In particular, latencies, which dominate small transfers, are strongly underestimated because bandwidths are never updated for small transfers.
  • Maybe we should also start measuring latencies.

I never dared to modify this without solid measurements.
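
For reference, a minimal sketch of the cost model and the measurement bias being described here; the function names and the EWMA weights are assumptions for illustration, not the actual scheduler/worker code.

```python
def estimated_transfer_cost(nbytes: int, bandwidth: float, latency: float) -> float:
    # Underestimating latency or overestimating bandwidth both make a steal
    # look cheaper than it really is; erring the other way is the safe side.
    return latency + nbytes / bandwidth


def update_bandwidth(current: float, total_bytes: int, duration: float) -> float:
    # The bias described above: the estimate is only refreshed for large
    # transfers, so latency-dominated small transfers never correct it.
    if total_bytes > 1_000_000:
        return 0.95 * current + 0.05 * (total_bytes / duration)
    return current
```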

Successfully merging this pull request may close these issues.

Poor work scheduling when cluster adapts size