Allow stealing of fast tasks in some situations #6115

Status: Open. Wants to merge 1 commit into main.
distributed/scheduler.py (1 addition, 1 deletion)

@@ -3441,7 +3441,7 @@ def get_task_duration(self, ts: TaskState) -> double:
         """
         duration: double = ts._prefix._duration_average
         if duration >= 0:
-            return duration
+            return max(0.005, duration)
Collaborator commented:
This seems like a large change to occupancy calculations that would affect things beyond stealing.

Member (author) commented:
That's true, but it's not unreasonable. getitem tasks have measured durations of around 62 µs, which I don't think reflects their real per-task overhead. It probably makes sense for us to set a minimum on our compute-time estimates more broadly.


s: set = self._unknown_durations.get(ts._prefix._name) # type: ignore
if s is None:
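The effect of the scheduler.py change can be sketched in isolation. This is a hedged illustration, not the actual `Scheduler.get_task_duration` code; `MIN_DURATION` and `effective_duration` are made-up names:

```python
# Illustrative sketch of the duration floor proposed in this PR.
# MIN_DURATION and effective_duration are hypothetical names, not
# identifiers from distributed/scheduler.py.

MIN_DURATION = 0.005  # 5 ms floor on averaged task durations


def effective_duration(measured_average: float) -> float:
    """Clamp a tiny measured prefix average (e.g. a ~62 us getitem)
    up to the floor so per-task overhead is not underestimated."""
    return max(MIN_DURATION, measured_average)


print(effective_duration(0.000062))  # a fast getitem is floored to 0.005
print(effective_duration(0.2))       # longer tasks are unaffected
```

As the collaborator notes above, this floor feeds into occupancy calculations generally, not just stealing, since the same averaged duration drives scheduling estimates.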
distributed/stealing.py (2 additions, 3 deletions)

@@ -23,7 +23,7 @@
 # submission which may include code serialization. Therefore, be very
 # conservative in the latency estimation to suppress too aggressive stealing
 # of small tasks
-LATENCY = 0.1
Member commented:
I don't think this is realistic, see #5390

+LATENCY = 0.01
Collaborator commented:
This seems unrealistically fast to me, and it goes against what the comment above it says. We've talked about doing the opposite in other places (#5324). If we need to artificially suppress the latency estimate, that may point to something else being wrong with the stealing heuristics.


logger = logging.getLogger(__name__)

@@ -199,8 +199,7 @@ def steal_time_ratio(self, ts):

         ws = ts.processing_on
         compute_time = ws.processing[ts]
-        if compute_time < 0.005:  # 5ms, just give up
-            return None, None
+        compute_time = max(compute_time, 0.010)
Member commented:
I've been down this road before, see #4920

I also noticed that the way we measure bandwidths is very skewed, which impacts this significantly; see #4962 (comment). I'm not sure whether that is still relevant, but it was significant back then.

I still think #4920 is worth getting in but I had to drop this back then because deadlocks popped up.


nbytes = ts.get_nbytes_deps()
transfer_time = nbytes / self.scheduler.bandwidth + LATENCY
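Putting the visible stealing.py pieces together, the transfer-versus-compute trade-off that `steal_time_ratio` weighs can be sketched roughly as follows. This is an assumed reconstruction from the diff alone; `steal_cost_ratio` is a made-up name, and the real function contains more logic than this:

```python
# Rough sketch of the steal cost computation visible in this diff.
# steal_cost_ratio is a hypothetical name; the real steal_time_ratio
# in distributed/stealing.py wraps this in additional checks.

LATENCY = 0.1  # seconds; the conservative value this PR lowers to 0.01


def steal_cost_ratio(nbytes: int, bandwidth: float, compute_time: float) -> float:
    compute_time = max(compute_time, 0.010)       # floor added by this PR
    transfer_time = nbytes / bandwidth + LATENCY  # time to move the deps
    # A ratio well above 1 means moving the dependencies costs far more
    # than simply letting the task run where it is.
    return transfer_time / compute_time


# e.g. 1 MB of dependencies over a 100 MB/s link, against a 50 ms task
ratio = steal_cost_ratio(1_000_000, 100e6, 0.050)
```

Under these assumed numbers the latency term dominates the transfer estimate, which is why lowering LATENCY makes many more fast tasks look cheap to steal.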