Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Paused workers shouldn't steal tasks #5665

Merged
merged 6 commits into from Jan 24, 2022
Merged

Conversation

crusaderky
Copy link
Collaborator

Closes #5664

idle = s.idle.values()
saturated = s.saturated
# Paused and closing workers must never become thieves
idle = [ws for ws in s.idle.values() if ws.status == Status.running]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes the big-o from O(1) to O(n) of the number of workers, for tasks without restrictions, and only if there are some saturated workers (otherwise the time of topk below, which is O(n), dominates).
I'm unconcerned about this - even with 1k~10k workers, this method runs every 100ms so this should be inconsequential.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intention here is to filter out paused workers. I'm wondering why a paused worker would be classified as idle in the first place. Is paused and idle not counterintuitive?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An idle worker is a worker that isn't processing any tasks.
A paused worker is a worker that isn't going to process any new tasks beyond the ones currently running (not queued) on it.
According to this definition, a paused worker may or may not be idle.

We could, of course, change the definition of idle to "not processing any tasks AND a prime candidate for new tasks", but I feel that would be counter-intuitive.

@@ -816,28 +817,47 @@ async def test_steal_twice(c, s, a, b):

while len(s.tasks) < 100: # tasks are all allocated
await asyncio.sleep(0.01)
# Wait for b to start stealing tasks
while len(b.tasks) < 30:
await asyncio.sleep(0.01)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

without this, the title of the test, "steal twice", makes no sense!


# Army of new workers arrives to help
workers = await asyncio.gather(*(Worker(s.address, loop=s.loop) for _ in range(20)))
workers = await asyncio.gather(*(Worker(s.address) for _ in range(20)))
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: the changes to stealing.py mean that these new workers will start stealing slightly later than before, since they now have to wait to reach running status. It remains to be seen if this causes flakiness.

@crusaderky crusaderky marked this pull request as ready for review January 17, 2022 18:29
@crusaderky
Copy link
Collaborator Author

crusaderky commented Jan 17, 2022

test_close_gracefully has become very flaky - must investigate.

@crusaderky
Copy link
Collaborator Author

The flakiness in test_close_gracefully is unrelated to this change - see #5677.
Ready for final review and merge.

@crusaderky crusaderky marked this pull request as ready for review January 21, 2022 12:28
@crusaderky crusaderky self-assigned this Jan 21, 2022
crusaderky added a commit to crusaderky/distributed that referenced this pull request Jan 21, 2022
@fjetter fjetter merged commit ccad288 into dask:main Jan 24, 2022
@crusaderky crusaderky deleted the AMM/paused_steal branch January 24, 2022 15:01
crusaderky added a commit to crusaderky/distributed that referenced this pull request Jan 28, 2022
commit 14bddba1482de035ca416368f918d94b115d3660
Merge: df079fc 9a66b71
Author: crusaderky <crusaderky@gmail.com>
Date:   Tue Jan 25 17:16:58 2022 +0000

    Merge branch 'main' into AMM/RetireWorker

commit df079fc
Merge: 9676934 5835ce3
Author: crusaderky <crusaderky@gmail.com>
Date:   Mon Jan 24 15:08:51 2022 +0000

    Merge branch 'AMM/staging' into AMM/RetireWorker

commit 5835ce3
Merge: ee68b02 b0b8e95
Author: crusaderky <crusaderky@gmail.com>
Date:   Mon Jan 24 15:07:27 2022 +0000

    Merge branch 'AMM/test_close_gracefully' into AMM/staging

commit ee68b02
Merge: 7d4e6ee ccad288
Author: crusaderky <crusaderky@gmail.com>
Date:   Mon Jan 24 15:06:50 2022 +0000

    Merge branch 'main' into AMM/staging

commit b0b8e95
Author: crusaderky <crusaderky@gmail.com>
Date:   Mon Jan 24 14:32:29 2022 +0000

    Code review

commit 9676934
Author: crusaderky <crusaderky@gmail.com>
Date:   Fri Jan 21 14:50:53 2022 +0000

    AMM to manage retire_workers()

commit 7d4e6ee
Author: crusaderky <crusaderky@gmail.com>
Date:   Fri Jan 21 14:39:41 2022 +0000

    Fix flaky test_close_gracefully and test_lifetime (dask#5677)

commit 7faab51
Author: crusaderky <crusaderky@gmail.com>
Date:   Fri Jan 21 14:39:41 2022 +0000

    harden test

commit aef3b71
Author: crusaderky <crusaderky@gmail.com>
Date:   Fri Jan 21 14:29:10 2022 +0000

    Increase resilience on slow CI

commit af84e40
Author: crusaderky <crusaderky@gmail.com>
Date:   Fri Jan 21 12:39:18 2022 +0000

    Dump cluster state on all test failures (dask#5674)

commit 5054c19
Author: crusaderky <crusaderky@gmail.com>
Date:   Fri Jan 21 12:38:38 2022 +0000

    Paused workers shouldn't steal tasks (dask#5665)

commit eadb35f
Author: crusaderky <crusaderky@gmail.com>
Date:   Fri Jan 21 12:37:48 2022 +0000

    transition flight to missing if no who_has (dask#5653)

commit 581aee8
Merge: 940bb45 60c0d60
Author: crusaderky <crusaderky@gmail.com>
Date:   Fri Jan 21 12:36:09 2022 +0000

    Merge branch 'main' into AMM/test_close_gracefully

commit 940bb45
Author: crusaderky <crusaderky@gmail.com>
Date:   Fri Jan 21 12:20:10 2022 +0000

    tweak comment

commit 731d132
Author: crusaderky <crusaderky@gmail.com>
Date:   Fri Jan 21 12:12:03 2022 +0000

    Fix flaky test_close_gracefully and test_lifetime
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Paused workers shouldn't steal tasks (flaky test_avoid_paused_workers)
2 participants