Deadlock on workers reaching memory.pause threshold #5235

Closed
gerrymanoim opened this issue Aug 19, 2021 · 5 comments
Labels
needs info Needs further information from the user

Comments

@gerrymanoim
Contributor

Apologies - I'm not sure exactly how to provide a code example that triggers this condition, but here's what I observed:

Sometimes at the end of long runs with thousands of tasks, I've found straggler tasks that seem to be stuck on workers. These workers appear to be at the memory.pause threshold (0.8), and the number of stuck tasks equals the number of threads available to the dask worker. The workers are heartbeating just fine, but don't seem to be doing anything with the tasks they're processing (the call stacks for each task are blank). Other workers aren't stealing these tasks. When I kill the workers, the scheduler reassigns those tasks and everything completes as normal.
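For anyone trying to spot the same symptoms, here is a minimal sketch of the check described above: flag workers whose memory use is at or above the pause fraction and whose count of processing tasks equals their thread count. It assumes you have already collected per-worker stats yourself. `WorkerSnapshot`, its fields, and `suspected_stuck` are hypothetical names for illustration, not part of the distributed API, and the addresses and sizes are made-up sample values.

```python
from dataclasses import dataclass

# The memory.pause fraction mentioned in the report.
PAUSE_FRACTION = 0.8


@dataclass
class WorkerSnapshot:
    """Hypothetical per-worker stats; not a distributed data structure."""
    address: str
    memory_used: int   # bytes currently in use
    memory_limit: int  # bytes available to the worker
    nthreads: int      # threads available to the worker
    processing: int    # tasks the scheduler lists as processing there


def suspected_stuck(workers, pause_fraction=PAUSE_FRACTION):
    """Return addresses of workers that match the reported symptoms:
    memory at or above the pause fraction, one processing task per thread."""
    flagged = []
    for w in workers:
        at_pause = (
            w.memory_limit > 0
            and w.memory_used / w.memory_limit >= pause_fraction
        )
        threads_full = w.processing == w.nthreads
        if at_pause and threads_full:
            flagged.append(w.address)
    return flagged


# Made-up sample data: the first worker is above 80% memory with all
# four threads occupied, the second is well below the threshold.
snapshots = [
    WorkerSnapshot("tcp://10.0.0.1:4000", 13 * 2**30, 16 * 2**30, 4, 4),
    WorkerSnapshot("tcp://10.0.0.2:4000", 6 * 2**30, 16 * 2**30, 4, 2),
]
print(suspected_stuck(snapshots))
```

This only matches the pattern from the report. It does not confirm that a worker is actually paused, so treat a hit as a hint to inspect that worker rather than as a diagnosis.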

@ncclementi added the "needs info" (Needs further information from the user) label on Sep 17, 2021
@jrbourbeau
Member

@fjetter @crusaderky, by chance, have either of you run into this scenario before?

@crusaderky
Collaborator

crusaderky commented Oct 14, 2021

Yes. See #3761. It will be fixed within the next few weeks.

@gerrymanoim
Contributor Author

Thanks! That's via #5381?

@crusaderky
Collaborator

It will likely be a separate PR

@jrbourbeau
Member

Thanks @crusaderky! Closing as a duplicate of #3761
