I wonder if there is interesting information in the logs of these workers.
Typically tools like SGE or PBS will store logs somewhere. Those logs
might contain valuable clues about what is going wrong.
On Tue, Mar 23, 2021 at 4:39 AM Thomas Zilio wrote:
I did not have much luck on Stack Overflow
<https://stackoverflow.com/questions/66221545/dask-handling-unresponsive-workers>
so I'm trying here :)
When using Dask with SGE or PBS clusters, some workers occasionally become
unresponsive.
These workers are highlighted in red in the dashboard's Info section, and
their "Last seen" value keeps increasing.
I know this can happen when submitted tasks hold the GIL for too long, but
that's not the case here. I'm talking about workers for which something
went wrong (probably unrelated to Dask or the task itself).
They never come back, and they are not detected as dead either.
The problem is that tasks submitted to these workers never finish and block
everything. (The workers become unresponsive after receiving a task, maybe
while loading the environment.)
Is there a setting that lets me time out or invalidate a worker once it has
been unresponsive for a given amount of time?
If not, is it possible to do this invalidation manually and redistribute the
remaining tasks to other workers, and what would be the recommended way?
Thanks in advance for any help with this issue.
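
For the manual route asked about above, here is a rough sketch rather than a confirmed recipe. It assumes that `Client.scheduler_info()["workers"]` maps each worker address to a dict carrying a `last_seen` Unix timestamp, and that `Client.retire_workers` is acceptable for dropping the worker. The names `find_stale_workers`, `retire_stale_workers`, and `max_silence` are made up for illustration. Whether retirement actually completes against a worker that is truly hung is not something this sketch can guarantee.

```python
import time


def find_stale_workers(workers, max_silence=300.0, now=None):
    """Return addresses of workers not heard from for more than max_silence seconds.

    Assumption: `workers` has the shape of Client.scheduler_info()["workers"],
    i.e. {address: {"last_seen": <Unix timestamp>, ...}}.
    """
    now = time.time() if now is None else now
    return sorted(
        addr
        for addr, info in workers.items()
        # Workers without a "last_seen" entry are treated as fresh.
        if now - info.get("last_seen", now) > max_silence
    )


def retire_stale_workers(client, max_silence=300.0):
    """Ask the scheduler to retire stale workers and return their addresses.

    Assumption: `client` is a dask.distributed.Client (or anything exposing
    scheduler_info() and retire_workers() in the same way).
    """
    stale = find_stale_workers(client.scheduler_info()["workers"], max_silence)
    if stale:
        client.retire_workers(workers=stale, close_workers=True)
    return stale
```

Something like this could be called periodically from the client process, for example `retire_stale_workers(client, max_silence=600)`. For the automatic route, Dask's configuration also has a `distributed.scheduler.worker-ttl` key that appears intended for this situation; its default and exact behaviour depend on the installed version, so it is worth checking against your own setup.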
Answer selected by Thomas-Z