Replies: 1 comment · 1 reply
In all cases or only occasionally?
👋 I've been debugging this for north of 3 days now. 😅 The cluster in question uses SLURM, and from what we (the sysadmins and I) can tell, something within Dask is triggering `q.result()` to pass along a `distributed.scheduler.KilledWorker` exception.

Configuration:

```
$ python -V
Python 3.9.7
$ conda list | grep dask
dask           2021.11.2   pyhd8ed1ab_0   conda-forge
dask-core      2021.11.2   pyhd8ed1ab_0   conda-forge
dask-jobqueue  0.7.3       pyhd8ed1ab_0   conda-forge
```
High-level view of the execution process:

- `itertools.product(...parameters)` generates the parameter combinations, which are mapped to `task_fn` across workers.
- Within `task_fn`, the following occurs: the `tuple(parameter_combo)` is unpacked, and the corresponding container run is launched via `subprocess.Popen` (see the sketch just after this list).
- I considered `fire_and_forget`, but it wasn't clear if I could "track" execution progress with it, like I can with `Futures`.
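A minimal sketch of that pattern, assuming a `dask_jobqueue.SLURMCluster` and a Singularity-style container launch; the parameter grid, resource sizes, parameter names, and the `singularity run` command line are illustrative assumptions, not the actual setup:

```python
# Sketch only: maps task_fn over itertools.product combinations and
# tracks progress with Futures (which fire_and_forget would not allow).
import itertools
import subprocess

from dask.distributed import Client, as_completed
from dask_jobqueue import SLURMCluster

def task_fn(parameter_combo):
    # Unpack the (hypothetical) parameter tuple.
    alpha, beta = tuple(parameter_combo)
    # Launch the containerized run as a subprocess and wait for it.
    proc = subprocess.Popen(
        ["singularity", "run", "docker://hello-world", str(alpha), str(beta)],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    out, _err = proc.communicate()
    return proc.returncode, out

cluster = SLURMCluster(cores=1, memory="4GB", walltime="01:00:00")
cluster.scale(jobs=4)
client = Client(cluster)

parameters = [[0.1, 0.2], [1, 2, 3]]  # hypothetical parameter grid
futures = client.map(task_fn, list(itertools.product(*parameters)))

# Unlike fire_and_forget, Futures let you watch tasks as they finish.
for future in as_completed(futures):
    print(future.key, future.status)
```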
Notes:

- When I check the `dask-worker` JobIDs in SLURM, they're cancelled by `dask` itself, so it appears SLURM isn't the culprit.
- I can run the `docker://hello-world` container with no issues.

My questions:
1. How do I debug `distributed.scheduler.KilledWorker: ('task_fn-1502c60a8c470b4c166195a1dbb36d6a', <WorkerState 'tcp://10.31.176.82:45948', name: spatial-pragmatics-54442265-1-4, status: closed, memory: 0, processing: 3>)`? I don't seem able to inspect the `parameter_combo` that triggered it, and this usually doesn't happen until fairly late in the execution cycle.
2. What makes `dask` trigger the `close_job()` function(s) on workers?
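For question 1, one way to at least tie the failing task back to its `parameter_combo` is to keep a client-side map from future keys to arguments. A sketch, continuing from the hypothetical names above (`client`, `task_fn`, `parameters`):

```python
# Keep a key -> arguments map so the failing parameter_combo can be
# recovered after a KilledWorker. Sketch only, not the actual code.
import itertools

from dask.distributed import wait
from distributed.scheduler import KilledWorker

combos = list(itertools.product(*parameters))
futures = client.map(task_fn, combos)
combo_for_key = {f.key: c for f, c in zip(futures, combos)}

wait(futures)  # returns once every task has finished or errored
for f in futures:
    if f.status == "error":
        try:
            f.result()  # re-raises the stored exception
        except KilledWorker:
            # The task key in the exception matches f.key, so the map
            # gives back the arguments of the task whose worker died.
            print("killed worker on:", f.key, "combo:", combo_for_key[f.key])
        except Exception as exc:
            print("task failed:", f.key, "combo:", combo_for_key[f.key], exc)
```

This keeps everything trackable from the client side, which is the progress visibility `fire_and_forget` would have given up.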
Output of `cluster.job_script()`:

In the particular run I'm referencing, total walltime was about 59m.