Replies: 1 comment · 1 reply
In all cases or only occasionally?
👋 I've been debugging this for north of 3 days now. 😅 The cluster in question uses SLURM, and from what we (the sysadmins and I) can tell, something within Dask is triggering `q.result()` to pass along a `distributed.scheduler.KilledWorker` exception.

Configuration:

```
$ python -V
Python 3.9.7
$ conda list | grep dask
dask           2021.11.2   pyhd8ed1ab_0   conda-forge
dask-core      2021.11.2   pyhd8ed1ab_0   conda-forge
dask-jobqueue  0.7.3       pyhd8ed1ab_0   conda-forge
```
High-level view of the execution process:

- `itertools.product(...parameters)` generates the parameter combinations, which are mapped to `task_fn` across workers.
- Within `task_fn`, the following occurs: the `tuple(parameter_combo)` is unpacked, and the corresponding container run is launched via `subprocess.Popen` (see the sketch just after this list).
- I considered `fire_and_forget`, but it wasn't clear if I could "track" execution progress with it, like I can with `Futures`.
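A minimal sketch of that pattern, assuming a `dask_jobqueue.SLURMCluster` and a Singularity-style container launch; the parameter grid, resource sizes, parameter names, and the `singularity run` command line are illustrative assumptions, not the actual setup:

```python
# Sketch only: maps task_fn over itertools.product combinations and
# tracks progress with Futures (which fire_and_forget would not allow).
import itertools
import subprocess

from dask.distributed import Client, as_completed
from dask_jobqueue import SLURMCluster

def task_fn(parameter_combo):
    # Unpack the (hypothetical) parameter tuple.
    alpha, beta = tuple(parameter_combo)
    # Launch the containerized run as a subprocess and wait for it.
    proc = subprocess.Popen(
        ["singularity", "run", "docker://hello-world", str(alpha), str(beta)],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    out, _err = proc.communicate()
    return proc.returncode, out

cluster = SLURMCluster(cores=1, memory="4GB", walltime="01:00:00")
cluster.scale(jobs=4)
client = Client(cluster)

parameters = [[0.1, 0.2], [1, 2, 3]]  # hypothetical parameter grid
futures = client.map(task_fn, list(itertools.product(*parameters)))

# Unlike fire_and_forget, Futures let you watch tasks as they finish.
for future in as_completed(futures):
    print(future.key, future.status)
```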
Notes:

- When I check the `dask-worker` JobIDs in SLURM, they're cancelled by `dask` itself, so it appears SLURM isn't the culprit.
- I can run the `docker://hello-world` container with no issues.

My questions:
1. How do I debug `distributed.scheduler.KilledWorker: ('task_fn-1502c60a8c470b4c166195a1dbb36d6a', <WorkerState 'tcp://10.31.176.82:45948', name: spatial-pragmatics-54442265-1-4, status: closed, memory: 0, processing: 3>)`? I don't seem able to inspect the `parameter_combo` that triggered it, and this usually doesn't happen until fairly late in the execution cycle.
2. What makes `dask` trigger the `close_job()` function(s) on workers?
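For question 1, one way to at least tie the failing task back to its `parameter_combo` is to keep a client-side map from future keys to arguments. A sketch, continuing from the hypothetical names above (`client`, `task_fn`, `parameters`):

```python
# Keep a key -> arguments map so the failing parameter_combo can be
# recovered after a KilledWorker. Sketch only, not the actual code.
import itertools

from dask.distributed import wait
from distributed.scheduler import KilledWorker

combos = list(itertools.product(*parameters))
futures = client.map(task_fn, combos)
combo_for_key = {f.key: c for f, c in zip(futures, combos)}

wait(futures)  # returns once every task has finished or errored
for f in futures:
    if f.status == "error":
        try:
            f.result()  # re-raises the stored exception
        except KilledWorker:
            # The task key in the exception matches f.key, so the map
            # gives back the arguments of the task whose worker died.
            print("killed worker on:", f.key, "combo:", combo_for_key[f.key])
        except Exception as exc:
            print("task failed:", f.key, "combo:", combo_for_key[f.key], exc)
```

This keeps everything trackable from the client side, which is the progress visibility `fire_and_forget` would have given up.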
Output of `cluster.job_script()`:

In the particular run I'm referencing, total walltime was about 59m.