SLURMCluster - declaring multiple nodes within single sbatch command
#7467
-
Hi! I'm working on my MSc project, in which I'm using the Slurm HPC cluster available at my university for distributed computing. Recently I've been experiencing some issues with resource allocation, and the supercomputer's helpdesk indicated that the reason might be how I request resources.

For executing one experiment I need 600 cores for about an hour. Here one node contains 24 cores, so my dask-jobqueue setup looks like this:

```python
cluster = SLURMCluster(queue=os.environ['PARTITION'],
                       project=os.environ['GRANT'],
                       cores=24,
                       processes=24,
                       memory='48 GB',
                       walltime='01:00:00',
                       interface='ib0',
                       scheduler_options={'interface': 'eth55'})
print(cluster.job_script())
cluster.scale(jobs=25)
```

While executing a few experiments I ran into some issues with resource acquisition. After a few runs, the time spent in the PENDING state became much longer, and right now I need to wait 16+ hours to get resources. I asked the cluster's helpdesk about it, and the answer was that if a single experiment relies on multiple nodes, they should all be requested within a single job. They pointed out that the correct way to acquire 600 cores is to use options such as `--nodes` in one `sbatch` submission.

My question is: how can I declare a 600-core allocation within a single `sbatch` command?

Thank you for any guidance and help!
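For reference, the arithmetic behind these numbers (a standalone sketch, not part of my actual script — with `cores=24`, each dask-jobqueue job corresponds to one `sbatch` submission on one node):

```python
# Pure-Python illustration of the allocation described above.
total_cores = 600      # cores needed for one experiment
cores_per_node = 24    # cores available on a single node of this cluster

# One dask-jobqueue job == one node here, so reaching 600 cores
# means 25 separate, independently queued sbatch jobs.
jobs = total_cores // cores_per_node
assert total_cores % cores_per_node == 0

print(jobs)  # 25 -> cluster.scale(jobs=25)
```

This is exactly why the queue treats my experiment as 25 small jobs rather than one 600-core request.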
Replies: 1 comment 1 reply
-
@mtsokol, as I understand it, `dask-jobqueue` won't help you in this particular case, because `dask-jobqueue` assumes that the settings defined in `SLURMCluster()` correspond to a single job on one node. As far as `dask-jobqueue` is concerned, passing `-N 25` or `--nodes=25` to `SLURMCluster` doesn't make sense.

For workloads like this, where you want to submit one job and get results after an hour, I'd recommend giving `dask-mpi` a chance. With `dask-mpi` you should be able to customize how resources are allocated by your batch queueing system (for instance, you can specify `-N 25` and get the 600 cores in one big job). A batch script could look like this:
```bash
#!/bin/bash -l
#SBATCH -J dask-mpi-job
#SBATCH -p plgrid
#SBATCH -A plg......
#SBATCH --cpus-per-task=24
#SBATCH --mem=1118G
#SBATCH -t 01:00:00
#SBATCH --nodes=25

echo "Running Dask-MPI"
source activate my_conda_env_name  # Activate the conda environment
mpirun -np 600 python myscript.py
```
And `myscript.py` would initialize Dask-MPI and connect a client:

```python
from dask_mpi import initialize

initialize()

from dask.distributed import Client

client = Client()  # Connect this local process to the remote scheduler and workers

# The rest of your code goes here
# .....
```
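One detail worth double-checking against your `dask-mpi` version (this reflects its documented default rank layout, not something stated in the thread): with `initialize()`, rank 0 runs the scheduler and rank 1 runs the client script, so not every MPI rank becomes a worker:

```python
# Sketch of dask-mpi's default rank layout (assumption -- verify for your version):
#   rank 0        -> Dask scheduler
#   rank 1        -> runs the client code in myscript.py
#   ranks 2..n-1  -> Dask workers
mpi_ranks = 600              # from `mpirun -np 600`
workers = mpi_ranks - 2      # ranks left over for actual compute

print(workers)  # 598
```

If you need exactly 600 compute workers, you would request two extra ranks accordingly.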
And then submit the batch job: `sbatch dask-mpi-batch.sh`. I hope this helps.