SLURMCluster - declaring multiple nodes within single sbatch command
#7467
-
Hi! I'm working on my MSc project, in which I'm using the Slurm HPC cluster available at my university for distributed computing. Recently I've been experiencing some issues with resource allocation, and the supercomputer's helpdesk indicated that the reason might be how I request resources.

For executing one experiment I need 600 cores for about an hour. Here one node contains 24 cores, so my dask-jobqueue setup looks like this:

```python
cluster = SLURMCluster(queue=os.environ['PARTITION'],
                       project=os.environ['GRANT'],
                       cores=24,
                       processes=24,
                       memory='48 GB',
                       walltime='01:00:00',
                       interface='ib0',
                       scheduler_options={'interface': 'eth55'})
print(cluster.job_script())
cluster.scale(jobs=25)
```

While executing a few experiments I ran into some issues with resource acquisition. After a few runs, the time spent in the PENDING state became much longer, and right now I need to wait 16+ hours to get resources. I asked the cluster's helpdesk about it, and the answer was that if a single experiment relies on multiple nodes, they should all be requested within a single job. They pointed out that the correct way to acquire 600 cores is to use options such as `--nodes` in one `sbatch` submission.

My question is: how can I declare a 600-core allocation within a single `sbatch` command?

Thank you for any guidance and help!
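For reference, the arithmetic behind these numbers (a standalone sketch, not part of my actual script — with `cores=24`, each dask-jobqueue job corresponds to one `sbatch` submission on one node):

```python
# Pure-Python illustration of the allocation described above.
total_cores = 600      # cores needed for one experiment
cores_per_node = 24    # cores available on a single node of this cluster

# One dask-jobqueue job == one node here, so reaching 600 cores
# means 25 separate, independently queued sbatch jobs.
jobs = total_cores // cores_per_node
assert total_cores % cores_per_node == 0

print(jobs)  # 25 -> cluster.scale(jobs=25)
```

This is exactly why the queue treats my experiment as 25 small jobs rather than one 600-core request.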
Replies: 1 comment 1 reply
-
@mtsokol, as I understand it, `dask-jobqueue` won't help you in this particular case, because `dask-jobqueue` assumes that the settings defined in `SLURMCluster()` correspond to a single job on one node. As far as `dask-jobqueue` is concerned, passing `-N 25` or `--nodes=25` to `SLURMCluster` doesn't make sense.

For workloads like this, where you want to submit one job and get results after an hour, I'd recommend giving `dask-mpi` a chance. With `dask-mpi` you should be able to customize how resources are allocated by your batch queueing system (for instance, you can specify `-N 25` and get the 600 cores in one big job). A batch script could look like this:
```bash
#!/bin/bash -l
#SBATCH -J dask-mpi-job
#SBATCH -p plgrid
#SBATCH -A plg......
#SBATCH --cpus-per-task=24
#SBATCH --mem=1118G
#SBATCH -t 01:00:00
#SBATCH --nodes=25

echo "Running Dask-MPI"
source activate my_conda_env_name  # Activate the conda environment
mpirun -np 600 python myscript.py
```
And `myscript.py` would initialize Dask-MPI and connect a client:

```python
from dask_mpi import initialize

initialize()

from dask.distributed import Client

client = Client()  # Connect this local process to the remote scheduler and workers

# The rest of your code goes here
# .....
```
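One detail worth double-checking against your `dask-mpi` version (this reflects its documented default rank layout, not something stated in the thread): with `initialize()`, rank 0 runs the scheduler and rank 1 runs the client script, so not every MPI rank becomes a worker:

```python
# Sketch of dask-mpi's default rank layout (assumption -- verify for your version):
#   rank 0        -> Dask scheduler
#   rank 1        -> runs the client code in myscript.py
#   ranks 2..n-1  -> Dask workers
mpi_ranks = 600              # from `mpirun -np 600`
workers = mpi_ranks - 2      # ranks left over for actual compute

print(workers)  # 598
```

If you need exactly 600 compute workers, you would request two extra ranks accordingly.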
And then submit the batch job: `sbatch dask-mpi-batch.sh`. I hope this helps.