NYU is unable to use recent idmtools-platform-slurm code while logged into a compute node, e.g.
srun --nodes=1 --ntasks-per-node=1 --time=04:00:00 --partition=a100_dev --pty bash -i
They do this to limit the impact that running calibration and similar workloads has on their head/login nodes.
Currently, jobs are submitted (as configured in idmtools.ini) to a different partition than the a100_dev partition above:
partition = cpu_short
However, running their jobs AFTER issuing such a command produces the error shown in the image below. The error occurs at the experiment level (in an experiment directory's stderr.txt).
The REALLY funky thing is that ONE simulation runs to completion while the others fail, despite having this in my platform instantiation:
Is there a way around this, given how the platform code is written? Or will there be a restriction requiring users to srun (as above) into the partition they intend to run on?
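For context, the partition setting above lives in a platform block in idmtools.ini. A rough sketch of such a block follows; only `partition = cpu_short` is taken from this report, and the section name, `type`, and `job_directory` path are illustrative assumptions:

```ini
; Hypothetical idmtools.ini platform block (names/paths are assumptions)
[SLURM_LOCAL]
type = SLURM
job_directory = /scratch/myuser/experiments
partition = cpu_short
```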
I lean toward limiting users to srun for now. I almost think this is more a symptom of calibration needing a rethink to be more stateless and less resource-intensive.
We should also look into whether custom sbatch commands could help here.
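One workaround along those lines could be sketched as follows. This assumes the failures are caused by the SLURM_* environment variables that the interactive srun session exports and that a subsequent sbatch submission then inherits (e.g. steering jobs back toward a100_dev instead of cpu_short); this has not been verified against the idmtools code itself, and `run_experiment.sh` is a hypothetical job script name:

```shell
#!/bin/sh
# Clear SLURM_* variables inherited from the interactive srun session
# before submitting, so sbatch does not pick up the compute node's
# allocation settings.
for var in $(env | awk -F= '/^SLURM_/ {print $1}'); do
    unset "$var"
done

# Alternatively, sbatch's --export flag can limit environment inheritance,
# combined with naming the intended partition explicitly:
#   sbatch --export=NONE --partition=cpu_short run_experiment.sh

echo "SLURM_* vars remaining: $(env | grep -c '^SLURM_')"
```

If clearing the inherited environment turns out to be the fix, the platform's sbatch invocation could do the equivalent internally rather than requiring users to sanitize their shell by hand.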
I have verified that their Slurm version is 23.02.3.
From: https://harvardmed.atlassian.net/wiki/spaces/O2/pages/1586793613/Troubleshooting+Slurm+Jobs