slurm platform -- unable to run from compute nodes as expected #2145

Open
ckirkman-IDM opened this issue Oct 16, 2023 · 1 comment
@ckirkman-IDM (Contributor)

NYU is unable to use recent idmtools-platform-slurm code while logged into a compute node, e.g.
srun --nodes=1 --ntasks-per-node=1 --time=04:00:00 --partition=a100_dev --pty bash -i

They do this to limit the impact that running calibration (and similar workloads) has on their head/login nodes.

Currently, jobs are run (indicated in idmtools.ini) on a different partition (not a100_dev, above):

partition = cpu_short

However, running their jobs AFTER entering such a session leads to the error shown in the screenshot below. The error occurs at the experiment level (in an experiment directory's stderr.txt).

The REALLY funky thing is that exactly ONE simulation runs and completes successfully while the others fail, despite having this in my platform instantiation:

calib_manager.platform = Platform(args.platform, max_running_jobs=1000000, array_batch_size=1000000)

I have verified that their Slurm version is 23.02.3.

From:
https://harvardmed.atlassian.net/wiki/spaces/O2/pages/1586793613/Troubleshooting+Slurm+Jobs
[screenshot of the Slurm error output; image not preserved]

Is there a way around this in how the platform code is written? Or will there be a limitation requiring users, e.g., to srun (as above) into the partition they intend to run on?
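One possible workaround, sketched under the assumption that the jobs should land on cpu_short (the partition named in idmtools.ini above): open the interactive session on that same partition, since srun issued inside an existing allocation launches a job step within that allocation rather than a new job, and a mismatch between the interactive partition and the target partition can then surface as submission errors. The flag values below are illustrative, not prescriptive.

```shell
# Hypothetical sketch: request the interactive shell on the same
# partition that idmtools.ini targets (cpu_short) instead of a100_dev,
# so nested submissions run in a consistent environment.
srun --nodes=1 --ntasks-per-node=1 --time=04:00:00 \
     --partition=cpu_short --pty bash -i
```

This requires a live Slurm cluster, so it is shown here only as a command-line fragment.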

@devclinton (Member)

For now, I lean toward limiting users to srun into the matching partition. I almost think this is more a symptom of calibration needing a rethink to be more stateless and less resource intensive.

We should also look into whether custom sbatch options could help here.
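If custom sbatch options do turn out to help, one possibility (purely a sketch; idmtools' actual job-script generation may differ, and the entry-point script name below is a placeholder) is to pin the partition and resources explicitly in the batch script so the submission does not silently inherit settings from the interactive a100_dev session:

```shell
#!/bin/bash
# Hypothetical sbatch wrapper: pin the target partition explicitly
# rather than relying on the environment of the interactive session.
#SBATCH --partition=cpu_short
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=04:00:00

# Run the experiment's entry point (placeholder command).
srun bash run_simulation.sh
```

This is a job-script/config fragment and only runs under a Slurm cluster.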

@ZDu-IDM ZDu-IDM self-assigned this Jan 24, 2024