NYU is unable to use recent idmtools-platform-slurm code while logged into a compute node, e.g.
srun --nodes=1 --ntasks-per-node=1 --time=04:00:00 --partition=a100_dev --pty bash -i
They do this to limit the impact that running calibration and similar workloads has on their head/login nodes.
Currently, jobs are submitted (as configured in idmtools.ini) to a different partition than the a100_dev partition above:
partition = cpu_short
However, running their jobs AFTER issuing such a command produces the error shown in the image below. The error occurs at the experiment level (in an experiment directory's stderr.txt).
The REALLY funky thing is that ONE simulation runs to completion while the others fail, despite having this in my platform instantiation:
Is there a way around this, given how the platform code is written? Or will there be a restriction requiring users to srun (as above) into the partition they intend to run on?
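For context, the partition setting above lives in a platform block in idmtools.ini. A rough sketch of such a block follows; only `partition = cpu_short` is taken from this report, and the section name, `type`, and `job_directory` path are illustrative assumptions:

```ini
; Hypothetical idmtools.ini platform block (names/paths are assumptions)
[SLURM_LOCAL]
type = SLURM
job_directory = /scratch/myuser/experiments
partition = cpu_short
```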
I lean toward limiting users to srun for now. I almost think this is more a symptom of calibration needing a rethink to be more stateless and less resource-intensive.
We should also look into whether custom sbatch commands could help here.
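One workaround along those lines could be sketched as follows. This assumes the failures are caused by the SLURM_* environment variables that the interactive srun session exports and that a subsequent sbatch submission then inherits (e.g. steering jobs back toward a100_dev instead of cpu_short); this has not been verified against the idmtools code itself, and `run_experiment.sh` is a hypothetical job script name:

```shell
#!/bin/sh
# Clear SLURM_* variables inherited from the interactive srun session
# before submitting, so sbatch does not pick up the compute node's
# allocation settings.
for var in $(env | awk -F= '/^SLURM_/ {print $1}'); do
    unset "$var"
done

# Alternatively, sbatch's --export flag can limit environment inheritance,
# combined with naming the intended partition explicitly:
#   sbatch --export=NONE --partition=cpu_short run_experiment.sh

echo "SLURM_* vars remaining: $(env | grep -c '^SLURM_')"
```

If clearing the inherited environment turns out to be the fix, the platform's sbatch invocation could do the equivalent internally rather than requiring users to sanitize their shell by hand.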
I have verified that their Slurm version is 23.02.3.
From: https://harvardmed.atlassian.net/wiki/spaces/O2/pages/1586793613/Troubleshooting+Slurm+Jobs