Investigation: run each simulation on SLURM with multi cores #2247

ZDu-IDM opened this issue Apr 9, 2024 · 3 comments

Comments

@ZDu-IDM

ZDu-IDM commented Apr 9, 2024

For each simulation on SLURM we run _run.sh, which contains the actual command:

singularity exec /home/zdf1921/shared/rocky_dtk_runner_py39.sif Assets/Eradication --config config.json --dll-path ./Assets --input-path ./Assets\;.

It works fine in my test example and generates the expected results in the output folder (as seen on COMPS).

Following a suggestion e-mailed by an NYU user, I changed the command to the following (adding 'mpirun -n 4'):

singularity exec /home/zdf1921/shared/rocky_dtk_runner_py39.sif mpirun -n 4 Assets/Eradication --config config.json --dll-path ./Assets --input-path ./Assets\;.

The file stdout.txt suggests that multiple cores are being used; however, every simulation failed.

Please refer to attached file stdout.txt for details.
stdout.txt

@kfrey-idm

There can be odd issues when running a simulation where num_cores > num_nodes. It looks like there was probably an unhandled exception on the rank 3 process because it didn't have any nodes assigned:

00:00:01 [3] [I] [Simulation] Rank 3 contributes 0 nodes...
00:00:01 [3] [I] [Simulation] Rank map contents not displayed until NodeRankMap::ToString() (re)implemented.
00:00:01 [3] [W] [Simulation] Rank 3 wasn't assigned any nodes! (# of procs is too big for simulation?)
00:00:01 [3] [I] [Eradication] Controller execution failed, exiting.
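
To check whether other failed runs hit the same condition, the warning above can be searched for in each stdout.txt. The following is a minimal sketch, assuming the log format matches the excerpt above; the function name find_idle_ranks is hypothetical and not part of any existing tooling.

```shell
# List the MPI ranks that logged the "wasn't assigned any nodes" warning.
# Usage: find_idle_ranks path/to/stdout.txt
# Assumes log lines shaped like the excerpt in this issue, e.g.
#   00:00:01 [3] [W] [Simulation] Rank 3 wasn't assigned any nodes! ...
find_idle_ranks() {
    grep "wasn't assigned any nodes" "$1" \
        | sed -n 's/.*Rank \([0-9][0-9]*\) wasn.*/\1/p' \
        | sort -un
}
```

Calling `find_idle_ranks stdout.txt` prints one rank number per line, and prints nothing if the warning never appears.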

For the Generic branch, I've changed the behavior so it's an explicit exception:
https://github.com/InstituteforDiseaseModeling/DtkTrunk/issues/4997

I'm not sure what the intended behavior is for the Malaria branch; we should check with @Bridenbecker.
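
Until the intended behavior is settled, one possible workaround is to cap the number of MPI ranks at the number of simulation nodes, so that no rank is left without nodes. This is only a sketch: cap_ranks is a hypothetical helper, and the node count is assumed to be known in advance (for example from the demographics file). It has not been verified against the Malaria branch.

```shell
# Print the smaller of the requested rank count and the simulation node count.
# Usage: cap_ranks REQUESTED_RANKS NODE_COUNT
# Both arguments are assumed to be positive integers supplied by the caller.
cap_ranks() {
    requested=$1
    nodes=$2
    if [ "$requested" -gt "$nodes" ]; then
        echo "$nodes"
    else
        echo "$requested"
    fi
}
```

In _run.sh this could be used as `mpirun -n "$(cap_ranks 4 "$NODE_COUNT")"`, where NODE_COUNT is a hypothetical variable holding the number of nodes in the simulation.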

@ZDu-IDM

ZDu-IDM commented Apr 17, 2024

I did more testing on the NU SLURM platform. The results are not satisfactory; I will collect them and write a test report later.

@ZDu-IDM

ZDu-IDM commented Apr 27, 2024
