
New directives syntax #819

Draft · wants to merge 34 commits into main
Conversation

b-butler
Member

Description

This PR removes and renames the supported directives in favor of a more generic structure with less ambiguous settings that maps more closely onto modern scheduler requests. One example is renaming np to n_processes, which is more self-documenting. We also anchor most other directives, such as n_threads_per_process, to n_processes as the base unit.
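
For illustration, a rough before/after sketch (values here are hypothetical; the directive names are the ones mentioned in this description):

    # Old-style directives (illustrative values only):
    directives = {"np": 8}

    # New-style directives: n_processes replaces np and serves as the anchor
    # for the other requests.
    directives = {
        "n_processes": 8,
        "n_threads_per_process": 2,  # defined relative to n_processes
    }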

Motivation and Context

There have been multiple issues arising from confusing directive behavior or naming (np versus nranks, for instance). This change clears them up, simplifies the logic to support various workflows, and appropriately disallows requests that are unlikely to work on a scheduler.


This is useful for resource aggregation across groups/bundles. We use these directives internally to help with scheduling. Even when templates are used, working with per-process values first allows more control over the resource request, and computing the totals later is cheap and simple.
Uses new directives and -per-task options.
    directives["gpus"] = directives["processes"] * directives["gpus_per_process"]
    if (memory := directives["memory_per_cpu"]) is not None:
        directives["memory"] = directives["cpus"] * memory
    else:


Georgia Tech's SLURM scheduler also uses "memory_per_gpu", so if this is the appropriate place, do we need to add the below here as well?

    elif (memory := directives["memory_per_gpu"]) is not None:
        directives["memory"] = directives["gpus"] * memory

Member


memory_per_gpu is not a directive. In this new scheme, processes is the base unit on which gpus and memory are based. If your users need a specific amount of memory per GPU, they could set:

    directives = {"processes": N, "gpus_per_process": G, "memory_per_process": memory_per_gpu * G}
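
For instance, with hypothetical numbers plugged into the pattern above:

    # Hypothetical example: 2 processes, 2 GPUs each, 16 GiB needed per GPU.
    N, G = 2, 2
    memory_per_gpu = 16  # GiB per GPU (a local variable, not a directive)
    directives = {
        "processes": N,
        "gpus_per_process": G,
        "memory_per_process": memory_per_gpu * G,  # 32 GiB per process
    }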

Member


Wait, I got that wrong. The directive is (currently) memory_per_cpu, so you would have to calculate it from the number of CPUs per GPU.

I originally suggested memory_per_cpu because it provides a 1:1 mapping to SLURM. However, @b-butler explained that the complexities of bundling and groups essentially require computing an aggregate memory and then dividing to issue the proper SLURM request. Thus, we could change the user-facing directive to memory_per_process.

In either case, the information will be there for the Georgia Tech template to produce --mem-per-gpu if that is a hard requirement.
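
A sketch of the calculation described above (all names and values hypothetical): translating a per-GPU memory requirement into the memory_per_cpu directive by accounting for the CPU:GPU ratio of the request.

    cpus_per_process = 8
    gpus_per_process = 1
    memory_per_gpu = 16  # GiB required per GPU
    memory_per_cpu = memory_per_gpu * gpus_per_process / cpus_per_process  # 2.0 GiB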



--mem-per-gpu is a variable in Georgia Tech's submission script. We can enter any custom value as long as we can set the values in the header block (see for example https://github.com/bcrawford39GT/signac_numpy_tutorial/blob/main/signac_numpy_tutorial/project/templates/phoenix.sh) without signac auto-placing a default --mem option that cannot be removed. This is the main issue: if we add anything, the default --mem option is also added, causing two conflicting options. Maybe there could be a variable to not auto-print --mem?

I think the memory_per_process option would be fine instead of --mem-per-cpu and --mem-per-gpu, as long as we can access the value in the header block (same example) for manipulation if needed, and signac does not write --mem automatically. Alternatively, maybe you are saying memory_per_process will be automatically written to --mem?

If supporting the original --mem-per-cpu and --mem-per-gpu is not feasible, that is fine. I was under the impression it was OK, not an issue, and in progress.

We will try to adapt to whatever is feasible. I am just trying to understand the viable options and direction.

Member


The new default template will always write --mem-per-cpu to the SLURM script, never --mem. We can put this in a {% block memory %} so that child scripts can override it if needed.
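
A minimal jinja2 sketch of that override mechanism (template names, variables, and values here are hypothetical, not flow's actual templates):

    import jinja2

    templates = {
        "base_slurm.sh": (
            "#!/bin/bash\n"
            "{% block memory %}"
            "#SBATCH --mem-per-cpu={{ memory_per_cpu }}\n"
            "{% endblock %}"
        ),
        # A child script overrides only the memory block.
        "phoenix.sh": (
            "{% extends 'base_slurm.sh' %}"
            "{% block memory %}"
            "#SBATCH --mem-per-gpu={{ memory_per_gpu }}\n"
            "{% endblock %}"
        ),
    }
    env = jinja2.Environment(loader=jinja2.DictLoader(templates))
    print(env.get_template("phoenix.sh").render(memory_per_gpu="16G"))
    # -> #!/bin/bash
    # -> #SBATCH --mem-per-gpu=16G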



OK. As long as the user also has an option to stop the default writing of the new --mem-per-cpu, that should be fine. We just cannot specify it more than once, or the job will error out as over-specified.

Member


I wish we didn't need memory directives at all, but we do. A number of clusters that flow supports default to a ridiculously small amount of memory per rank, so users need the ability to request more. It is the responsibility of the user to request an appropriate amount that does not result in extra charges.

The logic that flow needs to both a) combine memory requests across bundles/groups and b) provide the --mem-per-cpu needed for srun to correctly launch jobs is complex.
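
A rough sketch of the aggregation half of that problem (directive names and numbers are hypothetical): sum memory across bundled operations, then express the total as the per-CPU value that --mem-per-cpu expects.

    import math

    bundle = [
        {"cpus": 4, "memory": 8.0},   # total GiB requested by each operation
        {"cpus": 2, "memory": 16.0},
    ]
    total_cpus = sum(op["cpus"] for op in bundle)
    total_memory = sum(op["memory"] for op in bundle)
    mem_per_cpu = math.ceil(total_memory / total_cpus)  # 4 GiB per CPU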


I agree. We can work with whatever final solution works for everyone. I just wanted to state the barriers so you are aware.


I wish memory was free too, on GPUs as well! Then there would be no need to buy commercial cards.

@b-butler
Member Author

Sorry for the delay; I have uploaded my current changes.

Some things that need to be done:

  • Regenerate the template test standard (the generation script currently fails and needs to be debugged)
  • Add support for OpenMP, which requires modifying the launcher syntax
  • Test on clusters to validate correctness
  • Add new tests of the expected behavior

Once debugged, the code will support homogeneous MPI, heterogeneous non-MPI, and mixed MPI/non-MPI submissions.
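
For illustration, hypothetical per-operation shapes for those cases (directive names follow this PR's draft style and are not a final API):

    homogeneous_mpi = {"processes": 16, "threads_per_process": 1}  # launched via MPI
    non_mpi_threaded = {"processes": 1, "threads_per_process": 8}  # launched directly
    # A mixed submission bundles operations like these two in one job.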

@joaander mentioned this pull request on May 28, 2024.