
DisCoTec: Distributed Combination Technique Framework


What is DisCoTec?

This project contains DisCoTec, a code for running the distributed sparse grid combination technique with MPI parallelization. While it originated in the excellent SGpp project, the extensive parallelization has made it a very different code, so it has become a project of its own.

DisCoTec is designed as a framework that can run multiple instances of a (black-box) grid-based solver implementation. The most basic example we use is a mass-conserving FDM/FVM constant advection upwinding solver. An example of a separate, coupled solver is SeLaLib.

Sparse Grid Combination Technique with Time Stepping

The sparse grid combination technique (Griebel et al. 1992, Garcke 2013, Harding 2016) can be used to alleviate the curse of dimensionality encountered in high-dimensional simulations. Instead of using your solver on a single structured full grid (where every dimension is finely resolved), you would use it on many different structured full grids (each of them differently resolved). We call these coarsely-resolved grids component grids. Taken together, all component grids form a sparse grid approximation, which can be explicitly obtained by a linear superposition of the individual grid functions, with the so-called combination coefficients.
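Written as a formula, the combined solution is a weighted sum of the component grid solutions f_ℓ with combination coefficients c_ℓ. In one common convention for the classical two-dimensional scheme (matching the coefficients of 1 and -1 in the figure below), this reads

f^{(c)} = \sum_{\ell \in \mathcal{I}} c_\ell \, f_\ell , \qquad \text{e.g. in 2D:} \quad f^{(c)}_n = \sum_{\ell_1 + \ell_2 = n} f_\ell \;-\; \sum_{\ell_1 + \ell_2 = n-1} f_\ell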

schematic of a combination scheme in 2D

In this two-dimensional combination scheme, the combination coefficients are either 1 or -1. Figure originally published in (Pollinger 2024).

Between time steps, the grids exchange data through a multi-scale approach, which is summarized as the "combination" step in DisCoTec. Assuming a certain smoothness in the solution, this allows for a good approximation of the finely-resolved function, while achieving drastic reductions in compute and memory requirements.

Parallelism in DisCoTec

The DisCoTec framework can work with existing MPI-parallelized solver codes operating on structured grids. In addition to the parallelism provided by the solver, it adds the parallelism of the combination technique. This is achieved through process groups (pgs): MPI_COMM_WORLD is subdivided into equal-sized process groups (and, optionally, a manager rank).

schematic of MPI ranks in DisCoTec

The figure illustrates the two ways of scaling up: one can increase either the size or the number of process groups. Figure originally published in (Pollinger 2024).

All component grids are distributed to process groups (either statically, or dynamically through the manager rank). During the solver time step and most of the combination, MPI communication only happens within the process groups. Conversely, for the sparse grid reduction using the combination coefficients, MPI communication only happens between a rank and its colleagues in the other process groups, e.g., rank 0 in group 0 will only talk to rank 0 in all other groups. Thus, major bottlenecks arising from global communication can be avoided altogether.

Combining the two ways of scaling up, DisCoTec's scalability has been demonstrated on several machines, with experiments comprising up to 524,288 cores:

timings for the advection solver step on HAWK at various parallelizations; timings for the combination step on HAWK at various parallelizations

The plots show the timings (in seconds) for the advection solver step and the combination step, respectively. This weak scaling experiment uses four OpenMP threads per rank and starts with one pg of four processes in the upper left corner. The largest parallelization is 64 pgs of 2048 processes each. Figure originally published in (Pollinger 2024).

This image also describes the challenges in large-scale experiments with DisCoTec: If the process groups become too large, the MPI communication of the multiscale transform starts to dominate the combination time. If there are too many pgs, the combination reduction will dominate the combination time. However, the times required for the solver stay relatively constant; they are determined by the solver's own scaling and the load balancing quality.

There are only a few codes that allow weak scaling up to this problem size: a size that uses most of the available main memory of the entire system.

Contributing

We welcome contributions! To find a good place to start coding, have a look at the currently open issues.

  • Please describe issues and intended changes in the issue tracker.
  • Please develop new features in a new branch (typically on your own fork) and then create a pull request.
  • New features will only be merged to the main branch if they are sufficiently tested: please add unit tests in /tests.

Installing

Installation instructions: spack

DisCoTec can be installed via spack, which handles all dependencies. If you want to develop DisCoTec code or examples, the spack dev-build workflow is recommended as follows.

Clone both spack and DisCoTec to find or build the dependencies and then compile DisCoTec:

git clone git@github.com:spack/spack.git  # use https if your ssh is not set up on github
./spack/bin/spack external find  # find already-installed packages
./spack/bin/spack compiler find  # find compilers present on system
./spack/bin/spack info discotec@main  # shows DisCoTec's variants
./spack/bin/spack spec discotec@main  # shows DisCoTec's dependency tree and which parts are already found

git clone git@github.com:SGpp/DisCoTec.git
cd DisCoTec
../spack/bin/spack dev-build -b install discotec@main

This will first build all dependencies, and then build DisCoTec inside the cloned folder. The executables are placed in the respective example and test folders.

Installation instructions: CMake

Dependencies

  • cmake (>= 3.24.2)
  • libmpich-dev (>= 3.2-6), or another MPI library
  • libboost-all-dev (>= 1.60)

Additional (optional) dependencies:

  • OpenMP
  • HDF5
  • HighFive
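
On Debian/Ubuntu-like systems, the required packages can, for instance, be installed via apt. This is only a sketch (libhdf5-dev for the optional HDF5 dependency is an assumption); adapt package names and versions to your distribution:

# required build dependencies (MPICH as the MPI implementation; another MPI library works as well)
sudo apt-get install cmake libmpich-dev libboost-all-dev
# optional, e.g. for HDF5 support
sudo apt-get install libhdf5-dev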

Complete build

git clone https://github.com/SGpp/DisCoTec.git DisCoTec
cmake -S DisCoTec -B DisCoTec/build
cmake --build DisCoTec/build -j

Advanced build

All examples and the tools can be built on their own. To build a specific example, run CMake in that example's folder. For example, run the following commands

git clone https://github.com/SGpp/DisCoTec.git DisCoTec
cd DisCoTec/examples/combi_example
cmake -S . -B build
cmake --build build -j

to build the combi_example.

To build only the DisCoTec library, run CMake with the src folder as the source folder.
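
For instance, assuming the clone from the complete build above (the build directory name build-lib is arbitrary), a library-only build could look like:

# build only the DisCoTec library from the src folder
cmake -S DisCoTec/src -B DisCoTec/build-lib
cmake --build DisCoTec/build-lib -j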

Optional CMake Options

  • DISCOTEC_TEST=**ON**|OFF - Build tests if you build the complete project.
  • DISCOTEC_BUILD_MISSING_DEPS=**ON**|OFF - First-order dependencies that are not found are built automatically (glpk is always built).
  • DISCOTEC_TIMING=**ON**|OFF - Enables internal timing.
  • DISCOTEC_USE_HDF5=**ON**|OFF - Enables HDF5 support.
  • DISCOTEC_USE_HIGHFIVE=**ON**|OFF - Enables HDF5 support via HighFive. If DISCOTEC_USE_HIGHFIVE=ON, DISCOTEC_USE_HDF5 must also be ON.
  • DISCOTEC_UNIFORMDECOMPOSITION=**ON**|OFF - Enables the uniform decomposition of the grid.
  • DISCOTEC_GENE=ON|**OFF** - Currently, GENE is not supported with CMake!
  • DISCOTEC_OPENMP=ON|**OFF** - Enables OpenMP support.
  • DISCOTEC_ENABLEFT=ON|**OFF** - Enables the use of the FT library.
  • DISCOTEC_USE_LTO=**ON**|OFF - Enables link time optimization if the compiler supports it.
  • DISCOTEC_OMITREADYSIGNAL=ON|**OFF** - Omit the ready signal in the MPI communication. This can be used to reduce the communication overhead.
  • DISCOTEC_USENONBLOCKINGMPICOLLECTIVE=ON|**OFF** - Flag is currently unused.
  • DISCOTEC_WITH_SELALIB=ON|**OFF** - Looks for the SeLaLib dependencies and compiles the matching example.

To run the compiled tests, go to the tests folder and run

mpiexec -np 9 ./test_distributedcombigrid_boost

where you can use all the parameters of the Boost test framework. If timings matter, consider the pinning described in the respective section below. Alternatively, you can run the tests with ctest in the build folder.
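
For example, assuming the build directory from the complete build above (a sketch; the available tests depend on your build configuration):

cd DisCoTec/build
ctest --output-on-failure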

Executing DisCoTec Binaries

DisCoTec executables are configured through ctparam files, which are parsed on startup. The ctparam file contains the combination technique parameters (dimension, minimum and maximum level) as well as parallelization parameters (number and size of process groups, parallelism within process groups) in .ini file format.

Any DisCoTec executable must be run through MPI (either mpirun or mpiexec); if no argument is passed to the binary, it uses the file called ctparam in the current working directory. Make sure that it exists and describes the parameters you want to run.
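
For illustration (combi_example and my_ctparam are placeholders; the total rank count must match the number and size of process groups, plus a manager rank if one is used):

mpiexec -np 9 ./combi_example              # uses ./ctparam from the current working directory
mpiexec -np 9 ./combi_example my_ctparam   # uses the given parameter file instead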

The exact format and naming in ctparam is not (yet) standardized, to allow adaptation for different solver applications. Please refer to existing parameter files.
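
Purely for illustration, a ctparam file might look roughly like the following; the section and key names here are assumptions that differ between examples, so always start from the parameter file shipped with the example you want to run:

cat > ctparam << 'EOF'
[ct]
dim = 2
lmin = 3 3
lmax = 10 10
p = 1 4
[manager]
ngroup = 2
nprocs = 4
EOF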

Rank and Thread Pinning

The number and size of process groups (in MPI ranks) can be read from the ctparam files:

NGROUP=$(grep ngroup ctparam | awk -F"=" '{print $2}')
NPROCS=$(grep nprocs ctparam | awk -F"=" '{print $2}')
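
Their product gives the total number of MPI ranks used in the launch commands below (NTOTAL is only an illustrative name; add one rank if a dedicated manager rank is configured):

NTOTAL=$(($NGROUP*$NPROCS))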

Correct pinning is of utmost importance to performance, especially if DisCoTec was compiled with OpenMP support. The desired pinning is the simplest one imaginable: each node and, within a node, each socket is filled consecutively with MPI ranks. If OpenMP is used, the MPI pinning is strided by the number of threads, and the threads should be placed close to their MPI rank. For all compilers, this means that if the desired number of OpenMP threads per rank is N, one needs

export OMP_NUM_THREADS=$N; export OMP_PLACES=cores; export OMP_PROC_BIND=close

If OpenMP is used for compilation but should not be used for execution, N should be set to 1 to avoid unintended effects.

Intel-MPI

Intel-MPI requires some environment variables, in particular for OpenMP:

mpiexec -n $(($NGROUP*$NPROCS)) -genv I_MPI_PIN_PROCESSOR_LIST="allcores" -genv I_MPI_PIN_DOMAIN=omp $DISCOTEC_EXECUTABLE

In SLURM, it is important to set --ntasks-per-node to the desired number of tasks per node ($CORES_PER_NODE/$OMP_NUM_THREADS). Validate the pinning with verbose output: export I_MPI_DEBUG=4
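
As a hypothetical sketch (128 cores per node and 4 OpenMP threads per rank would give 32 tasks per node; my_jobscript.sh is a placeholder):

CORES_PER_NODE=128
export OMP_NUM_THREADS=4
export I_MPI_DEBUG=4   # verbose pinning output for validation
sbatch --ntasks-per-node=$(($CORES_PER_NODE/$OMP_NUM_THREADS)) my_jobscript.sh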

OpenMPI

OpenMPI uses command line arguments; depending on the exact call, the pinning may clash with SLURM settings.

mpiexec.openmpi --rank-by core --map-by node:PE=${OMP_NUM_THREADS} -n $(($NGROUP*$NPROCS)) $DISCOTEC_EXECUTABLE

Validate with verbose output: --report-bindings

MPT

With MPT, one uses the omplace wrapper tool to set the correct pinning.

mpirun -n $(($NGROUP*$NPROCS)) omplace -v -nt ${OMP_NUM_THREADS} $DISCOTEC_EXECUTABLE

Validate with very verbose output: -vv.

GENE submodules as dependencies for GENE examples

Warning: The CMake Integration is currently not adapted to use GENE!

There are GENE versions as submodules: a linear one in the gene_mgr folder, and a nonlinear one in gene-non-linear. To get them, you need access to the respective repositories at MPCDF. Then go into the folder and run

git submodule init
git submodule update

or use the --recursive flag when cloning the repo.
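
That is, as an alternative to running the two submodule commands above, one can clone recursively in the first place (this still requires access to the GENE repositories at MPCDF):

git clone --recursive https://github.com/SGpp/DisCoTec.git DisCoTec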

Further Readmes