
GT Bench

This benchmark evaluates the performance of GridTools, an open source C++-embedded DSL for weather and climate codes. The benchmark implements an advection-diffusion solver using the finite difference method. The benchmark uses operator splitting for temporal integration typical of weather and climate codes, whereby the two horizontal dimensions use an explicit discretization, and the third, vertical, dimension is implicit.
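
Schematically, and assuming the standard scalar form of the equation (a sketch; the actual coefficients and discretization follow the GridTools stencils in the source), the solver integrates

\[
  \frac{\partial \phi}{\partial t} + \mathbf{v}\cdot\nabla\phi = D\,\nabla^{2}\phi ,
\]

where the horizontal derivatives (x, y) are discretized explicitly and the vertical derivative (z) implicitly within each time step, as described above.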

Dependencies

GT Bench depends on the GridTools library and, optionally, on the GHEX library. These libraries are installed automatically when building GT Bench with CMake unless specified otherwise.

Further external dependencies are listed below.

Required:

  • CMake (minimum version 3.18.1)
  • Boost (minimum version 1.73.0)
  • MPI (for example OpenMPI)

Optional (depending on the selected backend, runtime, and transport layer; see the sections below):

  • CUDA or HIP (for the gpu backend)
  • UCX, PMIx, and xpmem (for the GHEX transport layer)

Building

Once the external dependencies are installed, CMake can be used to configure the build as follows:

$ cd /PATH/TO/GTBENCH-SOURCE
$ mkdir build && cd build
$ cmake ..
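
After configuration, the project can be compiled with CMake's generic build driver (invoking the generated build system, e.g. make, directly works as well):

$ cmake --build . --parallel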

Selecting the GridTools Backend

The GridTools backend specifies which hardware architecture to target. The backend can be selected by setting the GTBENCH_BACKEND option when configuring with CMake:

$ cmake -DGTBENCH_BACKEND=<BACKEND> ..

where <BACKEND> must be either cpu_kfirst, cpu_ifirst, or gpu. The cpu_kfirst and cpu_ifirst backends are two different CPU backends of GridTools. On modern CPUs with a large vector width and/or many cores, the cpu_ifirst backend might perform significantly better. On CPUs with a small vector width (or no vectorization) and limited parallelism, the cpu_kfirst backend might perform better. The gpu backend currently supports running on NVIDIA CUDA-capable GPUs and AMD HIP-capable GPUs.
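
For example, to target a modern multi-core CPU (using only the documented GTBENCH_BACKEND option; all other settings keep their defaults):

$ cmake -DGTBENCH_BACKEND=cpu_ifirst ..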

Selecting the GPU Compilation Framework

Note This section is only relevant for GPU targets.

For CUDA-capable GPUs, two compilation modes are available: compilation of the GPU code with the NVIDIA NVCC compiler, or alternatively with Clang. If the C++ compiler is set to Clang, Clang's CUDA support is preferred, but this can be overridden by setting the CMake variable GT_CLANG_CUDA_MODE to NVCC-CUDA.

For AMD GPUs, the CMake C++ compiler has to be set to HIPCC (using the environment variable CXX, as usual with CMake).
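
As an illustration (the compiler names are placeholders for your toolchain; only CXX, GTBENCH_BACKEND, and GT_CLANG_CUDA_MODE are taken from the options documented above):

$ CXX=clang++ cmake -DGTBENCH_BACKEND=gpu -DGT_CLANG_CUDA_MODE=NVCC-CUDA ..   # Clang host compiler, GPU code compiled by NVCC
$ CXX=hipcc cmake -DGTBENCH_BACKEND=gpu ..                                    # AMD GPUs via HIPCC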

Selecting the Runtime

The benchmark implementation provides several runtimes implementing different scheduling and communication strategies. A runtime can be selected using the CMake variable GTBENCH_RUNTIME:

$ cmake -DGTBENCH_RUNTIME=<RUNTIME> ..

where <RUNTIME> must be one of ghex_comm, gcl, simple_mpi, or single_node.

  • The single_node option is useful for performing "single-node" tests to understand kernel performance.
  • The simple_mpi implementation uses simple two-sided MPI communication for halo exchanges.
  • The gcl implementation uses an optimized MPI-based communication library shipped with GridTools.
  • The ghex_comm option uses highly optimized distributed communication via the GHEX library, designed for best performance at scale. Additionally, this option enables a multi-threaded version of the benchmark, where a rank may own more than one sub-domain (over-subscription), which are delegated to separate threads. Note: the GridTools computations use OpenMP threads on the CPU targets, which are not affected by this parameter.
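
A typical configuration combines a backend with a runtime, for example (values taken from the lists above; adjust them to the target machine):

$ cmake -DGTBENCH_BACKEND=gpu -DGTBENCH_RUNTIME=ghex_comm ..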

Selecting the Transport Layer for GHEX

If the ghex_comm runtime has been selected, the underlying transport layer will be either UCX or MPI. The behaviour can be chosen by defining the appropriate CMake variables, see below.

To enable UCX support, additionally pass the following flags:

        -DGHEX_USE_UCP=ON \
        -DUCP_INCLUDE_DIR=/PATH/TO/UCX-INSTALLATION/include \
        -DUCP_LIBRARY=/PATH/TO/UCX-INSTALLATION/lib/libucp.so

To enable PMIx support, additionally define:

        -DGHEX_USE_PMIx=ON \
        -DPMIX_INCLUDE_DIR=/PATH/TO/PMIX-INSTALLATION/include \
        -DPMIX_LIBRARY=/PATH/TO/PMIX-INSTALLATION/lib/libpmix.so

To enable xpmem support, additionally pass the following flags:

        -DGHEX_USE_XPMEM=ON \
        -DXPMEM_INCLUDE_DIR=/PATH/TO/XPMEM-INSTALLATION/include \
        -DXPMEM_LIBRARY=/PATH/TO/XPMEM-INSTALLATION/lib/libxpmem.so
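
Putting it together, a full configure line for the ghex_comm runtime with UCX transport might look as follows (the installation paths are placeholders to be replaced with the actual locations):

$ cmake -DGTBENCH_BACKEND=gpu \
        -DGTBENCH_RUNTIME=ghex_comm \
        -DGHEX_USE_UCP=ON \
        -DUCP_INCLUDE_DIR=/PATH/TO/UCX-INSTALLATION/include \
        -DUCP_LIBRARY=/PATH/TO/UCX-INSTALLATION/lib/libucp.so \
        ..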

Running the Benchmark

Benchmark

The benchmark executable requires the domain size as a command line parameter. The simulation will then be performed on a total domain size of NX×NY×NZ grid points. To launch the benchmark use the appropriate MPI launcher (mpirun, mpiexec, srun, or similar):

$ mpi_launcher <LAUNCHER_OPTIONS> ./benchmark --domain-size <NX> <NY> <NZ>

Example output of a single-node benchmark run:

Running GTBENCH
Domain size:             100x100x60
Floating-point type:     float
GridTools backend:       cuda
Runtime:                 single_node
Median time:             0.198082s (95% confidence: 0.19754s - 0.200368s)
Columns per second:      50484.1 (95% confidence: 49908.1 - 50622.6)

For testing, the number of runs (and thus the run time) can be reduced as follows:

$ mpi_launcher <LAUNCHER_OPTIONS> ./benchmark --domain-size <NX> <NY> <NZ> --runs <RUNS>

For example, run only once:

$ mpi_launcher ./benchmark --domain-size 24000 24000 60 --runs 1
Running GTBENCH
Domain size:             24000x24000x60
Floating-point type:     float
GridTools backend:       cuda
Runtime:                 ghex_comm
Median time:             8.97857s
Columns per second:      6.41528e+07

Note that no confidence intervals are given in this case.

Note that there may be additional runtime-dependent command line options. Use ./benchmark --help to list all available options.

Convergence Tests

To make sure that the solver converges to the analytical solution of the advection-diffusion equation, we provide convergence tests. They may be helpful for verifying correctness after changes to the computation or runtime code, or after changes to the compiler optimization level. To run them, use:

$ mpi_launcher <LAUNCHER_OPTIONS> ./convergence_tests

The convergence tests can be run on 1, 2 or 4 MPI ranks.
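
For example, assuming mpirun as the launcher (any of the launchers mentioned above works):

$ mpirun -np 4 ./convergence_tests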

Convergence orders should be close to the following theoretical numbers:

Computation             Spatial Order   Temporal Order
Horizontal Diffusion    6               1
Vertical Diffusion      2               1
Full Diffusion          2               1
Horizontal Advection    5               1
Vertical Advection      2               1
Runge-Kutta Advection   2               1

Note that the measured convergence orders of some tests may not exactly match the theoretical order. This is due either to limited numerical precision or to a suboptimal range of tested spatial or temporal resolutions for a specific discretization. Deviations are expected to be larger when compiled with single-precision than with double-precision floating-point numbers.
