Reframe-MPI-Stress-Tests

A collection of tests to be run using ReFrame.

Types of tests

The tests fall under four broad categories. A brief overview is given here, with more detail about the individual tests provided in the README in the corresponding directories.

MPI

A collection of MPI tests. These tests cover various types of MPI communication (point-to-point and collective) in workflows designed to closely replicate the typical user workflows seen on our HPC system. They are also written in such a way that they can serve as stress tests for the system, able to be run at large scale with many processes across many nodes, which is often atypical of standard MPI tests run on HPC systems (e.g. the OSU micro-benchmarks). Also included are tests covering specific issues we have encountered with our MPI implementation. The tests cover usage of MPI in CPU-only code as well as GPU-enabled code. All MPI tests also include a node health check at the beginning and end of each run, and log environment information through the setting of certain environment variables. A minimal sketch of such a test is shown below.
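As an illustration only, a ReFrame test of this kind might look like the following sketch. The test name, source file, task counts, and the PASSED output string are all hypothetical placeholders; see the tests in the mpi/ directory for the real implementations.

import reframe as rfm
import reframe.utility.sanity as sn

@rfm.simple_test
class Pt2PtSketch(rfm.RegressionTest):
    # Hypothetical values; the real tests read these from their .yaml config
    valid_systems = ['setonix:work']
    valid_prog_environs = ['PrgEnv-gnu']
    build_system = 'SingleSource'
    sourcepath = 'pt2pt.c'      # hypothetical MPI point-to-point source file
    num_tasks = 256             # scale these up to stress many nodes at once
    num_tasks_per_node = 128

    @sanity_function
    def assert_passed(self):
        # Hypothetical success string printed by the MPI program
        return sn.assert_found(r'PASSED', self.stdout)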

SLURM

A collection of SLURM tests. These tests cover particular issues, problems, and strange behaviour we have seen with our SLURM installation(s), so they may not be as widely applicable on other systems as the MPI and compilation/performance tests. Nevertheless, they serve as an example of the type of tests which could be used to monitor certain aspects of a centre's SLURM installation. These tests cover resource allocation vs. request (i.e. ensuring that the accessible memory, cores, etc. a user asks for is what is actually provided to them), account billing, affinity of OpenMP threads and MPI processes, and node configuration with respect to the SLURM config.

The CPU and GPU nodes on our system have different hardware and setups, so the SLURM configuration on the two sets of nodes is not the same. Therefore, as with the MPI tests, we have tests for SLURM usage on both CPU and GPU nodes. A minimal sketch of an allocation-vs-request check is shown below.
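For instance, one could sketch an allocation-vs-request check roughly as follows. The test body, and the use of nproc to count the CPUs actually made available to a task, are assumptions for illustration, not the repo's actual implementation.

import reframe as rfm
import reframe.utility.sanity as sn

@rfm.simple_test
class CpuAllocationSketch(rfm.RunOnlyRegressionTest):
    valid_systems = ['setonix:work']
    valid_prog_environs = ['PrgEnv-gnu']
    num_tasks = 1
    num_cpus_per_task = 8       # request 8 CPUs per task from SLURM
    executable = 'nproc'        # reports how many CPUs the task can actually use

    @sanity_function
    def assert_allocation_matches_request(self):
        # The allocated CPU count should equal the requested one
        # (on SMT nodes the hardware-thread multiplier may differ; adjust accordingly)
        return sn.assert_eq(
            sn.extractsingle(r'^(\d+)$', self.stdout, 1, int),
            self.num_cpus_per_task
        )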

Compilation/Performance

A collection of tests focused on compilers and the effect of different compile-time options, flags, etc. on the performance of the compiled code. By running these tests with multiple combinations of compilers and flags, one can investigate how each choice affects performance; a sketch of a parameterised test of this kind is shown below.
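For example, a ReFrame test sweeping over optimisation flags might look like the following sketch. The STREAM source file and the output patterns are assumptions used for illustration only.

import reframe as rfm
import reframe.utility.sanity as sn

@rfm.simple_test
class FlagSweepSketch(rfm.RegressionTest):
    opt_flag = parameter(['-O2', '-O3', '-Ofast'])  # one test instance per flag
    valid_systems = ['setonix:work']
    valid_prog_environs = ['PrgEnv-gnu']
    build_system = 'SingleSource'
    sourcepath = 'stream.c'     # hypothetical benchmark kernel

    @run_before('compile')
    def set_cflags(self):
        self.build_system.cflags = [self.opt_flag]

    @sanity_function
    def assert_validates(self):
        return sn.assert_found(r'Solution Validates', self.stdout)

    @performance_function('MB/s')
    def triad_bandwidth(self):
        # Extract the Triad bandwidth figure reported by STREAM
        return sn.extractsingle(r'Triad:\s+(\S+)', self.stdout, 1, float)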

Software stack

A collection of tests which focus on the installation of software and libraries in a typical HPC software stack. In our case, we use the Spack package manager to install our packages. These tests primarily check the correct generation and functionality of the Lmod module files generated by Spack during the software installation process. Another test checks for correct concretisation of every defined abstract spec; a sketch of such a check is shown below.
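As an illustration, a concretisation check could be sketched as follows. The spec list here is hypothetical; the real test iterates over every abstract spec defined for our stack.

import reframe as rfm
import reframe.utility.sanity as sn

@rfm.simple_test
class ConcretisationSketch(rfm.RunOnlyRegressionTest):
    spec = parameter(['zlib', 'hdf5+mpi'])  # hypothetical abstract specs
    valid_systems = ['setonix:work']
    valid_prog_environs = ['PrgEnv-gnu']
    executable = 'spack'

    @run_before('run')
    def set_spec(self):
        self.executable_opts = ['spec', self.spec]

    @sanity_function
    def assert_concretised(self):
        # A successfully concretised spec is printed with concrete versions ('pkg@x.y.z')
        return sn.assert_found(r'@\d+\.\d+', self.stdout)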

How to run tests

To run the tests, one first needs a working ReFrame installation on their system; details on the options are available at https://reframe-hpc.readthedocs.io/en/v3.12.0/started.html. After a few steps required to configure the tests for your system (see the next section), the tests can be run by manually invoking ReFrame via a command of the form

reframe -C ${PATH_TO_THIS_REPO_ROOT_DIR}/setup_files/settings.py -c $PATH_TO_TEST_FILE_OR_DIRECTORY -r [-t $TAG -n $TEST_NAME --performance-report ...]
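For example, to run every test in the mpi/ directory that carries the mpi tag and print a performance report (the tag name here is illustrative; see the test files for the tags actually defined):

reframe -C ${PATH_TO_THIS_REPO_ROOT_DIR}/setup_files/settings.py -c ${PATH_TO_THIS_REPO_ROOT_DIR}/tests/mpi -r -t mpi --performance-report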

Alternatively, one can use the script located at tests/run_tests.sh. This script runs a subset of the tests based on the arguments provided. All tests run via this script have the --performance-report ReFrame command-line option set, so performance metrics are printed to the terminal in addition to being stored in performance log files. Below are examples of its usage:

# Run all tests in this repo
./tests/run_tests.sh -a

# Run all tests in a particular file (path given relative to the tests/ directory)
./tests/run_tests.sh -f mpi/mpi_checks.py

# Run all tests of a particular category/tag
./tests/run_tests.sh -t slurm

# Run a single specific test (pass the test name as defined in the test file)
./tests/run_tests.sh -n Pt2Pt

# Run several specific tests (pass the list of test names as a '|'-delimited list within double quotes "")
./tests/run_tests.sh -n "Pt2Pt|CollectiveComms"

# Run tests with extra (optional) arguments passed to Reframe (pass ',' separated string of options within double quotes "")
./tests/run_tests.sh -f slurm/slurm_gpu_checks -o "--output=DIR, --keep-stage-files, --perflogdir=DIR"

A separate script, run_spack_tests.sh, is dedicated to the spack tests included in this repo. These tests make use of modules and system software/libraries installed on the local system they are being run on, and therefore will not work straight away. They require a bit more setup, which is why a separate run script is set aside for them. Its usage is as follows:

# If all spack packages are in a single build environment (pass no arguments)
./tests/run_spack_tests.sh

# If spack packages are spread across many build environments (pass an argument specifying the list of environments)
env_list="
utils
num_libs
python
io_libs
langs
apps
devel
bench
s3_clients
astro
bio
"
./tests/run_spack_tests.sh -e "${env_list}"

# Alternate format of passing list of environments
./tests/run_spack_tests.sh -e "num_libs python apps"

# Run tests with extra (optional) arguments passed to Reframe (pass ',' separated string of options within double quotes "")
./tests/run_spack_tests.sh -o "--output=DIR, --keep-stage-files, --perflogdir=DIR"

Configuring the tests for your system

The tests have been written in a generalised way to make them as portable as possible. Nevertheless, some steps are required before running them on your system. First, the settings.py configuration file needs to be modified for the system on which the tests will run. A dummy system entry has been provided; a sketch of the shape of such an entry is shown below.
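For reference, a minimal ReFrame 3.x system entry has roughly the following shape. All names and values below are placeholders to adapt to your site, not this repo's actual settings.

site_configuration = {
    'systems': [
        {
            'name': 'mysystem',                  # placeholder system name
            'descr': 'Example system entry',
            'hostnames': [r'login\d+'],          # regex matched against the hostname
            'modules_system': 'lmod',
            'partitions': [
                {
                    'name': 'work',
                    'scheduler': 'slurm',
                    'launcher': 'srun',
                    'access': ['--partition=work'],  # extra scheduler options
                    'environs': ['PrgEnv-gnu'],
                    'max_jobs': 16,
                },
            ],
        },
    ],
    'environments': [
        {
            'name': 'PrgEnv-gnu',
            'modules': ['PrgEnv-gnu'],
            'target_systems': ['mysystem'],
        },
    ],
}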

Additionally, each test file has an associated .yaml configuration file. This file includes system configuration parameters, job options to pass to the scheduler, and test-specific parameters. With the default entries, the tests run "out of the box" on Setonix. For other systems, one will need to alter the system configuration options, such as system, prog-environ, and account. An example of the relevant part of one of the configuration files is shown below:

# System level parameters required for the Reframe tests
system-parameters: 
  system: # Value for valid_systems in Reframe test class
    - setonix:work
  prog-environ: # Value for valid_prog_environs in Reframe test class
    - PrgEnv-gnu
# Configure the environment
environment:
  # MPI-related environment settings
  mpi:
    # Any modules to load
    modules: ~
    # Any commands to execute to configure environment
    commands: ~
    # Environment variables
    env-vars: {MPICH_ENV_DISPLAY: '1', MPICH_MEMORY_REPORT: '1', MPICH_OFI_VERBOSE: '1'}
# Job options to pass to job script
job-options:
  account: pawsey0001 # Account to charge for running job

On Setonix we use the SLURM job scheduler, so if you are using a different scheduler, further modifications will be required to run the tests, owing to differing naming conventions and to the differing levels of integration of the various job schedulers within ReFrame.
