Skip to content

CSCfi/pytorch-ddp-examples

Repository files navigation

PyTorch multi-GPU and multi-node examples for CSC's supercomputers

PyTorch distributed and in particular DistributedDataParallel (DDP), offers a nice way of running multi-GPU and multi-node PyTorch jobs. Unfortunately, the PyTorch documentation has been a bit lacking in this area, and examples found online can often be out-of-date.

To make usage of DDP on CSC's Supercomputers easier, we have created a set of examples on how to run simple DDP jobs on the cluster. Included are also examples with other frameworks, such as PyTorch Lightning and DeepSpeed.

All examples train a simple CNN on MNIST. Scripts have been provided for the Puhti supercomputer, but can be used on other systems with minor modifications.

For larger examples, see also our Machine learning benchmarks repository.

Finally, you might also be interested in CSC's machine learning guide and in particular the section on Multi-GPU and multi-node machinelearing.

Multi-GPU, single-node

The simplest case is using all four GPUs on a single node on Puhti.

sbatch run-ddp-gpu4.sh

Multi-GPU, multi-node

Example using two nodes, four GPUs on each giving a total of 8 GPUs (again, on Puhti):

sbatch run-ddp-gpu8.sh

PyTorch Lightning examples

Multi-GPU and multi-node jobs are even easier with PyTorch Lightning. The official PyTorch Lightning now has a relatively good Slurm documentation, although it has to be modified a bit for Puhti.

Four GPUs on single node on Puhti:

sbatch run-lightning-gpu4.sh

Two nodes, 8 GPUs in total on Puhti:

sbatch run-lightning-gpu8.sh

DeepSpeed examples

DeepSpeed should work on Puhti and Mahti with the PyTorch module (from version 1.10 onwards).

Single-node with four GPUs (Puhti):

sbatch run-deepspeed-gpu4.sh

Here we are using Slurm to launch a single process which uses DeepSpeed's launcher to launch four processes (one for each GPU).

Two nodes, 8 GPUs in total (Puhti):

sbatch run-deepspeed-gpu8.sh

Note that we are using Slurm's srun to launch four processess on each node (one per GPU), and instead of DeepSpeed's launcher we are relying on MPI to provide it the information it needs to communicate between all the processes.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published