Distributed PyTorch

Distributed PyTorch will be enabled when torch_distributed is set in the config.

Example RETURNN setting: just put torch_distributed = {} into the config. This will use PyTorch DistributedDataParallel. See the PyTorch distributed overview or Getting started with distributed data parallel for details on how this works and what PyTorch, NCCL or other relevant settings there are.
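
A minimal config sketch for illustration (torch_distributed = {} is the setting described here; the surrounding backend and device settings are assumptions and depend on your setup):

```python
# Minimal sketch of the relevant lines in a RETURNN config (a Python file).
# torch_distributed = {} is the setting from this page; backend/device are
# assumed surrounding settings and depend on your setup.
backend = "torch"       # assumption: use the PyTorch backend of RETURNN
device = "gpu"          # assumption: train on GPU
torch_distributed = {}  # enable distributed training via DistributedDataParallel
```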

In the i6_core ReturnnTrainingJob, set horovod_num_processes (the name is confusing; it is not specific to Horovod anymore but also applies to other distribution frameworks) to the number of processes, and set distributed_launch_cmd = "torchrun".
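
A hedged sketch of such a job (only horovod_num_processes and distributed_launch_cmd are taken from this page; the import path and the remaining arguments are placeholders for a typical Sisyphus setup):

```python
# Sketch of a ReturnnTrainingJob set up for 4-way distributed training.
# returnn_config, returnn_python_exe and returnn_root are placeholders
# for objects from your own Sisyphus setup.
from i6_core.returnn.training import ReturnnTrainingJob

train_job = ReturnnTrainingJob(
    returnn_config=returnn_config,      # your ReturnnConfig object
    returnn_python_exe=returnn_python_exe,
    returnn_root=returnn_root,
    horovod_num_processes=4,            # number of distributed processes (not Horovod-specific)
    distributed_launch_cmd="torchrun",  # launch the processes via torchrun
)
```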

We call init_process_group(backend=None), which by default enables both the Gloo backend and the NCCL backend. The Gloo backend is used for CPU tensors and the NCCL backend for GPU tensors. (I think) when NCCL fails, it will also fall back to Gloo for GPU tensors.
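
In plain PyTorch terms this amounts roughly to the following (a simplified sketch, not the actual RETURNN code):

```python
# backend=None lets PyTorch register both Gloo (CPU tensors) and NCCL (GPU tensors).
# Rank, world size and master address come from the env vars that torchrun sets.
import torch.distributed as dist

dist.init_process_group(backend=None)
print(f"rank {dist.get_rank()} of {dist.get_world_size()}")
```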

See the NCCL environment variables. E.g. NCCL_DEBUG=INFO can be useful for debugging.
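
Such variables can be set in the job environment; as an alternative (an assumption for illustration, not something RETURNN does for you), they can also be set early in the config, before the process group is initialized:

```python
# Hedged example: set NCCL_DEBUG at the top of the RETURNN config, before
# distributed initialization. Setting it in the job environment works as well.
import os

os.environ.setdefault("NCCL_DEBUG", "INFO")  # verbose NCCL logging
```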