Distributed PyTorch
Distributed PyTorch is enabled when torch_distributed is set in the config.
Example RETURNN setting: just put torch_distributed = {} into the config. This will use PyTorch DistributedDataParallel.
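A minimal config fragment might look like the following. This is a sketch: the empty dict uses the defaults, and the backend = "torch" line is an assumption about the rest of the config (distributed PyTorch only makes sense with the PyTorch backend).

```python
# Minimal RETURNN config sketch (assumed context: PyTorch backend in use).
backend = "torch"

# An empty dict enables distributed training with default settings;
# RETURNN will wrap the model in torch.nn.parallel.DistributedDataParallel.
torch_distributed = {}
```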
See the PyTorch distributed overview or Getting started with distributed data parallel for details on how this works and what PyTorch, NCCL, or other relevant settings there are.
In the i6_core ReturnnTrainingJob, set horovod_num_processes (the name is confusing; it is no longer only about Horovod but also applies to other distribution frameworks) to the number of processes, and set distributed_launch_cmd = "torchrun".
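A sketch of the job setup described above. The import path and the surrounding arguments are assumptions (only horovod_num_processes and distributed_launch_cmd come from the text); treat this as a config fragment, not a complete job definition.

```python
# Hypothetical i6_core job setup sketch; only the two distribution-related
# parameters are taken from the text above, everything else is assumed context.
from i6_core.returnn import ReturnnTrainingJob  # assumed import path

train_job = ReturnnTrainingJob(
    returnn_config=returnn_config,   # assumed to be defined elsewhere
    horovod_num_processes=4,         # number of distributed processes (despite the Horovod name)
    distributed_launch_cmd="torchrun",
)
```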
We call init_process_group(backend=None), which by default enables both the Gloo backend and the NCCL backend. The Gloo backend is used for CPU tensors and the NCCL backend for GPU tensors. Presumably, when NCCL fails, it also falls back to Gloo for GPU tensors.
See the NCCL environment variables; e.g. NCCL_DEBUG=INFO can be useful for debugging.