DDP: NCCL " The server socket has failed to bind to..." #13264
-
Hi, I'm trying to use my PyTorch Lightning code in conjunction with Jukebox, which has its own set of routines for distributed training, launched via

python -m torch.distributed.run --nnodes=1 --nproc_per_node=8 --rdzv_id=31459 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1 ./my_script.py --myarg1=thing ...etc

If I only ever run on 1 GPU there's no problem, but when I try to run on more than 1 GPU via DDP, I get many errors from NCCL such as "The server socket has failed to bind to..." / "Already in use". Presumably this is because Jukebox runs its own MPI initialization in the form of

from jukebox.utils.dist_utils import setup_dist_from_mpi
...
rank, local_rank, device = setup_dist_from_mpi()

(^^ This call is inside my Trainer module, BTW, so it should run AFTER Lightning sets up the distributed environment.) In other words, Jukebox is trying to re-reserve the slots ALREADY set up/reserved by the Lightning Trainer, since the DDP spawn routine in PyTorch Lightning itself already calls init_process_group().

My question: if PyTorch Lightning is setting up the process group already, do I even need the Jukebox MPI initialization (or can I call it without the two colliding)? Because right now, if I try NOT calling their MPI initialization routine and instead just do

rank, local_rank, device = os.getenv('RANK'), os.getenv('RANK'), self.device

...then wherever the Jukebox code calls a torch.distributed operation, it errors out. But I thought Lightning was supposedly calling init_process_group() already...?

UPDATE: Before the call to the Jukebox stuff, I did check the state of the PyTorch distributed setup...
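(For reference, here's a minimal sketch of the kind of guard I'm describing. get_rank_and_device is just a hypothetical helper name, and it assumes the script was launched with torch.distributed.run so that RANK and LOCAL_RANK are set in the environment:)

```python
import os
import torch
import torch.distributed as dist

def get_rank_and_device():
    """Reuse an existing process group if one is already set up,
    instead of letting Jukebox bind a second set of sockets."""
    if dist.is_available() and dist.is_initialized():
        # Lightning / torch.distributed.run already created the group.
        rank = dist.get_rank()
        local_rank = int(os.environ.get("LOCAL_RANK", rank))
        device = torch.device("cuda", local_rank)
    else:
        # Nothing initialized yet: fall back to Jukebox's own MPI setup.
        from jukebox.utils.dist_utils import setup_dist_from_mpi
        rank, local_rank, device = setup_dist_from_mpi()
    return rank, local_rank, device
```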
-
Solved it. ...Not quite sure which of the numerous things I tried was the key fix, but I can come back and post that.
-
MY SOLUTION: I think I've distilled it to two simple parts. Default values all ended up being ok, and no special environment-variable setting proved necessary (e.g. I unset all the NCCL flags I'd tried earlier). Two things:

1. Keep launching via torch.distributed.run as before; the default values were fine.
2. Call dist.init_process_group(backend="nccl"); no other parts were essential.

And I could either have the trainer strategy set to "ddp" or "fsdp" or nothing at all; it made no difference. One could also permanently set...
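Putting the two parts together, a minimal sketch of what the top of my script ends up doing (the script name and Trainer arguments here are placeholders; the essential line is the init_process_group call):

```python
# my_script.py -- launched, as before, with:
#   python -m torch.distributed.run --nnodes=1 --nproc_per_node=8 ./my_script.py
import os
import torch
import torch.distributed as dist
import pytorch_lightning as pl

# torch.distributed.run already exports RANK / WORLD_SIZE / MASTER_ADDR /
# MASTER_PORT, so the default env:// initialization needs no extra arguments.
if not dist.is_initialized():
    dist.init_process_group(backend="nccl")

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Strategy made no difference for me: "ddp", "fsdp", or left unset all worked.
trainer = pl.Trainer(accelerator="gpu", devices=8, strategy="ddp")
```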
-
Setting another port, as in the following script, can also help: python -m torch.distributed.launch --nproc_per_node=8 --master_port=25678 train.py
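(With the torch.distributed.run invocation from the original question, the analogous change would presumably be to give the rendezvous endpoint an explicit free port, e.g. --rdzv_endpoint=127.0.0.1:25678, rather than the --master_port flag of the older launch script.)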