DDP: NCCL " The server socket has failed to bind to..." #13264
-
Hi, I'm trying to use my PyTorch Lightning code in conjunction with Jukebox, which has its own set of routines for distributed training, launched via

python -m torch.distributed.run --nnodes=1 --nproc_per_node=8 --rdzv_id=31459 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1 ./my_script.py --myarg1=thing ...etc

If I only ever run on 1 GPU there's no problem, but when I try to run on more than 1 GPU via DDP, I get many errors from NCCL such as "The server socket has failed to bind to..." / "Already in use". Presumably this is because Jukebox runs its own MPI initialization in the form of

from jukebox.utils.dist_utils import setup_dist_from_mpi
...
rank, local_rank, device = setup_dist_from_mpi()

(^^ This call is inside my Trainer module, BTW, so it should run AFTER Lightning sets up the distributed environment.) In other words, Jukebox is trying to re-reserve the slots ALREADY set up/reserved by the Lightning Trainer, since the DDP spawn routine in PyTorch Lightning itself already calls init_process_group().

My question: if PyTorch Lightning is setting up the process group already, do I even need the Jukebox MPI initialization (or can I call it without the two colliding)? Because right now, if I try NOT calling their MPI initialization routine and instead just do

rank, local_rank, device = os.getenv('RANK'), os.getenv('RANK'), self.device

...then wherever the Jukebox code calls a torch.distributed operation, it errors out. But I thought Lightning was supposedly calling init_process_group() already...?

UPDATE: Before the call to the Jukebox stuff, I did check the state of the PyTorch distributed setup...
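(For reference, here's a minimal sketch of the kind of guard I'm describing. get_rank_and_device is just a hypothetical helper name, and it assumes the script was launched with torch.distributed.run so that RANK and LOCAL_RANK are set in the environment:)

```python
import os
import torch
import torch.distributed as dist

def get_rank_and_device():
    """Reuse an existing process group if one is already set up,
    instead of letting Jukebox bind a second set of sockets."""
    if dist.is_available() and dist.is_initialized():
        # Lightning / torch.distributed.run already created the group.
        rank = dist.get_rank()
        local_rank = int(os.environ.get("LOCAL_RANK", rank))
        device = torch.device("cuda", local_rank)
    else:
        # Nothing initialized yet: fall back to Jukebox's own MPI setup.
        from jukebox.utils.dist_utils import setup_dist_from_mpi
        rank, local_rank, device = setup_dist_from_mpi()
    return rank, local_rank, device
```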
-
Solved it. ...Not quite sure which of the numerous things I tried was the key fix, but I can come back and post that.
-
MY SOLUTION: I think I've distilled it to two simple parts. Default values all ended up being ok, and no special environment-variable setting proved necessary (e.g. I unset all the NCCL flags I'd tried earlier). Two things:

1. Keep launching via torch.distributed.run as before; the default values were fine.
2. Call dist.init_process_group(backend="nccl"); no other parts were essential.

And I could either have the trainer strategy set to "ddp" or "fsdp" or nothing at all; it made no difference. One could also permanently set...
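Putting the two parts together, a minimal sketch of what the top of my script ends up doing (the script name and Trainer arguments here are placeholders; the essential line is the init_process_group call):

```python
# my_script.py -- launched, as before, with:
#   python -m torch.distributed.run --nnodes=1 --nproc_per_node=8 ./my_script.py
import os
import torch
import torch.distributed as dist
import pytorch_lightning as pl

# torch.distributed.run already exports RANK / WORLD_SIZE / MASTER_ADDR /
# MASTER_PORT, so the default env:// initialization needs no extra arguments.
if not dist.is_initialized():
    dist.init_process_group(backend="nccl")

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Strategy made no difference for me: "ddp", "fsdp", or left unset all worked.
trainer = pl.Trainer(accelerator="gpu", devices=8, strategy="ddp")
```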
-
Setting another port, as in the following script, can also help: python -m torch.distributed.launch --nproc_per_node=8 --master_port=25678 train.py
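(With the torch.distributed.run invocation from the original question, the analogous change would presumably be to give the rendezvous endpoint an explicit free port, e.g. --rdzv_endpoint=127.0.0.1:25678, rather than the --master_port flag of the older launch script.)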