
DDP: NCCL " The server socket has failed to bind to..." #13264


MY SOLUTION:

I think I've distilled it to two simple parts. The default values all turned out to be fine, and no special environment-variable settings proved necessary (e.g., I unset all the NCCL flags I'd tried earlier). Two things:

  1. Run the script with `torch.distributed.run` as instructed before; the default values were fine, e.g.:
     `python -m torch.distributed.run --nnodes=1 --nproc_per_node=8 ./myscript.py --arg1=thing ...etc`
  2. Near the top of my Trainer init code (and before the Jukebox stuff), initialize the process group (with `import torch.distributed as dist`); see the sketch after this list:
     `dist.init_process_group(backend="nccl")`

No other parts were essential, and I could set the Trainer strategy to "ddp", "fsdp", or nothing at all; it made no difference.

  3. ALT…
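
For concreteness, here's a minimal runnable sketch of how the two parts fit together. The `ToyModule` model, the random dataset, and the specific Trainer arguments are hypothetical illustrations, not from the original post; the only essential piece is the `init_process_group` call before the Trainer is built.

```python
import torch
import torch.distributed as dist
import pytorch_lightning as pl


class ToyModule(pl.LightningModule):
    """Tiny stand-in model; any LightningModule works the same way."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


def main():
    # Step 2 from above: initialize NCCL before any Trainer/model setup.
    # torch.distributed.run has already set RANK, LOCAL_RANK, WORLD_SIZE,
    # MASTER_ADDR, and MASTER_PORT in the environment, so no arguments
    # are needed here.
    if not dist.is_initialized():
        dist.init_process_group(backend="nccl")

    # Hypothetical random data, just to make the sketch self-contained.
    dataset = torch.utils.data.TensorDataset(
        torch.randn(256, 32), torch.randn(256, 1)
    )
    loader = torch.utils.data.DataLoader(dataset, batch_size=32)

    model = ToyModule()
    trainer = pl.Trainer(accelerator="gpu", devices=8, strategy="ddp",
                         max_epochs=1)
    trainer.fit(model, loader)


if __name__ == "__main__":
    main()
```

Launched with the same command as in step 1, e.g. `python -m torch.distributed.run --nnodes=1 --nproc_per_node=8 ./myscript.py`.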
