ncclSystemError: Cannot assign requested address #1466

Open
albertz opened this issue Nov 27, 2023 · 7 comments

albertz commented Nov 27, 2023

In PyTorch distributed training, I get:

  File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/torch/engine.py", line 198, in Engine.init_train_from_config
    line: self._ddp_pt_model = self._torch_distributed_class(
              self._pt_model, device_ids=get_device_ids(), **self._torch_distributed_options
          )
  File "/rwthfs/rz/cluster/work/az668407/py-envs/py3.10-torch2.1/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 795, in DistributedDataParallel.__init__
    line: _verify_param_shape_across_processes(self.process_group, parameters)
  File "/rwthfs/rz/cluster/work/az668407/py-envs/py3.10-torch2.1/lib/python3.10/site-packages/torch/distributed/utils.py", line 265, in _verify_param_shape_across_processes 
    line: return dist._verify_params_across_processes(process_group, tensors, logger)
...
DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1 
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
Last error:
socketStartConnect: Connect to fe80::ba59:9f03:fc:765c%ib0<57829> failed : Cannot assign requested address

Maybe related to that:

Originally posted by @albertz in rwth-i6/i6_core#459 (comment)
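
As a side note on the error itself: the failing connect goes to a link-local IPv6 address on ib0. A common thing to try (a minimal sketch, not a verified fix; that bond0 is the right interface here is only an assumption based on the NCCL logs further down) is to pin NCCL to a specific interface via NCCL_SOCKET_IFNAME before the process group is created:

import os
import torch.distributed as dist

# Sketch of a possible workaround: restrict NCCL's bootstrap/socket traffic to one interface.
# "bond0" is an assumption here; adjust it to whatever interface actually carries the traffic.
os.environ.setdefault("NCCL_SOCKET_IFNAME", "bond0")
os.environ.setdefault("NCCL_DEBUG", "INFO")  # more verbose NCCL logging, as used further below

# Assumes torchrun / torch.distributed.run already set MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE.
dist.init_process_group(backend="nccl")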

albertz commented Nov 27, 2023

Now slightly different:

DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
Last error:
socketStartConnect: Connect to fe80::ba59:9f03:fc:765c%ib0<37295> failed : Software caused connection abort

albertz commented Nov 28, 2023

With NCCL_DEBUG=INFO, some more debug info:

Filtering for rank 0 only:

nd20-01:228985:228985 [0] NCCL INFO Bootstrap : Using bond0:134.61.201.231<0> 
nd20-01:228985:228985 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
nd20-01:228985:228985 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
nd20-01:228985:228985 [0] NCCL INFO cudaDriverVersion 12000
nd20-01:228985:255989 [0] NCCL INFO NET/IB : Using [0]mlx5_bond_0:1/RoCE [RO]; OOB bond0:134.61.201.231<0> 
nd20-01:228985:255989 [0] NCCL INFO Using network IB
nd20-01:228985:255989 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffff0000,00ffffff 
nd20-01:228985:255989 [0] NCCL INFO NVLS multicast support is not available on dev 0
nd20-01:228985:255989 [0] NCCL INFO Channel 00/12 :    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15 
nd20-01:228985:255989 [0] NCCL INFO Channel 01/12 :    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
nd20-01:228985:255989 [0] NCCL INFO Channel 02/12 :    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15 
nd20-01:228985:255989 [0] NCCL INFO Channel 03/12 :    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
nd20-01:228985:255989 [0] NCCL INFO Channel 04/12 :    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15 
nd20-01:228985:255989 [0] NCCL INFO Channel 05/12 :    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
nd20-01:228985:255989 [0] NCCL INFO Channel 06/12 :    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15 
nd20-01:228985:255989 [0] NCCL INFO Channel 07/12 :    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
nd20-01:228985:255989 [0] NCCL INFO Channel 08/12 :    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15 
nd20-01:228985:255989 [0] NCCL INFO Channel 09/12 :    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
nd20-01:228985:255989 [0] NCCL INFO Channel 10/12 :    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15 
nd20-01:228985:255989 [0] NCCL INFO Channel 11/12 :    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
nd20-01:228985:255989 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1
nd20-01:228985:255989 [0] NCCL INFO P2P Chunksize set to 524288
nd20-01:228985:255989 [0] NCCL INFO Channel 00/0 : 0[34000] -> 1[36000] via P2P/IPC
nd20-01:228985:255989 [0] NCCL INFO Channel 01/0 : 0[34000] -> 1[36000] via P2P/IPC
nd20-01:228985:255989 [0] NCCL INFO Channel 02/0 : 0[34000] -> 1[36000] via P2P/IPC
nd20-01:228985:255989 [0] NCCL INFO Channel 03/0 : 0[34000] -> 1[36000] via P2P/IPC
nd20-01:228985:255989 [0] NCCL INFO Channel 04/0 : 0[34000] -> 1[36000] via P2P/IPC
nd20-01:228985:255989 [0] NCCL INFO Channel 05/0 : 0[34000] -> 1[36000] via P2P/IPC
nd20-01:228985:255989 [0] NCCL INFO Channel 06/0 : 0[34000] -> 1[36000] via P2P/IPC 
nd20-01:228985:255989 [0] NCCL INFO Channel 07/0 : 0[34000] -> 1[36000] via P2P/IPC 
nd20-01:228985:255989 [0] NCCL INFO Channel 08/0 : 0[34000] -> 1[36000] via P2P/IPC
nd20-01:228985:255989 [0] NCCL INFO Channel 09/0 : 0[34000] -> 1[36000] via P2P/IPC 
nd20-01:228985:255989 [0] NCCL INFO Channel 10/0 : 0[34000] -> 1[36000] via P2P/IPC
nd20-01:228985:255989 [0] NCCL INFO Channel 11/0 : 0[34000] -> 1[36000] via P2P/IPC
nd20-01:228985:255989 [0] NCCL INFO Connected all rings 
nd20-01:228985:255989 [0] NCCL INFO Connected all trees
nd20-01:228985:255989 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
nd20-01:228985:255989 [0] NCCL INFO 12 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
nd20-01:228985:255989 [0] NCCL INFO comm 0x14466150 rank 0 nranks 16 cudaDev 0 busId 34000 commId 0x5b84c6aa4e36daa2 - Init COMPLETE

Last debug info of all ranks:

nd20-01:228993:256419 [8] NCCL INFO comm 0x6606dfe0 rank 8 nranks 16 cudaDev 8 busId b7000 commId 0x5b84c6aa4e36daa2 - Init COMPLETE
nd20-01:228985:255989 [0] NCCL INFO comm 0x14466150 rank 0 nranks 16 cudaDev 0 busId 34000 commId 0x5b84c6aa4e36daa2 - Init COMPLETE
nd20-01:228991:256393 [6] NCCL INFO comm 0x660707b0 rank 6 nranks 16 cudaDev 6 busId 5c000 commId 0x5b84c6aa4e36daa2 - Init COMPLETE
nd20-01:228999:256371 [14] NCCL INFO comm 0x6609c220 rank 14 nranks 16 cudaDev 14 busId e5000 commId 0x5b84c6aa4e36daa2 - Init COMPLETE
nd20-01:228997:256672 [12] NCCL INFO comm 0x6606fdf0 rank 12 nranks 16 cudaDev 12 busId e0000 commId 0x5b84c6aa4e36daa2 - Init COMPLETE
nd20-01:228989:256385 [4] NCCL INFO comm 0x6606d980 rank 4 nranks 16 cudaDev 4 busId 57000 commId 0x5b84c6aa4e36daa2 - Init COMPLETE
nd20-01:228996:256375 [11] NCCL INFO comm 0x6606ee50 rank 11 nranks 16 cudaDev 11 busId be000 commId 0x5b84c6aa4e36daa2 - Init COMPLETE
nd20-01:228988:256388 [3] NCCL INFO comm 0x6606ff20 rank 3 nranks 16 cudaDev 3 busId 3b000 commId 0x5b84c6aa4e36daa2 - Init COMPLETE
nd20-01:228998:256233 [13] NCCL INFO comm 0x660720b0 rank 13 nranks 16 cudaDev 13 busId e2000 commId 0x5b84c6aa4e36daa2 - Init COMPLETE
nd20-01:228990:256329 [5] NCCL INFO comm 0x660717c0 rank 5 nranks 16 cudaDev 5 busId 59000 commId 0x5b84c6aa4e36daa2 - Init COMPLETE
nd20-01:228994:256369 [9] NCCL INFO comm 0x66073000 rank 9 nranks 16 cudaDev 9 busId b9000 commId 0x5b84c6aa4e36daa2 - Init COMPLETE
nd20-01:228986:256612 [1] NCCL INFO comm 0x6606a7e0 rank 1 nranks 16 cudaDev 1 busId 36000 commId 0x5b84c6aa4e36daa2 - Init COMPLETE
nd20-01:229000:256326 [15] NCCL INFO comm 0x660715b0 rank 15 nranks 16 cudaDev 15 busId e7000 commId 0x5b84c6aa4e36daa2 - Init COMPLETE
nd20-01:228992:256389 [7] NCCL INFO comm 0x66075ee0 rank 7 nranks 16 cudaDev 7 busId 5e000 commId 0x5b84c6aa4e36daa2 - Init COMPLETE
nd20-01:228987:258116 [2] NCCL INFO comm 0x6606e810 rank 2 nranks 16 cudaDev 2 busId 39000 commId 0x5b84c6aa4e36daa2 - Init COMPLETE
nd20-01:228995:256381 [10] NCCL INFO comm 0x66075980 rank 10 nranks 16 cudaDev 10 busId bc000 commId 0x5b84c6aa4e36daa2 - Init COMPLETE

albertz commented Nov 28, 2023

Note that there was a small change w.r.t. the DDP module wrapping; see commit 90c7548, #1451. I'm not sure whether this affects the error reported here, since the error occurs much earlier, in Engine.init_train_from_config / DistributedDataParallel.__init__.

albertz commented Nov 28, 2023

Now, after a restart (the ITC also restarted the nodes), I'm not sure whether I still see the same error. I first had an error in my setup because it used Torch AMP with bfloat16, which the V100 GPU does not support. After fixing that, I hit another bug with a CUDA device mixup, now also fixed (200b7a4). Now I get a GPU OOM error:
OutOfMemoryError: CUDA out of memory. Tried to allocate 52.00 MiB. GPU 0 has a total capacty of 31.74 GiB of which 39.62 MiB is free. Process 101553 has 1.19 GiB memory in use. Process 101552 has 1.19 GiB memory in use. Process 101563 has 1.19 GiB memory in use. Process 101549 has 1.19 GiB memory in use. Process 101555 has 1.19 GiB memory in use. Process 101558 has 1.19 GiB memory in use. Process 101557 has 1.19 GiB memory in use. Process 101562 has 1.19 GiB memory in use. Process 101556 has 1.19 GiB memory in use. Process 101560 has 1.19 GiB memory in use. Process 101554 has 1.19 GiB memory in use. Process 101550 has 1.19 GiB memory in use. Process 101561 has 1.19 GiB memory in use. Process 101559 has 1.19 GiB memory in use. Process 101551 has 1.19 GiB memory in use. Including non-PyTorch memory, this process has 13.82 GiB memory in use. Of the allocated memory 11.54 GiB is allocated by PyTorch, and 677.26 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

The processes listed there correspond to all the other ranks' processes, as you can see here:

nd20-01:101560:130074 [12] NCCL INFO comm 0x66a0bb80 rank 12 nranks 16 cudaDev 12 busId e0000 commId 0x66b9ae567fdb3d69 - Init COMPLETE
nd20-01:101552:130510 [4] NCCL INFO comm 0x66a0da90 rank 4 nranks 16 cudaDev 4 busId 57000 commId 0x66b9ae567fdb3d69 - Init COMPLETE
nd20-01:101556:132204 [8] NCCL INFO comm 0x66a112d0 rank 8 nranks 16 cudaDev 8 busId b7000 commId 0x66b9ae567fdb3d69 - Init COMPLETE
nd20-01:101562:130451 [14] NCCL INFO comm 0x66a125d0 rank 14 nranks 16 cudaDev 14 busId e5000 commId 0x66b9ae567fdb3d69 - Init COMPLETE
nd20-01:101548:128664 [0] NCCL INFO comm 0x1553b500 rank 0 nranks 16 cudaDev 0 busId 34000 commId 0x66b9ae567fdb3d69 - Init COMPLETE
nd20-01:101558:130725 [10] NCCL INFO comm 0x66a2d8f0 rank 10 nranks 16 cudaDev 10 busId bc000 commId 0x66b9ae567fdb3d69 - Init COMPLETE
nd20-01:101554:132196 [6] NCCL INFO comm 0x66a0dfa0 rank 6 nranks 16 cudaDev 6 busId 5c000 commId 0x66b9ae567fdb3d69 - Init COMPLETE
nd20-01:101550:130068 [2] NCCL INFO comm 0x66a0f950 rank 2 nranks 16 cudaDev 2 busId 39000 commId 0x66b9ae567fdb3d69 - Init COMPLETE
nd20-01:101551:130072 [3] NCCL INFO comm 0x66a13ce0 rank 3 nranks 16 cudaDev 3 busId 3b000 commId 0x66b9ae567fdb3d69 - Init COMPLETE
nd20-01:101559:130070 [11] NCCL INFO comm 0x66a0aad0 rank 11 nranks 16 cudaDev 11 busId be000 commId 0x66b9ae567fdb3d69 - Init COMPLETE
nd20-01:101563:130080 [15] NCCL INFO comm 0x66a16140 rank 15 nranks 16 cudaDev 15 busId e7000 commId 0x66b9ae567fdb3d69 - Init COMPLETE
nd20-01:101555:130073 [7] NCCL INFO comm 0x66a12270 rank 7 nranks 16 cudaDev 7 busId 5e000 commId 0x66b9ae567fdb3d69 - Init COMPLETE
nd20-01:101553:132201 [5] NCCL INFO comm 0x66a08610 rank 5 nranks 16 cudaDev 5 busId 59000 commId 0x66b9ae567fdb3d69 - Init COMPLETE
nd20-01:101561:132172 [13] NCCL INFO comm 0x66a09420 rank 13 nranks 16 cudaDev 13 busId e2000 commId 0x66b9ae567fdb3d69 - Init COMPLETE
nd20-01:101549:130082 [1] NCCL INFO comm 0x66a0e910 rank 1 nranks 16 cudaDev 1 busId 36000 commId 0x66b9ae567fdb3d69 - Init COMPLETE
nd20-01:101557:130069 [9] NCCL INFO comm 0x66a14580 rank 9 nranks 16 cudaDev 9 busId b9000 commId 0x66b9ae567fdb3d69 - Init COMPLETE

I wonder a bit about this: it means every process reserved some memory on every other GPU, which seems suboptimal.
Edit: I also posted this separately here: #1469

But this is probably unrelated to the original issue here.
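
Still, for reference: the usual pattern to bind each DDP process to its own GPU (a rough sketch assuming one process per GPU launched via torchrun; not the exact RETURNN code, and not necessarily what explains the per-process memory on every GPU above) looks like this:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

dist.init_process_group(backend="nccl")  # torchrun provides the rendezvous via env vars

local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun for each worker process
torch.cuda.set_device(local_rank)           # bind this process to its own GPU before any CUDA/NCCL work

model = torch.nn.Linear(10, 10).to(local_rank)  # placeholder model, just for the sketch
ddp_model = DistributedDataParallel(model, device_ids=[local_rank])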

And additionally:

...
  File "/home/az668407/work/py-envs/py3.10-torch2.1/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    line: Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
              tensors,
              grad_tensors_,
              retain_graph,
              create_graph,
              inputs,
              allow_unreachable=True,
              accumulate_grad=True,
          )  # Calls into the C++ engine to run the backward pass
    locals:
      Variable = <global> <class 'torch.autograd.variable.Variable'>
      Variable._execution_engine = <global> <torch._C._EngineBase object at 0x14be0f944050>
      Variable._execution_engine.run_backward = <global> <built-in method run_backward of torch._C._EngineBase object at 0x14be0f944050>
RuntimeError: GET was unable to find an engine to execute this computation

This might just be a follow-up error of the OOM, but I'm not sure.

albertz commented Nov 29, 2023

Via #1470, a similar error (Cannot assign requested address), but also a bit different:

nd20-01:275975:275975 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory                                                                                                                           
nd20-01:275975:275975 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation                                                
nd20-01:275975:275975 [0] misc/socket.cc:379 NCCL WARN Call to bind failed : Cannot assign requested address                                   
nd20-01:275975:275975 [0] NCCL INFO bootstrap.cc:176 -> 2                                                                                      
nd20-01:275975:275975 [0] NCCL INFO bootstrap.cc:201 -> 2
Traceback (most recent call last):
  File "/home/az668407/setups/combined/2021-05-31/tools/playground/torch-distributed-demo.py", line 53, in <module>                            
    ddp_model = DistributedDataParallel(model, device_ids=[local_rank])
  File "/home/az668407/work/py-envs/py3.10-torch2.1/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 795, in __init__      
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/home/az668407/work/py-envs/py3.10-torch2.1/lib/python3.10/site-packages/torch/distributed/utils.py", line 265, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)                                                                
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1251, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.                                            
Last error:
Call to bind failed : Cannot assign requested address
Traceback (most recent call last):
  File "/home/az668407/setups/combined/2021-05-31/tools/playground/torch-distributed-demo.py", line 53, in <module>                            
    ddp_model = DistributedDataParallel(model, device_ids=[local_rank])
  File "/home/az668407/work/py-envs/py3.10-torch2.1/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 795, in __init__      
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/home/az668407/work/py-envs/py3.10-torch2.1/lib/python3.10/site-packages/torch/distributed/utils.py", line 265, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer. This may indicate a possible application crash on rank 0 or a network set up issue.
[2023-11-29 18:07:07,192] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 275975) of binary: /home/az668407/work/py-envs/py3.10-torch2.1/bin/python3.10

This is actually with the demo from here (also here), but the same error also happens with RETURNN.
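
For reference, such a demo boils down to a minimal DDP training loop, roughly like this (a sketch, not the exact torch-distributed-demo.py):

import os
import torch
import torch.distributed as dist
from torch import nn, optim
from torch.nn.parallel import DistributedDataParallel

dist.init_process_group(backend="nccl")  # torchrun provides MASTER_ADDR/PORT, RANK, WORLD_SIZE via env vars
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = nn.Linear(32, 32).to(local_rank)
ddp_model = DistributedDataParallel(model, device_ids=[local_rank])  # the failing NCCL init happens here
optimizer = optim.SGD(ddp_model.parameters(), lr=0.1)

for step in range(4):
    optimizer.zero_grad()
    loss = ddp_model(torch.randn(8, 32, device=local_rank)).sum()
    loss.backward()  # gradients are all-reduced via NCCL here
    optimizer.step()
    print(f"[{dist.get_rank()}] step {step}")

Run via python -m torch.distributed.run --standalone --nnodes 1 --nproc-per-node=2 <script>, as in the output below.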

This also seems non-deterministic. After a restart of the job, even just running torchrun again sometimes fixes it. When this demo runs without error, it looks like this:

/home/az668407/work/py-envs/py3.10-torch2.1/bin/python3.10 -m torch.distributed.run --standalone --nnodes 1 --nproc-per-node=2 ~/setups/combined/2021-05-31/tools/playground/torch-distributed-demo.py
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.                                                      
Start running torch distributed training on local rank 1/2.                                                                                    
*** start {                                                                                                                                    
Start running torch distributed training on local rank 0/2.                                                                                    
*** start -- 128201 }
*** after model init {
|    1   N/A  N/A    128201      C   ...0-torch2.1/bin/python3.10      752MiB |
*** after model init -- 128201 } 
nd20-01:128200:128200 [0] NCCL INFO Bootstrap : Using bond0:134.61.201.231<0>
nd20-01:128200:128200 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
nd20-01:128200:128200 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
nd20-01:128200:128200 [0] NCCL INFO cudaDriverVersion 12000
NCCL version 2.18.1+cuda12.1
nd20-01:128201:128201 [1] NCCL INFO cudaDriverVersion 12000
nd20-01:128201:128201 [1] NCCL INFO Bootstrap : Using bond0:134.61.201.231<0>
nd20-01:128201:128201 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
nd20-01:128201:128201 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
nd20-01:128200:128783 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_bond_0:1/RoCE [RO]; OOB bond0:134.61.201.231<0>
nd20-01:128200:128783 [0] NCCL INFO Using network IB
nd20-01:128201:128785 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_bond_0:1/RoCE [RO]; OOB bond0:134.61.201.231<0>
nd20-01:128201:128785 [1] NCCL INFO Using network IB
nd20-01:128200:128783 [0] NCCL INFO Channel 00/12 :    0   1
nd20-01:128200:128783 [0] NCCL INFO Channel 01/12 :    0   1
nd20-01:128200:128783 [0] NCCL INFO Channel 02/12 :    0   1
nd20-01:128200:128783 [0] NCCL INFO Channel 03/12 :    0   1
nd20-01:128200:128783 [0] NCCL INFO Channel 04/12 :    0   1
nd20-01:128200:128783 [0] NCCL INFO Channel 05/12 :    0   1
nd20-01:128200:128783 [0] NCCL INFO Channel 06/12 :    0   1
nd20-01:128200:128783 [0] NCCL INFO Channel 07/12 :    0   1
nd20-01:128200:128783 [0] NCCL INFO Channel 08/12 :    0   1
nd20-01:128200:128783 [0] NCCL INFO Channel 09/12 :    0   1
nd20-01:128200:128783 [0] NCCL INFO Channel 10/12 :    0   1
nd20-01:128200:128783 [0] NCCL INFO Channel 11/12 :    0   1
nd20-01:128200:128783 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1 [4] -1/-1/-1->0->1 [5] -1/-1/-1->0->1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1
nd20-01:128200:128783 [0] NCCL INFO P2P Chunksize set to 524288
nd20-01:128201:128785 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1 [4] 0/-1/-1->1->-1 [5] 0/-1/-1->1->-1 [6] -1/-1/-1->1->0 [7] -1/-1/-1->1->0 [8] -1/-1/-1->1->0 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1
nd20-01:128201:128785 [1] NCCL INFO P2P Chunksize set to 524288
nd20-01:128201:128785 [1] NCCL INFO Channel 00/0 : 1[e0000] -> 0[be000] via P2P/IPC                                                             
nd20-01:128200:128783 [0] NCCL INFO Channel 00/0 : 0[be000] -> 1[e0000] via P2P/IPC                                                             
nd20-01:128201:128785 [1] NCCL INFO Channel 01/0 : 1[e0000] -> 0[be000] via P2P/IPC
nd20-01:128200:128783 [0] NCCL INFO Channel 01/0 : 0[be000] -> 1[e0000] via P2P/IPC
nd20-01:128201:128785 [1] NCCL INFO Channel 02/0 : 1[e0000] -> 0[be000] via P2P/IPC
nd20-01:128200:128783 [0] NCCL INFO Channel 02/0 : 0[be000] -> 1[e0000] via P2P/IPC
nd20-01:128201:128785 [1] NCCL INFO Channel 03/0 : 1[e0000] -> 0[be000] via P2P/IPC
nd20-01:128200:128783 [0] NCCL INFO Channel 03/0 : 0[be000] -> 1[e0000] via P2P/IPC
nd20-01:128201:128785 [1] NCCL INFO Channel 04/0 : 1[e0000] -> 0[be000] via P2P/IPC
nd20-01:128200:128783 [0] NCCL INFO Channel 04/0 : 0[be000] -> 1[e0000] via P2P/IPC
nd20-01:128201:128785 [1] NCCL INFO Channel 05/0 : 1[e0000] -> 0[be000] via P2P/IPC
nd20-01:128200:128783 [0] NCCL INFO Channel 05/0 : 0[be000] -> 1[e0000] via P2P/IPC
nd20-01:128201:128785 [1] NCCL INFO Channel 06/0 : 1[e0000] -> 0[be000] via P2P/IPC
nd20-01:128200:128783 [0] NCCL INFO Channel 06/0 : 0[be000] -> 1[e0000] via P2P/IPC
nd20-01:128201:128785 [1] NCCL INFO Channel 07/0 : 1[e0000] -> 0[be000] via P2P/IPC
nd20-01:128200:128783 [0] NCCL INFO Channel 07/0 : 0[be000] -> 1[e0000] via P2P/IPC
nd20-01:128201:128785 [1] NCCL INFO Channel 08/0 : 1[e0000] -> 0[be000] via P2P/IPC
nd20-01:128200:128783 [0] NCCL INFO Channel 08/0 : 0[be000] -> 1[e0000] via P2P/IPC
nd20-01:128201:128785 [1] NCCL INFO Channel 09/0 : 1[e0000] -> 0[be000] via P2P/IPC
nd20-01:128200:128783 [0] NCCL INFO Channel 09/0 : 0[be000] -> 1[e0000] via P2P/IPC
nd20-01:128201:128785 [1] NCCL INFO Channel 10/0 : 1[e0000] -> 0[be000] via P2P/IPC
nd20-01:128200:128783 [0] NCCL INFO Channel 10/0 : 0[be000] -> 1[e0000] via P2P/IPC
nd20-01:128201:128785 [1] NCCL INFO Channel 11/0 : 1[e0000] -> 0[be000] via P2P/IPC
nd20-01:128200:128783 [0] NCCL INFO Channel 11/0 : 0[be000] -> 1[e0000] via P2P/IPC
nd20-01:128200:128783 [0] NCCL INFO Connected all rings
nd20-01:128200:128783 [0] NCCL INFO Connected all trees
nd20-01:128200:128783 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
nd20-01:128200:128783 [0] NCCL INFO 12 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
nd20-01:128201:128785 [1] NCCL INFO Connected all rings
nd20-01:128201:128785 [1] NCCL INFO Connected all trees
nd20-01:128201:128785 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
nd20-01:128201:128785 [1] NCCL INFO 12 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
nd20-01:128201:128785 [1] NCCL INFO comm 0x39976e40 rank 1 nranks 2 cudaDev 1 busId e0000 commId 0x5c8dc52fed942058 - Init COMPLETE
nd20-01:128200:128783 [0] NCCL INFO comm 0x39977620 rank 0 nranks 2 cudaDev 0 busId be000 commId 0x5c8dc52fed942058 - Init COMPLETE
*** after DDP wrapping {
|    1   N/A  N/A    128201      C   ...0-torch2.1/bin/python3.10     1114MiB |
*** after DDP wrapping -- 128201 }
*** after optimizer init {
|    1   N/A  N/A    128201      C   ...0-torch2.1/bin/python3.10     1114MiB |
*** after optimizer init -- 128201 }
[0] step 0 
[1] step 0 
*** step 0 {  
|    1   N/A  N/A    128201      C   ...0-torch2.1/bin/python3.10     1298MiB |
*** step 0 -- 128201 }
[0] step 1                                                                                                                         
[1] step 1              
*** step 1 {                                                                   
[0] step 2                        
|    1   N/A  N/A    128201      C   ...0-torch2.1/bin/python3.10     1298MiB |
*** step 1 -- 128201 }                                                         
[1] step 2                          
*** step 2 {
[0] step 3 
|    1   N/A  N/A    128201      C   ...0-torch2.1/bin/python3.10     1298MiB |
*** step 2 -- 128201 }                                                         
[1] step 3            
*** step 3 {
|    1   N/A  N/A    128201      C   ...0-torch2.1/bin/python3.10     1298MiB |
*** step 3 -- 128201 }

It's really strange. Sometimes it also hangs after this output:

NCCL version 2.18.1+cuda12.1
nd20-01:139029:139029 [1] NCCL INFO cudaDriverVersion 12000
nd20-01:139029:139029 [1] NCCL INFO Bootstrap : Using ib0:fe80::ba59:9f03:fc:765c%ib0<0>
nd20-01:139029:139029 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
nd20-01:139029:139029 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
nd20-01:139028:139543 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_bond_0:1/RoCE [RO]; OOB ib0:fe80::ba59:9f03:fc:765c%ib0<0>
nd20-01:139028:139543 [0] NCCL INFO Using network IB
nd20-01:139029:139551 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_bond_0:1/RoCE [RO]; OOB ib0:fe80::ba59:9f03:fc:765c%ib0<0>
nd20-01:139029:139551 [1] NCCL INFO Using network IB

In dmesg, I see output like this, which might be related:

[  +1.663969] "filter_IN_public_DROP: "IN=bond0 OUT= MAC=b8:59:9f:ff:f6:96:00:de:fb:1d:51:c2:08:00 SRC=134.61.24.84 DST=134.61.201.231 LEN=40 TOS=0x00 PREC=0x00 TTL=59 ID=41524 DF PROTO=TCP SPT=3128 DPT=50344 WINDOW=0 RES=0x00 RST URGP=0
[Nov29 18:38] LNetError: 121181:0:(lib-move.c:2245:lnet_handle_find_routed_path()) no route to 134.61.192.190@tcp16 from <?>
[  +0.000930] LNetError: 121181:0:(lib-move.c:2245:lnet_handle_find_routed_path()) Skipped 2343 previous similar messages
[  +0.000615] LNetError: 121181:0:(lib-move.c:3943:lnet_handle_recovery_reply()) peer NI (134.61.192.190@tcp16) recovery failed with -113
[  +0.000488] LNetError: 121181:0:(lib-move.c:3943:lnet_handle_recovery_reply()) Skipped 2343 previous similar messages
[  +5.132969] IPv6: ADDRCONF(NETDEV_UP): ib0: link is not ready
[  +0.002149] IPv6: ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready 
[  +4.507905] "filter_IN_public_DROP: "IN=bond0 OUT= MAC=b8:59:9f:ff:f6:96:00:de:fb:1d:51:c2:08:00 SRC=134.130.30.215 DST=134.61.201.231 LEN=40 TOS=0x00 PREC=0x00 TTL=58 ID=51705 DF PROTO=TCP SPT=3128 DPT=40178 WINDOW=0 RES=0x00 RST URGP=0  
[  +0.211213] "filter_IN_public_DROP: "IN=bond0 OUT= MAC=b8:59:9f:ff:f6:96:00:de:fb:1d:51:c2:08:00 SRC=134.130.30.215 DST=134.61.201.231 LEN=40 TOS=0x00 PREC=0x00 TTL=58 ID=51788 DF PROTO=TCP SPT=3128 DPT=40178 WINDOW=0 RES=0x00 RST URGP=0  
[  +0.216192] "filter_IN_public_DROP: "IN=bond0 OUT= MAC=b8:59:9f:ff:f6:96:00:de:fb:1d:51:c2:08:00 SRC=134.130.30.215 DST=134.61.201.231 LEN=40 TOS=0x00 PREC=0x00 TTL=58 ID=51996 DF PROTO=TCP SPT=3128 DPT=40178 WINDOW=0 RES=0x00 RST URGP=0  
[  +0.423833] "filter_IN_public_DROP: "IN=bond0 OUT= MAC=b8:59:9f:ff:f6:96:00:de:fb:1d:51:c2:08:00 SRC=134.130.30.215 DST=134.61.201.231 LEN=40 TOS=0x00 PREC=0x00 TTL=58 ID=52399 DF PROTO=TCP SPT=3128 DPT=40178 WINDOW=0 RES=0x00 RST URGP=0  
[  +0.896069] "filter_IN_public_DROP: "IN=bond0 OUT= MAC=b8:59:9f:ff:f6:96:00:de:fb:1d:51:c2:08:00 SRC=134.130.30.215 DST=134.61.201.231 LEN=40 TOS=0x00 PREC=0x00 TTL=58 ID=52963 DF PROTO=TCP SPT=3128 DPT=40178 WINDOW=0 RES=0x00 RST URGP=0  
[  +1.728049] "filter_IN_public_DROP: "IN=bond0 OUT= MAC=b8:59:9f:ff:f6:96:00:de:fb:1d:51:c2:08:00 SRC=134.130.30.215 DST=134.61.201.231 LEN=40 TOS=0x00 PREC=0x00 TTL=58 ID=53170 DF PROTO=TCP SPT=3128 DPT=40178 WINDOW=0 RES=0x00 RST URGP=0 
[Nov29 18:39] IPv6: ADDRCONF(NETDEV_UP): ib0: link is not ready
[  +0.002119] IPv6: ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready
[Nov29 18:40] IPv6: ADDRCONF(NETDEV_UP): ib0: link is not ready
[  +0.002220] IPv6: ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready
[ +46.022961] IPv6: ADDRCONF(NETDEV_UP): ib0: link is not ready
[  +0.002136] IPv6: ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready 
[Nov29 18:41] "filter_IN_public_DROP: "IN=bond0 OUT= MAC=b8:59:9f:ff:f6:96:00:de:fb:1d:51:c2:08:00 SRC=134.61.24.84 DST=134.61.201.231 LEN=40 TOS=0x00 PREC=0x00 TTL=59 ID=33125 DF PROTO=TCP SPT=3128 DPT=54742 WINDOW=0 RES=0x00 RST URGP=0  
[  +0.201080] "filter_IN_public_DROP: "IN=bond0 OUT= MAC=b8:59:9f:ff:f6:96:00:de:fb:1d:51:c2:08:00 SRC=134.61.24.84 DST=134.61.201.231 LEN=40 TOS=0x00 PREC=0x00 TTL=59 ID=33243 DF PROTO=TCP SPT=3128 DPT=54742 WINDOW=0 RES=0x00 RST URGP=0  

Also, nvidia-smi looks like this:

+-----------------------------------------------------------------------------+                                                                
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |                                                                
|-------------------------------+----------------------+----------------------+                                                                
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM3...  On   | 00000000:BE:00.0 Off |                    0 |
| N/A   52C    P0    89W / 350W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM3...  On   | 00000000:E0:00.0 Off |                    0 |
| N/A   31C    P0    81W / 350W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

The N/A entries are a bit strange?

albertz commented Nov 29, 2023

So I now think that, when this problem happens on the IB interface, it might be a (non-deterministic) hardware issue.

albertz commented Dec 1, 2023

Again on the DGX, I'm debugging what is probably the same or a similar issue.

Torch distributed initialized. Hostname nd20-01.hpc.itc.rwth-aachen.de, pid 191191, rank 0 / size 16, local rank 0 / local size 16.
RETURNN starting up, version 1.20231130.103604+git.5878e5f9, date/time 2023-12-01-10-30-00 (UTC+0100), pid 191191, cwd /rwthfs/rz/cluster/hpcwork/az668407/setups-data/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.KHDgdQNqanKz/work, Python /home/az668407/work/py-envs/py3.10-torch2.1/bin/python3.10
Torch: Hostname nd20-01.hpc.itc.rwth-aachen.de, pid 191191, using GPU 0.
MEMORY: main proc python3.10(191191) initial: rss=1.1GB pss=808.0MB uss=787.3MB shared=324.9MB
MEMORY: main proc python3.10(191191) increased RSS: rss=1.2GB pss=467.7MB uss=39.2MB shared=1.1GB
MEMORY: main proc python3.10(191191) increased RSS: rss=1.3GB pss=597.5MB uss=194.9MB shared=1.1GB
MEMORY: main proc python3.10(191191) increased RSS: rss=1.4GB pss=694.6MB uss=292.3MB shared=1.1GB
MEMORY: main proc python3.10(191191) increased RSS: rss=1.5GB pss=805.8MB uss=403.4MB shared=1.1GB
MEMORY: main proc python3.10(191191) increased RSS: rss=1.6GB pss=0.9GB uss=507.7MB shared=1.1GB
MEMORY: main proc python3.10(191191) increased RSS: rss=2.0GB pss=1.3GB uss=0.9GB shared=1.1GB
nd20-01:191191:191191 [0] NCCL INFO Bootstrap : Using ib0:fe80::ba59:9f03:fc:765c%ib0<0>
nd20-01:191191:191191 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
nd20-01:191191:191191 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
MEMORY: main proc python3.10(191191) increased RSS: rss=2.6GB pss=2.0GB uss=1.6GB shared=1.1GB
nd20-01:191191:191191 [0] NCCL INFO cudaDriverVersion 12000
nd20-01:191191:219889 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_bond_0:1/RoCE [RO]; OOB ib0:fe80::ba59:9f03:fc:765c%ib0<0>
nd20-01:191191:219889 [0] NCCL INFO Using network IB
MEMORY: main proc python3.10(191191) increased RSS: rss=2.7GB pss=2.0GB uss=1.6GB shared=1.1GB
nd20-01:191191:219889 [0] NCCL INFO misc/socket.cc:564 -> 2
nd20-01:191191:219889 [0] NCCL INFO misc/socket.cc:615 -> 2
nd20-01:191191:219889 [0] NCCL INFO bootstrap.cc:270 -> 2
nd20-01:191191:219889 [0] NCCL INFO init.cc:1303 -> 2
nd20-01:191191:219889 [0] NCCL INFO group.cc:64 -> 2 [Async thread]
nd20-01:191191:191191 [0] NCCL INFO group.cc:422 -> 2
nd20-01:191191:191191 [0] NCCL INFO group.cc:106 -> 2 

Looking at the NCCL code is interesting, for example misc/socket.cc. The first error is this line. It uses NCCLCHECK there, which produces the NCCL INFO debug output showing the line number and the non-success error code, specifically code 2. Code 2 (see here) is ncclSystemError. So socketPollConnect fails here (code). I wonder why I don't see any related warning, because in most cases where socketPollConnect fails, it should print a warning. Ok, actually not for the case where it returns ncclSystemError at the end (else if (ret != EINPROGRESS)). The SYSCHECK macro could also return ncclSystemError (see here), and so could EQCHECK, but in those cases there should again have been a warning, which I don't see. So this would mean it really reaches the else if (ret != EINPROGRESS) branch, the only case (as far as I can see) which returns ncclSystemError without printing any warning.

Edit: I reported this upstream to NCCL, suggesting that it would be nice to also get a warning/info message for this case: NVIDIA/nccl#1099
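
For context on the error message itself: "Cannot assign requested address" is the strerror text for errno EADDRNOTAVAIL, i.e. a socket bind/connect using a local address that is not (or no longer) configured on any interface. A tiny illustration (a sketch; 198.51.100.1 is a documentation address assumed not to be configured on the local machine):

import errno
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    # Binding to an address that is not assigned to any local interface fails with EADDRNOTAVAIL,
    # which the OS renders as "Cannot assign requested address" -- the same message NCCL reports.
    s.bind(("198.51.100.1", 0))
except OSError as exc:
    print(exc.errno == errno.EADDRNOTAVAIL, exc)  # e.g.: True [Errno 99] Cannot assign requested address
finally:
    s.close()

That would fit a scenario where ib0 intermittently loses its (link-local) address, which the ADDRCONF link up/down messages in dmesg above might also hint at, but that is just a guess.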
