ncclSystemError: Cannot assign requested address #1466
Now slightly different:
With filtering to only rank 0:
Last debug info of all ranks:
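For reference, a minimal sketch of how such per-rank filtering of the NCCL debug output can be done (assuming a torchrun-style launcher that exports `RANK`; the actual filtering in the run above may have been different):

```python
import os

# Enable verbose NCCL logging only on rank 0, so the combined job log is not
# flooded with identical messages from every rank. Must run before the first
# NCCL communicator is created, i.e. before init_process_group("nccl").
def enable_nccl_debug_on_rank0():
    rank = int(os.environ.get("RANK", "0"))  # RANK is exported by torchrun
    if rank == 0:
        os.environ["NCCL_DEBUG"] = "INFO"
        os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"
```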
Now after a restart (the ITC also restarted the nodes), I'm not sure if I see the same error. I had another error because my setup used Torch AMP with bfloat16, which the V100 GPU does not support. After fixing that, I had another bug with a CUDA device mixup, now also fixed (200b7a4). Now I get a GPU OOM error. The procs correspond to all the other procs, as you see here:
I wonder a bit about this: it means every process reserved some memory on every other GPU, which seems suboptimal. But this is probably unrelated to the original issue here. And additionally:
This might just be a follow-up error of the OOM, but I'm not sure.
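For context, a minimal sketch of the two fixes mentioned above and of pinning each rank to its own GPU, which also avoids each process creating a CUDA context (and thus reserving memory) on the other GPUs. The helper name and the torchrun-style env vars are assumptions; this is not the actual RETURNN code:

```python
import os
import torch
import torch.distributed as dist

def setup_distributed():
    # Pin this process to its own GPU *before* init_process_group and before
    # any CUDA allocation; otherwise each rank can end up creating a context
    # (and reserving memory) on GPU 0 or even on all visible GPUs.
    local_rank = int(os.environ["LOCAL_RANK"])  # exported by torchrun
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # V100 (compute capability 7.0) has no bfloat16 support, so fall back to
    # float16 for AMP there.
    amp_dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
    return local_rank, amp_dtype
```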
Via #1470, similar (
This is actually with the demo from here (also here), but the same error also happens with RETURNN. This also seems to be non-deterministic. After a restart of the job, even running
It's really strange. Sometimes it also hangs after this output:
In
Also,
The
So I now think that, in case this problem happens on the IB (InfiniBand), it is maybe a (non-deterministic) hardware issue.
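One way to test that hypothesis is to take the IB transport out of the picture and force NCCL onto plain TCP sockets via its standard environment variables (a sketch; the interface name is an assumption and depends on the node):

```python
import os

# Must be set before the first NCCL communicator is created.
os.environ["NCCL_IB_DISABLE"] = "1"        # disable the InfiniBand transport
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"  # assumption: pick the right interface for the node
```

If the runs become stable with IB disabled (at the cost of bandwidth), that would point at the IB hardware/driver rather than at PyTorch or NCCL itself.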
Again on DGX, I'm debugging probably the same or a similar issue.
Looking at the NCCL code is interesting, for example misc/socket.cc. You see, the first error is this line. It uses
Edit: I reported this at NCCL upstream, suggesting that it would be nice to also get a warn/info for this case: NVIDIA/nccl#1099
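For reference, the error string itself comes from the OS (it is `strerror(EADDRNOTAVAIL)`), not from NCCL; a tiny sketch outside of NCCL that produces the same message (the address is just an example of one assumed not to be assigned to the local host):

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    # Binding to an address that does not belong to any local interface
    # fails with EADDRNOTAVAIL, i.e. "Cannot assign requested address".
    s.bind(("10.255.255.1", 0))  # example address, assumed not local
except OSError as e:
    print(e)  # e.g. "[Errno 99] Cannot assign requested address"
finally:
    s.close()
```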
In PyTorch distributed training, I get:
Maybe related to that:
Originally posted by @albertz in rwth-i6/i6_core#459 (comment)
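For context, a minimal sketch of the kind of setup in which the error shows up (not the actual RETURNN code; the torchrun launch and names are assumptions):

```python
# Launch with e.g.: torchrun --nnodes=2 --nproc_per_node=4 ... demo.py
import os
import torch
import torch.distributed as dist

def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    # The ncclSystemError appears during (or shortly after) this call, when
    # NCCL sets up its sockets between the ranks.
    dist.init_process_group(backend="nccl")
    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)  # simple collective, similar to the linked demo
    print(f"rank {dist.get_rank()}: {x.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```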