PyTorch CUDA OOM in distributed training #1482
This is very deterministic: when I restart, I get exactly the same crash at exactly the same point, also on other nodes.
Potentially related:
I get the same problem with the Gloo backend as well, i.e. also a CUDA OOM, although it then crashes in a different way, with an abort.
In this case, as you see, all the workers crash in the same way.
I realized this is using
One workaround is using the newly introduced

But why does this work? What does NCCL/Gloo do differently when the param is on GPU? This is a GeForce GTX 1080, so there is no NVLink. I was therefore assuming it would internally move the param to CPU anyway, do the allreduce on CPU, and then copy it back to GPU. But probably not? Maybe it copies all params to CPU, sends them over the network to all workers, then copies each worker's copy of the param to GPU, so it holds num_workers copies of the params in GPU memory, and then does the reduce (AVG or SUM) on GPU? That might explain it. But I was assuming that the
Note, the 1080 has 10.9GB of memory, and the parameters alone take only 615.9MB. The
I also asked in the forums: https://discuss.pytorch.org/t/cuda-oom-in-distributed-training-without-nvlink/194704
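To make the hypothesis above concrete, here is a minimal sketch of the "stage the reduction on CPU" idea: parameter copies are moved to CPU, all-reduced there (e.g. via the Gloo backend), and copied back, so no extra per-worker reduction buffers need to live on the GPU. This is not what NCCL/Gloo actually does internally and not the project's actual fix; the function name `average_params_on_cpu` is made up for illustration.

```python
import torch
import torch.distributed as dist

def average_params_on_cpu(model, world_size):
    """Illustrative workaround sketch: allreduce parameter copies on CPU
    so that no additional reduction buffers are allocated on the GPU."""
    for p in model.parameters():
        cpu_buf = p.data.detach().to("cpu")            # stage the param on CPU
        dist.all_reduce(cpu_buf, op=dist.ReduceOp.SUM)  # CPU allreduce (Gloo backend)
        cpu_buf /= world_size                           # average across workers
        p.data.copy_(cpu_buf)                           # copy the result back to GPU
```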
Note that

RuntimeError: CUDA error: out of memory

is not the usual OutOfMemoryError exception (which also provides some stats on reserved memory etc.); this one comes from torch distributed and unfortunately lacks further stats.

It's a bit strange, because looking at the training log before the OOM, it uses around 7.4GB (allocated, so a bit more reserved), and from the initial log, all the device memory seems to be available.
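Since the distributed error carries no allocator statistics, one way to get them is to log them manually around the suspect step, e.g. with `torch.cuda.memory_allocated` / `torch.cuda.memory_reserved`. A small helper like this (the name `log_cuda_memory` is just an example) is what I mean:

```python
import torch

def log_cuda_memory(tag, device=0):
    """Print allocated/reserved/total CUDA memory in MB; useful because the
    distributed 'CUDA error: out of memory' gives no allocator stats itself."""
    alloc = torch.cuda.memory_allocated(device) / 1024**2
    reserved = torch.cuda.memory_reserved(device) / 1024**2
    total = torch.cuda.get_device_properties(device).total_memory / 1024**2
    print(f"[{tag}] allocated={alloc:.1f}MB reserved={reserved:.1f}MB total={total:.1f}MB")
```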