PyTorch CUDA OOM in distributed training #1482

Open
albertz opened this issue Dec 31, 2023 · 7 comments

albertz commented Dec 31, 2023

RETURNN starting up, version 1.20231230.164342+git.f353135e, date/time 2023-12-31-13-21-05 (UTC+0000), pid 2003528, cwd /work/asr4/zeyer/setups-data/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.lmbYlKeoU6kT/work, Python /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11
RETURNN command line options: ['/u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.lmbYlKeoU6kT/output/returnn.config']
...
Torch: Hostname cn-236, pid 2003531, using GPU 3.
CUDA_VISIBLE_DEVICES is set to '0,1,2,3'.
Available CUDA devices:
  1/4: cuda:0
       name: NVIDIA GeForce GTX 1080 Ti
       total_memory: 10.9GB
       capability: 6.1
       device_index: 0
  2/4: cuda:1
       name: NVIDIA GeForce GTX 1080 Ti
       total_memory: 10.9GB
       capability: 6.1
       device_index: 1
  3/4: cuda:2
       name: NVIDIA GeForce GTX 1080 Ti
       total_memory: 10.9GB
       capability: 6.1
       device_index: 2
  4/4: cuda:3
       name: NVIDIA GeForce GTX 1080 Ti
       total_memory: 10.9GB
       capability: 6.1
       device_index: 3
PyTorch: 2.1.0+cu121 (7bcf7da3a268b435777fe87c7794c382f444e86d) (<site-package> in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch)
...
ep 1 train, step 94, acc 0.005, loss 8.755, loss_att 9.005, loss_ctc 8.171, total 8.755, mem_usage:cuda:1 7.4GB
ep 1 train, step 94, acc 0.003, loss 8.599, loss_att 8.744, loss_ctc 8.261, total 8.599, mem_usage:cuda:0 7.4GB
ep 1 train, step 94, acc 0.003, loss 8.492, loss_att 8.639, loss_ctc 8.152, total 8.492, mem_usage:cuda:2 7.3GB
ep 1 train, step 95, acc 0.004, loss 8.657, loss_att 8.896, loss_ctc 8.100, total 8.657, mem_usage:cuda:1 7.4GB
ep 1 train, step 95, acc 0.007, loss 9.245, loss_att 9.607, loss_ctc 8.400, total 9.245, mem_usage:cuda:3 7.4GB
ep 1 train, step 95, acc 0.003, loss 8.452, loss_att 8.596, loss_ctc 8.116, total 8.452, mem_usage:cuda:0 7.4GB
ep 1 train, step 95, acc 0.004, loss 8.648, loss_att 8.824, loss_ctc 8.238, total 8.648, mem_usage:cuda:2 7.3GB
MEMORY: sub proc watch memory(2003717) increased RSS: rss=52.4MB pss=30.8MB uss=30.6MB shared=21.7MB
ep 1 train, step 96, acc 0.003, loss 8.667, loss_att 8.789, loss_ctc 8.382, total 8.667, mem_usage:cuda:3 7.4GB
ep 1 train, step 96, acc 0.001, loss 8.325, loss_att 8.352, loss_ctc 8.261, total 8.325, mem_usage:cuda:1 7.4GB
ep 1 train, step 96, acc 0.005, loss 8.874, loss_att 9.176, loss_ctc 8.168, total 8.874, mem_usage:cuda:0 7.4GB
ep 1 train, step 96, acc 0.001, loss 8.337, loss_att 8.333, loss_ctc 8.346, total 8.337, mem_usage:cuda:2 7.3GB
MEMORY: total (main 2003529, 2023-12-31, 13:25:48, 20 procs): pss=8.4GB uss=6.2GB
ep 1 train, step 97, acc 0.003, loss 8.599, loss_att 8.779, loss_ctc 8.178, total 8.599, mem_usage:cuda:0 7.4GB
ep 1 train, step 97, acc 0.003, loss 8.641, loss_att 8.805, loss_ctc 8.257, total 8.641, mem_usage:cuda:3 7.4GB
ep 1 train, step 97, acc 0.004, loss 8.589, loss_att 8.768, loss_ctc 8.170, total 8.589, mem_usage:cuda:1 7.4GB
ep 1 train, step 97, acc 0.003, loss 8.568, loss_att 8.694, loss_ctc 8.272, total 8.568, mem_usage:cuda:2 7.3GB
ep 1 train, step 98, acc 0.001, loss 8.286, loss_att 8.344, loss_ctc 8.151, total 8.286, mem_usage:cuda:0 7.4GB
ep 1 train, step 98, acc 0.005, loss 8.941, loss_att 9.186, loss_ctc 8.369, total 8.941, mem_usage:cuda:3 7.4GB
ep 1 train, step 98, acc 0.002, loss 8.423, loss_att 8.466, loss_ctc 8.324, total 8.423, mem_usage:cuda:1 7.4GB
ep 1 train, step 98, acc 0.004, loss 8.574, loss_att 8.728, loss_ctc 8.216, total 8.574, mem_usage:cuda:2 7.3GB
Unhandled exception <class 'RuntimeError'> in thread <_MainThread(MainThread, started 139662461636608)>, proc 2003528.
...
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/distributed.py", line 98, in DistributedContext.step_after_param_update
    line: _sync_params_avg(module=module)
    locals:
      _sync_params_avg = <global> <function _sync_params_avg at 0x7f04dc6ba520>
      module = <local> ESPnetASRModel(
                         (frontend): DefaultFrontend(
                           (stft): Stft(n_fft=512, win_length=512, hop_length=160, center=True, normalized=False, onesided=True)
                           (frontend): Frontend()
                           (logmel): LogMel(sr=16000, n_fft=512, n_mels=80, fmin=0, fmax=8000.0, htk=False)
                         )
                         (specaug): SpecAug(
                           (t...
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/distributed.py", line 152, in _sync_params_avg
    line: dist.all_reduce(param.data, op=reduce_op)
    locals:
      dist = <local> <module 'torch.distributed' from '/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/distributed/__init__.py'>
      dist.all_reduce = <local> <function all_reduce at 0x7f05078fcae0>
      param = <local> Parameter containing:
                      Parameter[512, 1, 3, 3] n=4608 (18Kb) x∈[-0.333, 0.333] μ=0.000 σ=0.195 grad cuda:0
      param.data = <local> tensor[512, 1, 3, 3] n=4608 (18Kb) x∈[-0.333, 0.333] μ=0.000 σ=0.195 cuda:0
      op = <not found>
      reduce_op = <local> <RedOpType.AVG: 1>
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 47, in send
    line: return func(*args, **kwargs)
    locals:
      func = <local> <function all_reduce at 0x7f05078fca40>
      args = <local> (tensor[512, 1, 3, 3] n=4608 (18Kb) x∈[-0.333, 0.333] μ=0.000 σ=0.195 cuda:0,)
      kwargs = <local> {'op': <RedOpType.AVG: 1>}
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2050, in all_reduce
    line: work = group.allreduce([tensor], opts)
    locals:
      work = <not found>
      group = <local> <torch.distributed.distributed_c10d.ProcessGroup object at 0x7f04de15dc70>
      group.allreduce = <local> <bound method PyCapsule.allreduce of <torch.distributed.distributed_c10d.ProcessGroup object at 0x7f04de15dc70>>
      tensor = <local> tensor[512, 1, 3, 3] n=4608 (18Kb) x∈[-0.333, 0.333] μ=0.000 σ=0.195 cuda:0
      opts = <local> <torch.distributed.distributed_c10d.AllreduceOptions object at 0x7f059dce46f0>
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Note that RuntimeError: CUDA error: out of memory is not the usual torch.cuda.OutOfMemoryError exception (which also provides some stats on reserved memory etc.); this one comes from torch distributed and unfortunately lacks further stats.

It's a bit strange: looking at the training log before the OOM, it uses around 7.4GB (allocated, so a bit more reserved), and from the initial log, all the device memory seems to be available?
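
To get the numbers that this distributed RuntimeError omits, one could log the allocator and driver stats right before the sync; a minimal sketch (log_cuda_mem is a hypothetical helper, not part of RETURNN):

```python
import torch


def log_cuda_mem(device: str = "cuda") -> None:
    """Print allocator- and driver-level memory numbers for one device."""
    free, total = torch.cuda.mem_get_info(device)     # free/total bytes as seen by the driver
    allocated = torch.cuda.memory_allocated(device)   # bytes currently held by tensors
    reserved = torch.cuda.memory_reserved(device)     # bytes held by the caching allocator
    print(
        f"{device}: allocated={allocated / 2**30:.2f}GiB"
        f" reserved={reserved / 2**30:.2f}GiB"
        f" free={free / 2**30:.2f}GiB total={total / 2**30:.2f}GiB"
    )
```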


albertz commented Jan 1, 2024

This is very deterministic: when I restart, I get the same crash at exactly the same step, also on other nodes.


albertz commented Jan 2, 2024

I get the same problem also with the Gloo backend, i.e. also a CUDA OOM, although then it crashes in a different way, with an abort.

...
ep 1 train, step 97, acc 0.004, loss 8.624, loss_att 8.769, loss_ctc 8.285, total 8.624, mem_usage:cuda:2 8.8GB, 0.855 sec/step
ep 1 train, step 98, acc 0.007, loss 9.071, loss_att 9.377, loss_ctc 8.356, total 9.071, mem_usage:cuda:1 8.6GB, 0.797 sec/step
ep 1 train, step 98, acc 0.004, loss 8.664, loss_att 8.846, loss_ctc 8.239, total 8.664, mem_usage:cuda:3 8.5GB, 0.801 sec/step
ep 1 train, step 98, acc 0.005, loss 8.674, loss_att 8.856, loss_ctc 8.248, total 8.674, mem_usage:cuda:0 8.9GB, 0.892 sec/step
ep 1 train, step 98, acc 0.003, loss 8.459, loss_att 8.575, loss_ctc 8.190, total 8.459, mem_usage:cuda:2 8.8GB, 0.834 sec/step
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first): 
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fda36535617 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so) 
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fda364f098d in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so) 
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fda365f09f8 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so) 
frame #3: <unknown function> + 0x1d104 (0x7fda365c0104 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x4bc384a (0x7fd9e5be384a in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x559d0a8 (0x7fd9e65bd0a8 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::ProcessGroupGloo::AsyncWork::execute(c10::intrusive_ptr<c10d::ProcessGroupGloo::AsyncWork, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupGloo::AsyncWork> >) + 0x3b (0x7fd9e65cbf8b in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #7: c10d::ProcessGroupGloo::runLoop(int) + 0xe9 (0x7fd9e65cc099 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0xdba24 (0x7fda369dda24 in /work/tools/users/zeyer/linuxbrew/lib/gcc/11/libstdc++.so.6)
frame #9: <unknown function> + 0x8523e (0x7fda6157023e in /work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6)
frame #10: <unknown function> + 0x10617c (0x7fda615f117c in /work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6)

Fatal Python error: Aborted

Thread 0x00007fda4f220640 (most recent call first):
  <no Python frame>

Thread 0x00007fd90f6ae640 (most recent call first):
  <no Python frame>

Thread 0x00007fd90cead640 (most recent call first):
  <no Python frame>

Thread 0x00007fd911eaf640 (most recent call first):
  <no Python frame>

Thread 0x00007fd9006ac640 (most recent call first):
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/threading.py", line 320 in wait
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/queues.py", line 231 in _feed
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/threading.py", line 975 in run
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/threading.py", line 1038 in _bootstrap_inner
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/threading.py", line 995 in _bootstrap

Thread 0x00007fda614ea000 (most recent call first):
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2055 in all_reduce
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 47 in wrapper
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/distributed.py", line 160 in _sync_params_avg
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/distributed.py", line 99 in step_after_param_update
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/engine.py", line 389 in train_epoch
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/engine.py", line 239 in train
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/__main__.py", line 465 in execute_main_task
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/__main__.py", line 659 in main
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/rnn.py", line 11 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._c
...
Signal handler: signal 6:
/var/tmp/zeyer/returnn_native/native_signal_handler/476dd6f1a7/native_signal_handler.so(signal_handler+0x4b)[0x7fda2a87320b]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x3cf40)[0x7fda61527f40]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x86e6f)[0x7fda61571e6f]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(raise+0x12)[0x7fda61527ea2]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x3cf40)[0x7fda61527f40]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x86e6f)[0x7fda61571e6f]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(raise+0x12)[0x7fda61527ea2]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(abort+0xc2)[0x7fda6151345c]
/work/tools/users/zeyer/linuxbrew/lib/gcc/11/libstdc++.so.6(+0xa586a)[0x7fda369a786a]
/work/tools/users/zeyer/linuxbrew/lib/gcc/11/libstdc++.so.6(+0xb107a)[0x7fda369b307a]
/work/tools/users/zeyer/linuxbrew/lib/gcc/11/libstdc++.so.6(+0xb10e5)[0x7fda369b30e5]
/work/tools/users/zeyer/linuxbrew/lib/gcc/11/libstdc++.so.6(+0xb1338)[0x7fda369b3338]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so(_ZN3c106detail14torchCheckFailEPKcS2_jRKSs+0x94)[0x7fda364f09bd]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so(_ZN3c104cuda29c10_cuda_check_implementationEiPKcS2_ib+0x118)[0x7fda365f09f8]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so(+0x1d104)[0x7fda365c0104]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so(+0x4bc384a)[0x7fd9e5be384a]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so(+0x559d0a8)[0x7fd9e65bd0a8]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so(_ZN4c10d16ProcessGroupGloo9AsyncWork7executeEN3c1013intrusive_ptrIS1_NS2_6detail34intrusive_target_default_null_typeIS1_EEEE+0x3b)[0x7fd9e65cbf8b]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so(_ZN4c10d16ProcessGroupGloo7runLoopEi+0xe9)[0x7fd9e65cc099]
/work/tools/users/zeyer/linuxbrew/lib/gcc/11/libstdc++.so.6(+0xdba24)[0x7fda369dda24]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x8523e)[0x7fda6157023e]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x10617c)[0x7fda615f117c]
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5418512617 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f54184cd98d in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f54185cd9f8 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1d104 (0x7f541859d104 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x4bc384a (0x7f53d37e384a in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x559d0a8 (0x7f53d41bd0a8 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::ProcessGroupGloo::AsyncWork::execute(c10::intrusive_ptr<c10d::ProcessGroupGloo::AsyncWork, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupGloo::AsyncWork> >) + 0x3b (0x7f53d41cbf8b in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #7: c10d::ProcessGroupGloo::runLoop(int) + 0xe9 (0x7f53d41cc099 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0xdba24 (0x7f54244b8a24 in /work/tools/users/zeyer/linuxbrew/lib/gcc/11/libstdc++.so.6)
frame #9: <unknown function> + 0x8523e (0x7f544f05123e in /work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6)
frame #10: <unknown function> + 0x10617c (0x7f544f0d217c in /work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6)

Fatal Python error: Aborted

Thread 0x00007f543add4640 (most recent call first):
  <no Python frame>

Thread 0x00007f52fd7af640 (most recent call first):
  <no Python frame>

Thread 0x00007f52f87ad640 (most recent call first):
  <no Python frame>

Thread 0x00007f52fafae640 (most recent call first):
  <no Python frame>

Thread 0x00007f52f5fac640 (most recent call first):
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/threading.py", line 320 in wait
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/queues.py", line 231 in _feed
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/threading.py", line 975 in run
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/threading.py", line 1038 in _bootstrap_inner
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/threading.py", line 995 in _bootstrap

Thread 0x00007f544efcb000 (most recent call first):
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2055 in all_reduce
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 47 in wrapper
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/distributed.py", line 160 in _sync_params_avg
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/distributed.py", line 99 in step_after_param_update
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/engine.py", line 389 in train_epoch
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/engine.py", line 239 in train
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/__main__.py", line 465 in execute_main_task
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/__main__.py", line 659 in main
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/rnn.py", line 11 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._c
...

In this case, as you see, all the workers crash in the same way.


albertz commented Jan 2, 2024

This is very deterministic: when I restart, I get the same crash at exactly the same step, also on other nodes.

I realized this is using "torch_distributed": {"reduce_type": "param", "param_sync_step": 100}, and it had not yet printed the log output for the current step, which is step 99, so this is exactly the first step where it performs the param sync.
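
For reference, a minimal sketch of what such a per-parameter average sync amounts to, mirroring the dist.all_reduce(param.data, op=ReduceOp.AVG) call from the traceback above (an illustration, not the actual RETURNN _sync_params_avg code):

```python
import torch
import torch.distributed as dist


def sync_params_avg(module: torch.nn.Module) -> None:
    """Average all parameters across workers, one blocking all_reduce per parameter."""
    with torch.no_grad():
        for param in module.parameters():
            # ReduceOp.AVG as shown in the traceback (NCCL); with Gloo one would
            # typically all_reduce with SUM and divide by the world size instead.
            dist.all_reduce(param.data, op=dist.ReduceOp.AVG)
```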


albertz commented Jan 2, 2024

One workaround is to use the newly introduced torch_distributed option sync_on_cpu=True, which first moves all params to CPU, then does the sync (which would use Gloo on CPU), and then moves them back to GPU.
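
Roughly, that workaround amounts to something like this (a sketch of the assumed shape, not the exact RETURNN implementation):

```python
import torch
import torch.distributed as dist


def sync_params_avg_on_cpu(module: torch.nn.Module) -> None:
    """Average parameters across workers via CPU tensors (Gloo), then copy back to GPU."""
    world_size = dist.get_world_size()
    with torch.no_grad():
        for param in module.parameters():
            cpu_copy = param.data.to("cpu")  # stage the parameter in host memory
            dist.all_reduce(cpu_copy)        # SUM on CPU (Gloo)
            cpu_copy /= world_size           # manual average
            param.data.copy_(cpu_copy)       # write the averaged values back to the GPU tensor
```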

But why does this work? What do NCCL/Gloo do differently when the param is on GPU? This is a GeForce GTX 1080 Ti, so there is no NVLink. So I was assuming it would internally move the tensor to CPU anyway, do the allreduce on CPU, and then copy it back to GPU. But probably not? Maybe it copies all params to CPU, sends them over the network to all workers, then copies each worker's copy of the param to GPU, so it has num_workers copies of the param in memory, and only then does the reduce (AVG or SUM) on GPU? This might explain it. But I was assuming that the all_reduce is somewhat more clever, maybe doing it hierarchically or so, i.e. not using this naive logic, which is neither the most efficient nor the most memory-friendly.
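
One way to narrow this down would be a standalone check of how much device memory a single GPU-tensor all_reduce costs under each backend; a rough sketch of such a hypothetical script (launched e.g. via torchrun --nproc_per_node=4, with BACKEND=gloo or BACKEND=nccl):

```python
import os

import torch
import torch.distributed as dist


def main() -> None:
    # torchrun sets MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE for the default env:// init.
    dist.init_process_group(backend=os.environ.get("BACKEND", "gloo"))
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())
    t = torch.randn(512, 1, 3, 3, device="cuda")  # same shape as the parameter in the traceback
    free_before, _ = torch.cuda.mem_get_info()
    dist.all_reduce(t)  # GPU tensor through the chosen backend
    torch.cuda.synchronize()
    free_after, _ = torch.cuda.mem_get_info()
    print(f"rank {rank}: free before={free_before / 2**20:.1f}MiB, after={free_after / 2**20:.1f}MiB")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```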


albertz commented Jan 2, 2024

Note, the 1080 Ti has 10.9GB of memory, and the parameters alone take only 615.9MB.

The all_reduce is in blocking mode (just the default), and we do this separately for each parameter. The biggest parameter is probably the embedding (512 x 100025), although that is not where it crashes. In any case, even if we had 4 copies of such a big parameter in memory, there would still be way more than enough memory available, so this does not really explain it.
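
Back-of-the-envelope, assuming float32 parameters:

```python
# Rough numbers, assuming float32 (4 bytes per value):
embedding_bytes = 512 * 100025 * 4      # ≈ 204.9 MB for the embedding matrix
all_params_bytes = 615.9e6              # total parameter size quoted above
print(4 * embedding_bytes / 1e6)        # ≈ 819.4 MB for 4 copies of the biggest parameter
print(4 * all_params_bytes / 1e9)       # ≈ 2.46 GB for 4 copies of all parameters, vs. ~10.9 GB per GPU
```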
