RuntimeError: CUDA error: unspecified launch failure #1496

Open
albertz opened this issue Jan 17, 2024 · 2 comments

albertz commented Jan 17, 2024

RETURNN starting up, version 1.20240117.113304+git.54097989, date/time 2024-01-17-23-15-11 (UTC+0000), pid 1130069, cwd /work/asr4/zeyer/setups-data/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.wmezXtjsvAck/work, Python /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11
RETURNN command line options: ['/u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.wmezXtjsvAck/output/returnn.config']
Hostname: cn-284
Installed native_signal_handler.so.
PyTorch: 2.1.0+cu121 (7bcf7da3a268b435777fe87c7794c382f444e86d) (<site-package> in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch)
Torch: Hostname cn-284, pid 1130069, using GPU 0.
CUDA_VISIBLE_DEVICES is set to '0,4,6,7'.
Available CUDA devices:
  1/4: cuda:0
       name: NVIDIA GeForce RTX 2080 Ti
       total_memory: 10.8GB
       capability: 7.5
       device_index: 0
  2/4: cuda:1
       name: NVIDIA GeForce RTX 2080 Ti
       total_memory: 10.8GB
       capability: 7.5
       device_index: 4
  3/4: cuda:2
       name: NVIDIA GeForce RTX 2080 Ti
       total_memory: 10.8GB
       capability: 7.5
       device_index: 6
  4/4: cuda:3
       name: NVIDIA GeForce RTX 2080 Ti
       total_memory: 10.8GB
       capability: 7.5
       device_index: 7
...
ep 2 train, step 587, ctc_4 nan, ctc_8 nan, ce nan, fer 0.972, num_seqs 12, max_size:time 186640, max_size:out-spatial 45, mem_usage:cuda:3 7.5GB, 0.462 sec/step
ep 2 train, step 588, ctc_4 nan, ctc_8 nan, ce nan, fer 0.974, num_seqs 13, max_size:time 182960, max_size:out-spatial 48, mem_usage:cuda:3 7.5GB, 0.668 sec/step
ep 2 train, step 588, ctc_4 nan, ctc_8 nan, ce nan, fer 0.973, num_seqs 11, max_size:time 204601, max_size:out-spatial 46, mem_usage:cuda:0 7.5GB, 0.769 sec/step
ep 2 train, step 588, ctc_4 nan, ctc_8 nan, ce nan, fer 0.974, num_seqs 10, max_size:time 230825, max_size:out-spatial 45, mem_usage:cuda:2 7.4GB, 0.810 sec/step
ep 2 train, step 588, ctc_4 nan, ctc_8 nan, ce nan, fer 0.973, num_seqs 11, max_size:time 201609, max_size:out-spatial 47, mem_usage:cuda:1 7.5GB, 0.856 sec/step
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Unhandled exception <class 'RuntimeError'> in thread <_MainThread(MainThread, started 140497631055872)>, proc 1130071.

...
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/data/extern_data.py", line 55, in raw_dict_to_extern_data
    line: data.raw_tensor = raw_tensor.to(device)
    locals:
      data = <local> Tensor{'data', [B?,T|'time'[B?],F|F'audio'(1)]}
      data.raw_tensor = <local> None
      raw_tensor = <local> tensor[10, 228361, 1] n=2283610 (8.7Mb) x∈[-1.008, 1.023] μ=-0.001 σ=0.088
      raw_tensor.to = <local> <built-in method to of Tensor object at 0x7fc6cd1dce30>
      device = <local> 'cuda:2', len = 6
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

(And then it hangs. Edit: the hang is unrelated; see separate issue #1497.)
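As a side note, the debugging hint from the error message can be applied as in the following minimal sketch (not RETURNN-specific; the shape, device, and data are only taken from the log above as an example). With CUDA_LAUNCH_BLOCKING=1, kernel launches run synchronously, so the reported stack trace should point at the call that actually failed:

    # Minimal sketch, assuming a plain PyTorch 2.1 setup, not the RETURNN entry point.
    # CUDA_LAUNCH_BLOCKING must be set before the CUDA context is created,
    # i.e. before any CUDA call (setting it in the shell environment also works).
    import os
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

    import torch

    # Shape and device roughly as in the traceback above (example data only).
    raw_tensor = torch.randn(10, 228361, 1)
    raw_tensor = raw_tensor.to("cuda:2")  # with blocking launches, a kernel error surfaces here

Equivalently from the shell: CUDA_LAUNCH_BLOCKING=1 python3 rnn.py returnn.config (the exact command line here is just an example).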

albertz commented Jan 18, 2024

One question is also why it hangs at exit. Edit: moved that to a separate issue, #1497.

albertz commented Jan 18, 2024

RuntimeError: CUDA error: unspecified launch failure

Maybe related:
pytorch/pytorch#74235
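
To narrow down whether this is specific to RETURNN, one could repeat just the host-to-device copy from the failing line (raw_tensor.to(device)) in a standalone loop. A sketch, with the device index and tensor sizes only taken from the log above as examples:

    # Standalone sketch: hammer the same kind of host->device copy that failed above.
    # If "unspecified launch failure" also shows up here, it points more towards the
    # driver, the hardware, or the multi-GPU setup than towards RETURNN itself.
    import torch

    device = "cuda:2"  # example device, as in the traceback above
    for i in range(1000):
        x = torch.randn(10, 228361, 1, pin_memory=True)
        y = x.to(device, non_blocking=True)
        torch.cuda.synchronize(device)  # force any async error to surface per iteration
        if i % 100 == 0:
            print("iteration", i, "ok")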
