RuntimeError: CUDA error: unspecified launch failure #1496

Open
albertz opened this issue Jan 17, 2024 · 2 comments

albertz commented Jan 17, 2024

RETURNN starting up, version 1.20240117.113304+git.54097989, date/time 2024-01-17-23-15-11 (UTC+0000), pid 1130069, cwd /work/asr4/zeyer/setups-data/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.wmezXtjsvAck/work, Python /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11
RETURNN command line options: ['/u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.wmezXtjsvAck/output/returnn.config']
Hostname: cn-284
Installed native_signal_handler.so.
PyTorch: 2.1.0+cu121 (7bcf7da3a268b435777fe87c7794c382f444e86d) (<site-package> in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch)
Torch: Hostname cn-284, pid 1130069, using GPU 0.
CUDA_VISIBLE_DEVICES is set to '0,4,6,7'.
Available CUDA devices:
  1/4: cuda:0
       name: NVIDIA GeForce RTX 2080 Ti
       total_memory: 10.8GB
       capability: 7.5
       device_index: 0
  2/4: cuda:1
       name: NVIDIA GeForce RTX 2080 Ti
       total_memory: 10.8GB
       capability: 7.5
       device_index: 4
  3/4: cuda:2
       name: NVIDIA GeForce RTX 2080 Ti
       total_memory: 10.8GB
       capability: 7.5
       device_index: 6
  4/4: cuda:3
       name: NVIDIA GeForce RTX 2080 Ti
       total_memory: 10.8GB
       capability: 7.5
       device_index: 7
...
ep 2 train, step 587, ctc_4 nan, ctc_8 nan, ce nan, fer 0.972, num_seqs 12, max_size:time 186640, max_size:out-spatial 45, mem_usage:cuda:3 7.5GB, 0.462 sec/step
ep 2 train, step 588, ctc_4 nan, ctc_8 nan, ce nan, fer 0.974, num_seqs 13, max_size:time 182960, max_size:out-spatial 48, mem_usage:cuda:3 7.5GB, 0.668 sec/step
ep 2 train, step 588, ctc_4 nan, ctc_8 nan, ce nan, fer 0.973, num_seqs 11, max_size:time 204601, max_size:out-spatial 46, mem_usage:cuda:0 7.5GB, 0.769 sec/step
ep 2 train, step 588, ctc_4 nan, ctc_8 nan, ce nan, fer 0.974, num_seqs 10, max_size:time 230825, max_size:out-spatial 45, mem_usage:cuda:2 7.4GB, 0.810 sec/step
ep 2 train, step 588, ctc_4 nan, ctc_8 nan, ce nan, fer 0.973, num_seqs 11, max_size:time 201609, max_size:out-spatial 47, mem_usage:cuda:1 7.5GB, 0.856 sec/step
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Unhandled exception <class 'RuntimeError'> in thread <_MainThread(MainThread, started 140497631055872)>, proc 1130071.

...
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/data/extern_data.py", line 55, in raw_dict_to_extern_data
    line: data.raw_tensor = raw_tensor.to(device)
    locals:
      data = <local> Tensor{'data', [B?,T|'time'[B?],F|F'audio'(1)]}
      data.raw_tensor = <local> None
      raw_tensor = <local> tensor[10, 228361, 1] n=2283610 (8.7Mb) x∈[-1.008, 1.023] μ=-0.001 σ=0.088
      raw_tensor.to = <local> <built-in method to of Tensor object at 0x7fc6cd1dce30>
      device = <local> 'cuda:2', len = 6
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

(And then it hangs. Edit: the hang is unrelated; see separate issue #1497.)
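As a side note, the debugging hint from the error message can be applied as in the following minimal sketch (not RETURNN-specific; the shape, device, and data are only taken from the log above as an example). With CUDA_LAUNCH_BLOCKING=1, kernel launches run synchronously, so the reported stack trace should point at the call that actually failed:

    # Minimal sketch, assuming a plain PyTorch 2.1 setup, not the RETURNN entry point.
    # CUDA_LAUNCH_BLOCKING must be set before the CUDA context is created,
    # i.e. before any CUDA call (setting it in the shell environment also works).
    import os
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

    import torch

    # Shape and device roughly as in the traceback above (example data only).
    raw_tensor = torch.randn(10, 228361, 1)
    raw_tensor = raw_tensor.to("cuda:2")  # with blocking launches, a kernel error surfaces here

Equivalently from the shell: CUDA_LAUNCH_BLOCKING=1 python3 rnn.py returnn.config (the exact command line here is just an example).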

albertz commented Jan 18, 2024

One question is also why it hangs at exit. Edit: moved that to a separate issue, #1497.

albertz commented Jan 18, 2024

RuntimeError: CUDA error: unspecified launch failure

Maybe related:
pytorch/pytorch#74235
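
To narrow down whether this is specific to RETURNN, one could repeat just the host-to-device copy from the failing line (raw_tensor.to(device)) in a standalone loop. A sketch, with the device index and tensor sizes only taken from the log above as examples:

    # Standalone sketch: hammer the same kind of host->device copy that failed above.
    # If "unspecified launch failure" also shows up here, it points more towards the
    # driver, the hardware, or the multi-GPU setup than towards RETURNN itself.
    import torch

    device = "cuda:2"  # example device, as in the traceback above
    for i in range(1000):
        x = torch.randn(10, 228361, 1, pin_memory=True)
        y = x.to(device, non_blocking=True)
        torch.cuda.synchronize(device)  # force any async error to surface per iteration
        if i % 100 == 0:
            print("iteration", i, "ok")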
