```
RETURNN starting up, version 1.20240117.113304+git.54097989, date/time 2024-01-17-23-15-11 (UTC+0000), pid 1130069, cwd /work/asr4/zeyer/setups-data/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.wmezXtjsvAck/work, Python /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11
RETURNN command line options: ['/u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.wmezXtjsvAck/output/returnn.config']
Hostname: cn-284
Installed native_signal_handler.so.
PyTorch: 2.1.0+cu121 (7bcf7da3a268b435777fe87c7794c382f444e86d) (<site-package> in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch)
Torch: Hostname cn-284, pid 1130069, using GPU 0.
CUDA_VISIBLE_DEVICES is set to '0,4,6,7'.
Available CUDA devices:
  1/4: cuda:0 name: NVIDIA GeForce RTX 2080 Ti total_memory: 10.8GB capability: 7.5 device_index: 0
  2/4: cuda:1 name: NVIDIA GeForce RTX 2080 Ti total_memory: 10.8GB capability: 7.5 device_index: 4
  3/4: cuda:2 name: NVIDIA GeForce RTX 2080 Ti total_memory: 10.8GB capability: 7.5 device_index: 6
  4/4: cuda:3 name: NVIDIA GeForce RTX 2080 Ti total_memory: 10.8GB capability: 7.5 device_index: 7
...
```
```
ep 2 train, step 587, ctc_4 nan, ctc_8 nan, ce nan, fer 0.972, num_seqs 12, max_size:time 186640, max_size:out-spatial 45, mem_usage:cuda:3 7.5GB, 0.462 sec/step
ep 2 train, step 588, ctc_4 nan, ctc_8 nan, ce nan, fer 0.974, num_seqs 13, max_size:time 182960, max_size:out-spatial 48, mem_usage:cuda:3 7.5GB, 0.668 sec/step
ep 2 train, step 588, ctc_4 nan, ctc_8 nan, ce nan, fer 0.973, num_seqs 11, max_size:time 204601, max_size:out-spatial 46, mem_usage:cuda:0 7.5GB, 0.769 sec/step
ep 2 train, step 588, ctc_4 nan, ctc_8 nan, ce nan, fer 0.974, num_seqs 10, max_size:time 230825, max_size:out-spatial 45, mem_usage:cuda:2 7.4GB, 0.810 sec/step
ep 2 train, step 588, ctc_4 nan, ctc_8 nan, ce nan, fer 0.973, num_seqs 11, max_size:time 201609, max_size:out-spatial 47, mem_usage:cuda:1 7.5GB, 0.856 sec/step
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Unhandled exception <class 'RuntimeError'> in thread <_MainThread(MainThread, started 140497631055872)>, proc 1130071.
...
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/data/extern_data.py", line 55, in raw_dict_to_extern_data
    line: data.raw_tensor = raw_tensor.to(device)
    locals:
      data = <local> Tensor{'data', [B?,T|'time'[B?],F|F'audio'(1)]}
      data.raw_tensor = <local> None
      raw_tensor = <local> tensor[10, 228361, 1] n=2283610 (8.7Mb) x∈[-1.008, 1.023] μ=-0.001 σ=0.088
      raw_tensor.to = <local> <built-in method to of Tensor object at 0x7fc6cd1dce30>
      device = <local> 'cuda:2', len = 6
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
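As the error message itself suggests, a first debugging step would be to re-run with `CUDA_LAUNCH_BLOCKING=1`, which makes kernel launches synchronous so the stack trace points at the launch that actually failed rather than a later, unrelated API call. A minimal sketch (not RETURNN code; the `returnn/rnn.py` invocation in the comment is just an example):

```python
import os

# CUDA_LAUNCH_BLOCKING=1 forces synchronous kernel launches, so the
# Python traceback identifies the real failing kernel (much slower,
# debugging only). The variable must be set before CUDA is
# initialized, i.e. before the first CUDA call -- safest is before
# importing torch at all, or on the shell command line, e.g.:
#   CUDA_LAUNCH_BLOCKING=1 python3 returnn/rnn.py returnn.config
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# ... then import torch and start training as usual.
```

Note that `TORCH_USE_CUDA_DSA` (device-side assertions) additionally requires a PyTorch build compiled with that flag; it cannot be enabled on a stock binary at runtime.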
(And then it hangs. Edit: the hang is unrelated; see separate issue #1497.)
One question is also why it hangs at exit. Edit: moved that to a separate issue, #1497.
> RuntimeError: CUDA error: unspecified launch failure
Maybe related: pytorch/pytorch#74235