RuntimeError: cuFFT error: CUFFT_INTERNAL_ERROR #1465

Open
albertz opened this issue Nov 27, 2023 · 3 comments

Comments

@albertz
Member

albertz commented Nov 27, 2023

After quite a while of training (597 subepochs) with the PyTorch backend, I got:

RETURNN starting up, version 1.20231119.003753+git.c230d140, date/time 2023-11-22-02-51-52 (UTC+0000), pid 2470397, cwd /work/asr4/zeyer/setups-data/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.tNymT5UR0k6i/work, Python /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11
...
PyTorch: 2.1.0+cu121 (7bcf7da3a268b435777fe87c7794c382f444e86d) (<site-package> in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch)
CUDA_VISIBLE_DEVICES is set to '2'.
Available CUDA devices:
  1/1: cuda:0
       name: NVIDIA A10
       total_memory: 22.0GB
       capability: 8.6
       device_index: 2
...
Using device: cuda ('gpu' in config)
Using gpu device 2: NVIDIA A10
Using autocast (automatic mixed precision (AMP)) with dtype torch.bfloat16
...
MEMORY: total (21 procs): pss=18.9GB uss=12.8GB
...
  File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/frontend/_backend.py", line 1828, in TorchBackend.stft 
    line: y_raw = torch.stft( 
              x_raw,
              n_fft=fft_length,
              hop_length=frame_step,
              # PyTorch anyway uses a window with size = n_fft internally.
              # So we just explicitly handle the window logic.
              win_length=fft_length,
              window=window_pt,
              center=False,
              return_complex=True,
          )
    locals:
      y_raw = <not found>
      torch = <global> <module 'torch' from '/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/__init__.py'>
      torch.stft = <global> <function stft at 0x7f0953301120>
      x_raw = <local> tensor([[ 0.0008,  0.0048,  0.0020,  ...,  0.0000,  0.0000,  0.0000],
                              [-0.0016, -0.0014, -0.0010,  ...,  0.0000,  0.0000,  0.0000],
                              [ 0.0447,  0.0520,  0.0451,  ...,  0.0000,  0.0000,  0.0000],
                              ...,
                              [ 0.0028,  0.0017, -0.0002,  ...,  0.0000,  0.0000,  0.0000],
                          ...
      n_fft = <not found>
      fft_length = <local> 512
      hop_length = <not found>
      frame_step = <local> 160
...
RuntimeError: cuFFT error: CUFFT_INTERNAL_ERROR
...
[2023-11-25 00:51:54,778] INFO: Max resources: Run time: 70:00:02 CPU: 104.4% RSS: 30.67GB VMS: 157.17GB
--------------------- Slurm Task Epilog ------------------------
Job ID: 3103297
Time: Sat Nov 25 01:51:54 AM CET 2023
Elapsed Time: 2-22:00:04
Billing per second for TRES: billing=248,cpu=4,gres/gpu=1,mem=30G,node=1
Show resource usage with e.g.:
sacct -j 3103297 -o Elapsed,TotalCPU,UserCPU,SystemCPU,MaxRSS,ReqTRES%60,MaxDiskRead,MaxDiskWrite
--------------------- Slurm Task Epilog ------------------------

I'm not sure yet if this is reproducible or was just bad luck.

RSS: 30.67GB is suspiciously close to the limit (30G), so maybe some internal cuFFT malloc failed and just surfaced as this generic error?
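As a hypothetical way to check that (the helper names here are invented, not existing RETURNN code), one could log host RSS and free GPU memory right around the stft call:

```python
# Hypothetical diagnostic helper (names are invented): log host RSS and free
# GPU memory around the stft call, to see whether the generic
# CUFFT_INTERNAL_ERROR coincides with running out of memory.
import resource

import torch


def log_memory_state(tag: str) -> None:
    # ru_maxrss is reported in KiB on Linux.
    max_rss_gb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / (1024 ** 2)
    free_b, total_b = torch.cuda.mem_get_info()
    print(
        f"{tag}: host max RSS {max_rss_gb:.2f}GB, "
        f"GPU free {free_b / 1024 ** 3:.2f}GB of {total_b / 1024 ** 3:.2f}GB"
    )


def stft_with_memory_logging(x: torch.Tensor, **stft_kwargs) -> torch.Tensor:
    log_memory_state("before stft")
    try:
        return torch.stft(x, **stft_kwargs)
    except RuntimeError:
        log_memory_state("stft failed")
        raise
```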

@albertz
Member Author

albertz commented Jan 3, 2024

I'm closing this for now, assuming a hardware issue. Reopen if there is any indication of some other problem, or something we can do about it.

albertz closed this as completed Jan 3, 2024
albertz reopened this Feb 6, 2024
@albertz
Member Author

albertz commented Feb 6, 2024

I'm getting this quite frequently now. In most cases it is in a multi-GPU training setup on Nvidia GTX 1080 GPUs (but that is probably just because this is currently my main setup, and it is otherwise unrelated).

It is very suspicious that it seems deterministic in some cases, i.e. I get the error in exactly the same step after restarting a couple of times. This is unlikely for a hardware issue and looks more like a software issue. It's also unlikely that something is wrong on the RETURNN side. Probably it's in CUDA or the GPU driver. But maybe there is still something we can do about it?

In any case, I'm reopening this now to collect some more information.

--------------------- Slurm Task Prolog ------------------------
Job ID: 4413656
Job name: i6_core.returnn.training.ReturnnTrainingJob.TwlSCwc9j5iU.run
Host: cn-264
Date: Mon Feb  5 09:39:13 AM CET 2024
User: zeyer
Slurm account: hlt
Slurm partition: gpu_11gb
Work dir: 
------------------
...
RETURNN starting up, version 1.20240204.021831+git.b3616628, date/time 2024-02-05-08-39-37 (UTC+0000), pid 157769, cwd /work/asr4/zeyer/setups-data/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.TwlSCwc9j5iU/work, Python /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11
RETURNN command line options: ['/u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.TwlSCwc9j5iU/output/returnn.config']
Hostname: cn-264
Installed native_signal_handler.so.
PyTorch: 2.1.0+cu121 (7bcf7da3a268b435777fe87c7794c382f444e86d) (<site-package> in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch)
...
  1/4: cuda:0
       name: NVIDIA GeForce GTX 1080 Ti
       total_memory: 10.9GB
       capability: 6.1
       device_index: 0
  2/4: cuda:1
       name: NVIDIA GeForce GTX 1080 Ti
       total_memory: 10.9GB
       capability: 6.1
       device_index: 1
  3/4: cuda:2
       name: NVIDIA GeForce GTX 1080 Ti
       total_memory: 10.9GB
       capability: 6.1
       device_index: 2
  4/4: cuda:3
       name: NVIDIA GeForce GTX 1080 Ti
       total_memory: 10.9GB
       capability: 6.1
       device_index: 3 
...
Load model /u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.TwlSCwc9j5iU/output/models/epoch.080.pt
  epoch 80, global train step 193041
...
ep 81 train, step 2407, total 2.449, loss_ctc 89.849, loss_att 120.335, acc 0.574, loss 111.189, num_seqs 5, max_size:time 231440, max_size:out-spatial 53, mem_usage:cuda:0 8.9GB, 1.092 sec/step
ep 81 train, step 2407, total 2.784, loss_ctc 117.422, loss_att 130.652, acc 0.546, loss 126.683, num_seqs 4, max_size:time 256609, max_size:out-spatial 48, mem_usage:cuda:1 9.0GB, 1.098 sec/step
ep 81 train, step 2407, total 2.343, loss_ctc 78.286, loss_att 109.050, acc 0.570, loss 99.821, num_seqs 5, max_size:time 227760, max_size:out-spatial 45, mem_usage:cuda:3 8.9GB, 1.098 sec/step
ep 81 train, step 2407, total 3.214, loss_ctc 118.303, loss_att 138.481, acc 0.461, loss 132.427, num_seqs 5, max_size:time 253969, max_size:out-spatial 45, mem_usage:cuda:2 8.9GB, 1.102 sec/step
ep 81 train, step 2408, total 2.151, loss_ctc 69.431, loss_att 103.593, acc 0.629, loss 93.344, num_seqs 5, max_size:time 230960, max_size:out-spatial 54, mem_usage:cuda:0 8.9GB, 1.105 sec/step
ep 81 train, step 2408, total 2.680, loss_ctc 99.864, loss_att 127.553, acc 0.617, loss 119.247, num_seqs 4, max_size:time 256081, max_size:out-spatial 59, mem_usage:cuda:1 9.0GB, 1.097 sec/step
ep 81 train, step 2408, total 2.675, loss_ctc 108.494, loss_att 132.354, acc 0.468, loss 125.196, num_seqs 5, max_size:time 230080, max_size:out-spatial 49, mem_usage:cuda:2 8.9GB, 1.096 sec/step
ep 81 train, step 2408, total 2.674, loss_ctc 100.278, loss_att 125.106, acc 0.514, loss 117.658, num_seqs 5, max_size:time 250185, max_size:out-spatial 48, mem_usage:cuda:3 8.9GB, 1.103 sec/step
RuntimeError: cuFFT error: CUFFT_INTERNAL_ERROR
...
RuntimeError: cuFFT error: CUFFT_INTERNAL_ERROR

Module call stack:
(DistributedDataParallel.forward) (unknown)
(_WrappedModuleRunStep.forward) (unknown)
(ESPnetASRModel.forward) (root)
(ESPnetASRModel.encode) (root)
(DefaultFrontend.forward) frontend
(Stft.forward) frontend.stft

Next error, different host, but same step:

--------------------- Slurm Task Prolog ------------------------
Job ID: 4414705
Job name: i6_core.returnn.training.ReturnnTrainingJob.TwlSCwc9j5iU.run
Host: cn-252
Date: Mon Feb  5 01:50:39 PM CET 2024
User: zeyer
Slurm account: hlt
Slurm partition: gpu_11gb
Work dir: 
------------------
...
[2024-02-05 12:50:52,539] INFO: Start Job: Job<alias/exp2023_04_25_rf/espnet/v6-11gb-f32-bs8k-accgrad100-mgpu4-wd1e_4-lrlin1e_5_558k-EBranchformer/train work/i6_core/returnn/training/ReturnnTrainingJob.TwlSCwc9j5iU> Task: run
...
RETURNN starting up, version 1.20240205.115240+git.4b70bd63, date/time 2024-02-05-12-51-06 (UTC+0000), pid 2448, cwd /work/asr4/zeyer/setups-data/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.TwlSCwc9j5iU/work, Python /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11
...
PyTorch: 2.1.0+cu121 (7bcf7da3a268b435777fe87c7794c382f444e86d) (<site-package> in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch)
...
Available CUDA devices:
  1/4: cuda:0 
       name: NVIDIA GeForce GTX 1080 Ti 
       total_memory: 10.9GB
       capability: 6.1 
       device_index: 0 
  2/4: cuda:1
       name: NVIDIA GeForce GTX 1080 Ti
       total_memory: 10.9GB 
       capability: 6.1 
       device_index: 1
  3/4: cuda:2
       name: NVIDIA GeForce GTX 1080 Ti
       total_memory: 10.9GB
       capability: 6.1
       device_index: 2
  4/4: cuda:3
       name: NVIDIA GeForce GTX 1080 Ti
       total_memory: 10.9GB
       capability: 6.1
       device_index: 3
...
Load model /u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.TwlSCwc9j5iU/output/models/epoch.080.pt
Load model /u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.TwlSCwc9j5iU/output/models/epoch.080.pt
Load model /u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.TwlSCwc9j5iU/output/models/epoch.080.pt
MEMORY: main proc python3.11(2450) increased RSS: rss=2.1GB pss=1.5GB uss=1.0GB shared=1.1GB
MEMORY: main proc python3.11(2448) increased RSS: rss=2.1GB pss=1.5GB uss=1.0GB shared=1.1GB
MEMORY: main proc python3.11(2447) increased RSS: rss=2.1GB pss=1.5GB uss=1.0GB shared=1.1GB
[2024-02-05 12:53:48,784] INFO: Run time: 0:02:56 CPU: 0.20% RSS: 15.94GB VMS: 612.68GB
MEMORY: total (main 2450, 2024-02-05, 12:53:48, 16 procs): pss=3.7GB uss=2.9GB
MEMORY: total (main 2448, 2024-02-05, 12:53:48, 16 procs): pss=3.7GB uss=2.9GB
MEMORY: total (main 2447, 2024-02-05, 12:53:49, 16 procs): pss=3.7GB uss=2.9GB
MEMORY: main proc python3.11(2449) increased RSS: rss=2.2GB pss=1.5GB uss=1.1GB shared=1.1GB
MEMORY: total (main 2449, 2024-02-05, 12:53:50, 16 procs): pss=3.5GB uss=2.7GB
  epoch 80, global train step 193041
  epoch 80, global train step 193041
...
ep 81 train, step 3, total 2.408, loss_ctc 29.650, loss_att 35.661, acc 0.608, loss 33.858, num_seqs 17, max_size:time 74008, max_size:out-spatial 17, mem_usage:cuda:3 7.7GB, 1.037 sec/step
ep 81 train, step 3, total 2.212, loss_ctc 26.790, loss_att 29.600, acc 0.663, loss 28.757, num_seqs 16, max_size:time 78056, max_size:out-spatial 17, mem_usage:cuda:1 7.5GB, 1.295 sec/step
ep 81 train, step 3, total 2.575, loss_ctc 28.373, loss_att 35.880, acc 0.644, loss 33.628, num_seqs 17, max_size:time 74160, max_size:out-spatial 18, mem_usage:cuda:2 7.5GB, 1.298 sec/step
ep 81 train, step 3, total 2.823, loss_ctc 34.004, loss_att 41.347, acc 0.661, loss 39.144, num_seqs 15, max_size:time 81400, max_size:out-spatial 22, mem_usage:cuda:0 7.4GB, 1.299 sec/step
ep 81 train, step 4, total 2.704, loss_ctc 34.874, loss_att 42.986, acc 0.623, loss 40.553, num_seqs 13, max_size:time 92841, max_size:out-spatial 20, mem_usage:cuda:0 7.4GB, 1.113 sec/step
ep 81 train, step 4, total 3.287, loss_ctc 40.730, loss_att 48.615, acc 0.594, loss 46.249, num_seqs 14, max_size:time 87120, max_size:out-spatial 21, mem_usage:cuda:3 7.7GB, 1.372 sec/step
ep 81 train, step 4, total 3.295, loss_ctc 41.731, loss_att 52.723, acc 0.471, loss 49.426, num_seqs 14, max_size:time 86416, max_size:out-spatial 19, mem_usage:cuda:1 7.5GB, 1.122 sec/step
ep 81 train, step 4, total 2.861, loss_ctc 35.837, loss_att 45.660, acc 0.526, loss 42.713, num_seqs 14, max_size:time 87911, max_size:out-spatial 18, mem_usage:cuda:2 7.5GB, 1.123 sec/step
...
ep 81 train, step 2408, total 2.100, loss_ctc 67.050, loss_att 101.454, acc 0.636, loss 91.133, num_seqs 5, max_size:time 230960, max_size:out-spatial 54, mem_usage:cuda:0 8.9GB, 1.168 sec/step
ep 81 train, step 2408, total 2.690, loss_ctc 109.015, loss_att 133.119, acc 0.464, loss 125.888, num_seqs 5, max_size:time 230080, max_size:out-spatial 49, mem_usage:cuda:2 8.9GB, 1.166 sec/step
ep 81 train, step 2408, total 2.696, loss_ctc 101.865, loss_att 125.834, acc 0.506, loss 118.643, num_seqs 5, max_size:time 250185, max_size:out-spatial 48, mem_usage:cuda:3 8.9GB, 1.156 sec/step
ep 81 train, step 2408, total 2.678, loss_ctc 99.255, loss_att 127.715, acc 0.608, loss 119.177, num_seqs 4, max_size:time 256081, max_size:out-spatial 59, mem_usage:cuda:1 9.0GB, 1.173 sec/step
RuntimeError: cuFFT error: CUFFT_INTERNAL_ERROR 
Unhandled exception <class 'RuntimeError'> in thread <_MainThread(MainThread, started 140152194236416)>, proc 2447.
 
...
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/functional.py", line 650, in stft
    line: return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
                          normalized, onesided, return_complex)
    locals:
      _VF = <global> <module 'torch._VF'>
      _VF.stft = <global> <built-in method stft of type object at 0x7f774efaeaa0>
      input = <local> tensor[5, 231312] n=1156560 (4.4Mb) x∈[-1.000, 1.000] μ=7.307e-05 σ=0.093 cuda:0
      n_fft = <local> 512
      hop_length = <local> 160
      win_length = <local> 512
      window = <local> tensor[512] 2Kb x∈[0., 1.000] μ=0.500 σ=0.354 cuda:0
RuntimeError: cuFFT error: CUFFT_INTERNAL_ERROR

Module call stack:
(DistributedDataParallel.forward) (unknown)
(_WrappedModuleRunStep.forward) (unknown)
(ESPnetASRModel.forward) (root)
(ESPnetASRModel.encode) (root)
(DefaultFrontend.forward) frontend
(Stft.forward) frontend.stft

Then it happened again, now a bit earlier:

--------------------- Slurm Task Prolog ------------------------
Job ID: 4417153
Job name: i6_core.returnn.training.ReturnnTrainingJob.TwlSCwc9j5iU.run 
Host: cn-262
Date: Tue Feb  6 12:07:50 AM CET 2024
User: zeyer
Slurm account: hlt
Slurm partition: gpu_11gb
Work dir: 
------------------
...
ep 81 train, step 1421, total 2.562, loss_ctc 41.284, loss_att 52.191, acc 0.659, loss 48.919, num_seqs 11, max_size:time 111937, max_size:out-spatial 27, mem_usage:cuda:1 8.9GB, 0.821 sec/step
ep 81 train, step 1421, total 2.660, loss_ctc 55.430, loss_att 68.206, acc 0.562, loss 64.373, num_seqs 10, max_size:time 121793, max_size:out-spatial 28, mem_usage:cuda:0 8.9GB, 0.828 sec/step
ep 81 train, step 1421, total 2.574, loss_ctc 16.478, loss_att 18.677, acc 0.749, loss 18.017, num_seqs 26, max_size:time 48313, max_size:out-spatial 12, mem_usage:cuda:2 8.9GB, 0.831 sec/step
ep 81 train, step 1421, total 2.693, loss_ctc 32.544, loss_att 38.711, acc 0.647, loss 36.861, num_seqs 16, max_size:time 77000, max_size:out-spatial 19, mem_usage:cuda:3 8.8GB, 0.939 sec/step
RuntimeError: cuFFT error: CUFFT_INTERNAL_ERROR 
Unhandled exception <class 'RuntimeError'> in thread <_MainThread(MainThread, started 140303148154880)>, proc 3332.

...
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/functional.py", line 650, in stft
    line: return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
                          normalized, onesided, return_complex)
    locals:
      _VF = <global> <module 'torch._VF'>
      _VF.stft = <global> <built-in method stft of type object at 0x7f9a747aeaa0>
      input = <local> tensor[11, 115969] n=1275659 (4.9Mb) x∈[-1.072, 1.125] μ=0.002 σ=0.113 cuda:0
      n_fft = <local> 512
      hop_length = <local> 160
      win_length = <local> 512
      window = <local> tensor[512] 2Kb x∈[0., 1.000] μ=0.500 σ=0.354 cuda:0
RuntimeError: cuFFT error: CUFFT_INTERNAL_ERROR

Module call stack:
(DistributedDataParallel.forward) (unknown)
(_WrappedModuleRunStep.forward) (unknown)
(ESPnetASRModel.forward) (root)
(ESPnetASRModel.encode) (root)
(DefaultFrontend.forward) frontend
(Stft.forward) frontend.stft

Again the same:

--------------------- Slurm Task Prolog ------------------------
Job ID: 4422770
Job name: i6_core.returnn.training.ReturnnTrainingJob.TwlSCwc9j5iU.run
Host: cn-262
Date: Wed Feb  7 09:25:17 AM CET 2024
User: zeyer
Slurm account: hlt
Slurm partition: gpu_11gb
Work dir: 
------------------
...
ep 81 train, step 2407, total 2.436, loss_ctc 89.442, loss_att 119.651, acc 0.578, loss 110.588, num_seqs 5, max_size:time 231440, max_size:out-spatial 53, mem_usage:cuda:0 8.9GB, 1.724 sec/step
ep 81 train, step 2407, total 3.195, loss_ctc 117.104, loss_att 137.885, acc 0.487, loss 131.651, num_seqs 5, max_size:time 253969, max_size:out-spatial 45, mem_usage:cuda:2 8.9GB, 1.730 sec/step
ep 81 train, step 2407, total 2.805, loss_ctc 119.414, loss_att 131.126, acc 0.546, loss 127.612, num_seqs 4, max_size:time 256609, max_size:out-spatial 48, mem_usage:cuda:1 9.0GB, 1.729 sec/step
ep 81 train, step 2407, total 2.291, loss_ctc 77.118, loss_att 106.382, acc 0.565, loss 97.603, num_seqs 5, max_size:time 227760, max_size:out-spatial 45, mem_usage:cuda:3 8.9GB, 1.739 sec/step
ep 81 train, step 2408, total 2.130, loss_ctc 68.057, loss_att 102.882, acc 0.625, loss 92.435, num_seqs 5, max_size:time 230960, max_size:out-spatial 54, mem_usage:cuda:0 8.9GB, 0.849 sec/step
ep 81 train, step 2408, total 2.683, loss_ctc 100.435, loss_att 127.488, acc 0.617, loss 119.372, num_seqs 4, max_size:time 256081, max_size:out-spatial 59, mem_usage:cuda:1 9.0GB, 0.849 sec/step
ep 81 train, step 2408, total 2.674, loss_ctc 108.841, loss_att 132.129, acc 0.456, loss 125.143, num_seqs 5, max_size:time 230080, max_size:out-spatial 49, mem_usage:cuda:2 8.9GB, 0.852 sec/step
ep 81 train, step 2408, total 2.635, loss_ctc 98.556, loss_att 123.398, acc 0.522, loss 115.945, num_seqs 5, max_size:time 250185, max_size:out-spatial 48, mem_usage:cuda:3 8.9GB, 0.844 sec/step
RuntimeError: cuFFT error: CUFFT_INTERNAL_ERROR
Unhandled exception <class 'RuntimeError'> in thread <_MainThread(MainThread, started 139981035896832)>, proc 51917.

...
    line: feats, feats_lengths = self.frontend(speech, speech_lengths)
    locals:
      feats = <not found>
      feats_lengths = <not found>
      self = <local> ESPnetASRModel( 
                       (frontend): DefaultFrontend( 
                         (stft): Stft(n_fft=512, win_length=512, hop_length=160, center=True, normalized=False, onesided=True)
                         (frontend): Frontend()
                         (logmel): LogMel(sr=16000, n_fft=512, n_mels=80, fmin=0, fmax=8000.0, htk=False)
                       )
                       (specaug): SpecAug(
                         (t...
      self.frontend = <local> DefaultFrontend(
                                (stft): Stft(n_fft=512, win_length=512, hop_length=160, center=True, normalized=False, onesided=True)
                                (frontend): Frontend()
                                (logmel): LogMel(sr=16000, n_fft=512, n_mels=80, fmin=0, fmax=8000.0, htk=False)
                              )
      speech = <local> tensor[5, 230800] n=1154000 (4.4Mb) x∈[-1.000, 1.000] μ=7.285e-05 σ=0.093 cuda:0
      speech_lengths = <local> tensor[5] i32 x∈[230400, 230800] μ=2.307e+05 σ=165.892 [230720, 230800, 230720, 230800, 230400]
...
  File "/u/zeyer/setups/combined/2021-05-31/tools/espnet/espnet2/layers/stft.py", line 104, in Stft.forward
    line: output = torch.stft(input, **stft_kwargs)
    locals:
      output = <not found>
      torch = <global> <module 'torch' from '/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/__init__.py'>
      torch.stft = <global> <function stft at 0x7f4f77784e00>
      input = <local> tensor[5, 230800] n=1154000 (4.4Mb) x∈[-1.000, 1.000] μ=7.285e-05 σ=0.093 cuda:0
      stft_kwargs = <local> {'n_fft': 512, 'win_length': 512, 'hop_length': 160, 'center': True, 'window': tensor[512] 2Kb x∈[0., 1.000] μ=0.500 σ=0.354 cuda:0, 'normalized': False, 'onesided': True, 'return_complex': True}, len = 8
  File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/functional.py", line 650, in stft
    line: return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
                          normalized, onesided, return_complex)
    locals:
      _VF = <global> <module 'torch._VF'>
      _VF.stft = <global> <built-in method stft of type object at 0x7f4f751aeaa0>
      input = <local> tensor[5, 231312] n=1156560 (4.4Mb) x∈[-1.000, 1.000] μ=7.307e-05 σ=0.093 cuda:0
      n_fft = <local> 512
      hop_length = <local> 160
      win_length = <local> 512
      window = <local> tensor[512] 2Kb x∈[0., 1.000] μ=0.500 σ=0.354 cuda:0
RuntimeError: cuFFT error: CUFFT_INTERNAL_ERROR

Module call stack:
(DistributedDataParallel.forward) (unknown)
(_WrappedModuleRunStep.forward) (unknown)
(ESPnetASRModel.forward) (root)
(ESPnetASRModel.encode) (root)
(DefaultFrontend.forward) frontend
(Stft.forward) frontend.stft

@albertz
Member Author

albertz commented Feb 7, 2024

I reproduced the problem in a simple script: https://github.com/albertz/playground/blob/master/test-torch-stft-cufft-internal-error.py

The error often happens when I allocate most of the GPU memory and then do a couple of torch.stft calls. This is the same situation as in the training runs reported here.
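For reference, this is a minimal sketch of that reproducer pattern, not the actual script linked above: fill most of the free GPU memory, then run torch.stft a few times with the same parameters as in the failing training steps (n_fft=512, hop_length=160, win_length=512). The memory margin, iteration count, and window type are assumptions here.

```python
# Sketch of the reproducer pattern described above (see the linked
# test-torch-stft-cufft-internal-error.py for the actual script).
import torch

assert torch.cuda.is_available()
dev = torch.device("cuda")

# Fill most of the free GPU memory, leaving only a small margin.
free_bytes, _total_bytes = torch.cuda.mem_get_info(dev)
margin_bytes = 512 * 1024 * 1024  # leave ~0.5GB free (arbitrary choice)
filler = torch.empty(
    max(0, (free_bytes - margin_bytes) // 4), dtype=torch.float32, device=dev
)

# Run a few STFTs with the same parameters as in the failing training steps.
# Hann window as a stand-in; the window stats in the logs are consistent with it.
window = torch.hann_window(512, device=dev)
for i in range(10):
    x = torch.randn(5, 230800, device=dev)
    y = torch.stft(
        x,
        n_fft=512,
        hop_length=160,
        win_length=512,
        window=window,
        center=True,
        return_complex=True,
    )
    torch.cuda.synchronize()
    print(f"stft call {i} ok, output shape {tuple(y.shape)}")
```

On an affected setup, the expectation from the description above is that one of these stft calls fails with CUFFT_INTERNAL_ERROR.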

I also reported the problem upstream:
pytorch/pytorch#119420
