RuntimeError: cuFFT error: CUFFT_INTERNAL_ERROR #1465
I'm closing this for now, assuming a hardware issue. Reopen if there is any indication that there is some other problem, or something we can do about it.
I'm getting this quite frequently now, in most cases in a multi-GPU training setup on Nvidia GTX 1080 GPUs (but probably only because that is currently my main setup, and otherwise unrelated). It is very suspicious that it seems deterministic in some cases, i.e. after restarting a couple of times, I get the error in exactly the same step. That is unlikely for a hardware issue and looks more like a software issue. It's also unlikely that something is wrong on the RETURNN side; probably it's in CUDA or the GPU driver. But maybe there is still something we can do about it? In any case, I'm reopening this now to collect some more information.
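Regarding whether there is something we can do about it: purely as a hedged sketch (my own idea, not anything verified in this thread), if the generic error is caused by a failed internal allocation, one could catch it, free PyTorch's cached GPU memory, and retry. The helper `stft_with_retry` here is hypothetical:

```python
import torch


def stft_with_retry(signal: torch.Tensor, n_fft: int, retries: int = 2) -> torch.Tensor:
    """Run torch.stft, retrying after freeing cached GPU memory on cuFFT errors."""
    for attempt in range(retries + 1):
        try:
            return torch.stft(signal, n_fft=n_fft, return_complex=True)
        except RuntimeError as exc:
            if "CUFFT_INTERNAL_ERROR" not in str(exc) or attempt == retries:
                raise
            # Sketch/assumption: if the error is really a failed internal
            # allocation, releasing the allocator's cached blocks might let
            # the cuFFT plan creation succeed on the next attempt.
            torch.cuda.empty_cache()
```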
Next error, different host, but same step:
Then again, now a bit earlier:
Again same:
I reproduced the problem in a simple script: https://github.com/albertz/playground/blob/master/test-torch-stft-cufft-internal-error.py It often happens when I allocate most of the GPU memory and then do a couple of STFT calls. I also reported the problem upstream.
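For reference, a minimal sketch of that kind of reproduction (the shapes and the 90% fill factor are illustrative assumptions, not copied from the linked script):

```python
import torch

dev = torch.device("cuda")

# First fill up most of the free GPU memory with a dummy allocation.
free_bytes, _total_bytes = torch.cuda.mem_get_info(dev)
filler = torch.empty(int(free_bytes * 0.9), dtype=torch.uint8, device=dev)

# Then run a couple of STFTs. With almost no free memory left, cuFFT plan
# creation can fail, and it surfaces as the generic CUFFT_INTERNAL_ERROR
# rather than a proper out-of-memory error.
signal = torch.randn(8, 16000, device=dev)
for _ in range(10):
    spec = torch.stft(signal, n_fft=512, return_complex=True)
torch.cuda.synchronize(dev)
```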
After quite a while of training (597 subepochs) with the PyTorch backend, I got:
I'm not sure yet whether this is reproducible or was just bad luck.
The RSS of 30.67 GB is suspiciously close to the memory limit (30 GB), so maybe some internal cuFFT malloc failed and just surfaced as this generic error?
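To test that hypothesis, one could log the process's peak RSS against the configured limit, e.g. with a small helper like this (assuming Linux; whether the 30G limit is enforced as an RSS limit or an address-space rlimit depends on how the job is launched):

```python
import resource


def log_rss_vs_limit():
    # On Linux, ru_maxrss is reported in kilobytes.
    peak_rss_gb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024 ** 2
    soft, _hard = resource.getrlimit(resource.RLIMIT_AS)
    if soft == resource.RLIM_INFINITY:
        limit = "unlimited"
    else:
        limit = "%.2f GB" % (soft / 1024 ** 3)
    print("peak RSS: %.2f GB, address-space limit: %s" % (peak_rss_gb, limit))
```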