CudnnLSTM variable sequence length sometimes fails with CUDNN_STATUS_EXECUTION_FAILED #41630
Repro material. To repro with r1.15 TensorFlow on current Debian Sid, I needed to use pyenv to get a local Python 3.7 (the latest version supported by r1.15); if you run it on a system-provided 3.7, you may prefer to remove those lines and just set up a virtualenv. STR:
The zip file also includes a Dockerfile, if that helps with the repro.
More info / logs
Failure log (with some extra debugging of my own):
We verified a few things:
Please do not hesitate to ping me or @applied-machinelearning if you have trouble reproducing the issue, but with the provided material it should be straightforward.
@timshen91 can you PTAL?
@lissyx Can you try to fetch the cuDNN logs with the following env vars, and attach the somefile.log? (If it is too large, we might only need the last part, which contains the cudnnRNNForwardTrainingEx call.)
```sh
export CUDNN_LOGINFO_DBG=1
export CUDNN_LOGDEST_DBG=somefile.log
```
More details: https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#api-logging
issue3088.log
I swear I searched a long time for a cuDNN-related debug env variable but never found those :/
An initial investigation shows the padding mode is suspicious. We shouldn't see CUDNN_RNN_PADDED_IO_DISABLED here, since it is incompatible with the CUDNN_RNN_DATA_LAYOUT_SEQ_MAJOR_UNPACKED below. This case uses the variable sequence lengths [74, 75] with a max sequence length of 75, and the time/seq-major layout. From the logic here: https://github.com/tensorflow/tensorflow/blob/r1.15/tensorflow/core/kernels/cudnn_rnn_ops.cc#L944-L956, use_padded_io should be set to true.
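For clarity, here is a minimal sketch of what that linked selection logic boils down to (a paraphrase under my reading of it, not the verbatim TF source; the helper name is made up):

```cpp
#include <vector>

// Paraphrase of the r1.15 selection logic linked above (helper name is
// hypothetical): padded IO must be enabled as soon as any sequence in the
// batch is shorter than max_seq_length, which is exactly the [74, 75] /
// max 75 situation from the log.
bool ShouldUsePaddedIO(const std::vector<int>& seq_lengths,
                       int max_seq_length) {
  for (int len : seq_lengths) {
    if (len != max_seq_length) return true;  // at least one padded sequence
  }
  return false;  // all sequences are full-length; packed IO is fine
}
```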
Just one more thing: can you give it a shot with
Wait, I might have made a mistake when generating those logs and left a
Here is the log, with debug enabled, after ensuring the leftover from my previous attempt was gone. Now, here are the logs with the same TensorFlow setup. This does make me wonder whether we might be in trouble because of how we use CudnnLSTM: https://github.com/mozilla/DeepSpeech/blob/9e023660ef1c0947b0f8c8db54b5cc9174c7076a/training/deepspeech_training/train.py#L108-L132
So I verified, and changed our code to avoid this, like so:
Unfortunately, the issue still repros.
So @kaixih, what's your take given those new logs and the extra verification I performed on our code? Don't hesitate to ask if you need more logging; I'll provide it as quickly as I can :)
So, for now two things can be verified:
Yes, that is for sure.
Should I check that in the cudnn logs or at tensorflow framework level (i.e., return value of
I will verify that.
It is indeed weird: so far, it looks like we do return
So, it would look like we are calling
@kaixih So, after checking, I'm wondering if something is missing when creating the cache key. We use
which is obviously different from:
But according to the log above, we still had a cache hit on this. Checking the definition of IsCompatibleWith, one can see that this method does not take max_seq_length into account. At least, with that patch, I can no longer repro the issue, and I can see both cache misses and cache hits.
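A hedged reconstruction of what that first patch boils down to (not the verbatim diff; struct and field names are assumed from r1.15's cudnn_rnn_ops.cc):

```cpp
struct CudnnRnnModelShapes {
  // Field subset, assumed from the r1.15 sources.
  int num_layers, input_size, num_units, dir_count;
  int max_seq_length, batch_size;

  // Reconstruction of patch #1: also compare max_seq_length, so descriptors
  // built for different padded lengths no longer probe as "compatible".
  bool IsCompatibleWith(const CudnnRnnModelShapes& rhs) const {
    return num_layers == rhs.num_layers && input_size == rhs.input_size &&
           num_units == rhs.num_units && dir_count == rhs.dir_count &&
           max_seq_length == rhs.max_seq_length;  // the added comparison
  }
};
```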
I'm unsure if this is correct; maybe we should just include max_seq_length in the hash key itself. This patch also works (and honestly, it feels nicer):
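Again as a hedged reconstruction rather than the verbatim diff, reusing the struct from the previous sketch (the hash combiner below is a self-contained stand-in for TF's internal hashing helpers, not its actual API):

```cpp
#include <cstdint>
#include <initializer_list>

// Stand-in hash combiner; TF uses its own hashing utilities internally.
uint64_t CombineHashes(std::initializer_list<int64_t> values) {
  uint64_t h = 14695981039346656037ULL;  // FNV-1a style folding
  for (int64_t v : values)
    h = (h ^ static_cast<uint64_t>(v)) * 1099511628211ULL;
  return h;
}

// Reconstruction of patch #2: fold max_seq_length into the hash key itself,
// so configs differing only in padded length land in different buckets and
// probing can be skipped entirely.
uint64_t HashRnnConfig(const CudnnRnnModelShapes& s) {
  return CombineHashes({s.num_layers, s.input_size, s.num_units, s.dir_count,
                        s.max_seq_length,  // the added field
                        s.batch_size});
}
```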
What's your take on this, @kaixih? Does that look like the root cause of the problem to you?
@lissyx
Thanks, no offense but I'll claim victory once I get feedback from @kaixih :)
@lissyx Thanks, I agree with you that this should be the root cause. (This also reminds me that I probably forgot to add corresponding changes to the cached RNN desc searching back when adding the variable sequence length feature for cuDNN.)
For your two patches, I prefer to change both, or only the first one. The first one, about the IsCompatibleWith used here, is mainly for probing the hash table, but it might be used elsewhere to compare RNN states, and I think it makes sense to distinguish RNNs with different max_seq_length. The second one, about the hash function, is mainly for generating the hash key for the RNN states, and your patch could give us slightly faster searching (because we can skip the probing). So I think the first one is probably a necessity, or we simply change both.
For the next step, can you create a PR to fix this bug and maybe link to this issue? Thanks.
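To make the distinction concrete, here is a toy demonstration (plain C++, not the TF code) of why the compatibility check is the correctness-critical side: the hash only picks the bucket, while the equality predicate decides whether a cached entry is actually reused.

```cpp
#include <cstddef>
#include <iostream>
#include <unordered_map>

struct Key { int num_units; int max_seq_length; };

// Pre-fix behavior in miniature: neither the hash nor the equality check
// looks at max_seq_length, so a descriptor cached for length 75 is
// silently reused for length 74.
struct BuggyHash {
  std::size_t operator()(const Key& k) const { return k.num_units; }
};
struct BuggyEq {
  bool operator()(const Key& a, const Key& b) const {
    return a.num_units == b.num_units;
  }
};

int main() {
  std::unordered_map<Key, const char*, BuggyHash, BuggyEq> cache;
  cache[{2048, 75}] = "descriptor built for max_seq_length=75";
  std::cout << cache.count({2048, 74}) << "\n";  // 1: a (wrong) cache hit
}
```

Fixing the equality check alone already turns the bogus hit into a miss; fixing the hash as well just avoids probing the wrong bucket in the first place.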
Sure!
I suspect there won't be a 1.15 dot release including that kind of change, unfortunately?
Right, I think Google only fixes major issues in 1.15. Can you first create a PR against master? Later they might pick it into 1.15 if necessary. @sanjoy
@kaixih So I have been running experiments on ~1000h of French training data, and setting TF_CUDNN_RESET_RND_GEN_STATE=1 avoids the failure. Does that sound reasonable to you? Are there any side effects to using that env variable we should know about?
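For context, my working assumption about that flag (a hypothetical sketch of the mechanism as I understand it, not a confirmed reading of the TF kernel) is that it makes the op rebuild its RNN descriptor each step rather than reuse a cached one, so a stale cache entry built for a different max_seq_length can no longer bite:

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical sketch: how an env-var kill switch for descriptor caching
// could be read. If TF behaves along these lines, the trade-off would be
// extra descriptor-creation work on every step, not a correctness cost.
bool ResetRndGenStateEnabled() {
  const char* v = std::getenv("TF_CUDNN_RESET_RND_GEN_STATE");
  return v != nullptr && std::strcmp(v, "1") == 0;
}
```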
@kaixih Gentle ping: we are close to releasing 1.0, and we'd like to know exactly what the consequences of that flag are, so that we can either force it by default until a potential 1.15.4 is released, or at least recommend that people use it. Thanks!
@kaixih If I look at the code of:
Fix #41630: include max_seq_length in cudnn descriptor cache key
Just letting you guys know that on my system I need to set the TF_CUDNN_RESET_RND_GEN_STATE=1 flag to make it work, even after upgrading to 1.15.4.
Me as well, it does not work without export TF_CUDNN_RESET_RND_GEN_STATE=1 in the .sh file I run to fine-tune the model. But it does work with it! So I am a happy camper again. Perhaps the documentation should be updated again, @lissyx?
No, it means there might be other cases triggering some issue. You are welcome to investigate the internals of TensorFlow / CUDNN.
I am in the same boat as you, insanely busy with work and my toddler. What I meant is that it might be a good idea to mention this TensorFlow cuDNN flag in the documentation. I might look into this in February, though.
Good luck; been there, done that, I know how it is for you right now.
As you can see from the discussion, this flag is a workaround, not a fix. If there is documentation that might benefit from it, it would be DeepSpeech's rather than TensorFlow's, as far as I can tell.
Please understand that tracking down and fixing the root cause in TensorFlow to ensure DeepSpeech users are safe already took me weeks, specifically because of the lack of reproducibility. That was when I was working 100% of my time on the DeepSpeech project, which is not the case anymore, so I'm sorry, but I really can't look into that.
Revert "Fix #41630: include max_seq_length in cudnn descriptor cache key" This reverts commit cc3e5a0.
System information
You can collect some of this information using our environment capture script.
You can also obtain the TensorFlow version with:
python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
which reports: v1.15.3-0-g4386a6640c

Describe the current behavior
Training with some datasets triggers CUDNN_STATUS_EXECUTION_FAILED.
Describe the expected behavior
Training should succeed, or TensorFlow / cuDNN should expose a more actionable error.
Standalone code to reproduce the issue
Will be provided in a follow-up comment.
Other info / logs
Will be provided in a follow-up comment. A rather noisy debugging session can be seen at mozilla/DeepSpeech#3088.