Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 31 , 12, 2048] #3088
I also got similar errors lately. In my case it often occurs at the end of an epoch.
Reducing the batch size helped me to get this error later in the training; this may be a workaround you can try. |
I have tried reducing the batch size, but to no avail.
…On Fri, Jun 19, 2020, 5:38 AM DanBmh ***@***.***> wrote:
> I also did get similar errors lately. In my case it often occurs at the end of an epoch. Training works normally for a few epochs before I get the error. Mine has some different numbers than yours:
> Internal: Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 1101, 30, 2048]
> [[{{node tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3}}]]
> [[tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3/_81]]
> Reducing the batch size helped me to get this error later in the training, this may be a workaround you can try.
|
@andrenatal What version of CuDNN are you using? Currently TensorFlow 1.15 depends on CUDA 10.0 and CuDNN v7.6. |
I tried all versions that @reuben suggested, including CuDNN 7.6 |
@andrenatal I know you already tested a lot of things, but this forum entry is interesting: https://forums.developer.nvidia.com/t/gpu-crashes-when-running-machine-learning-models/108252
Can you give it a spin with Python 3.7? |
We tried running it with Python 3.7 but we faced the same error. |
Then I'm sorry but the only way to get something actionable is bisecting on the dataset to identify the offending files and debug from there. |
@lissyx But because, when it fails, it always consistently fails on the same step and thus the same batch, I tried to isolate things. I made some discoveries though:
So it seems that the combination (and probably the order) of certain samples in a batch consistently blows up with CUDNN. I think the dataset subset is small enough to provide to you (around 20 MB of samples), if that could help you determine why it actually blows up. |
If it's a bug in TensorFlow / CUDNN, it's hardly something we can help with. I'm already lacking time for a lot of other urgent matters, and it seems you have more background and knowledge on the issue than I do ... |
It would still be interesting if you could share the order when it works, when it fails, and where it fails. |
My knowledge is merely from having reduced the problem space, not of the TensorFlow / DeepSpeech internals. Another question: I saw that the inference side of DeepSpeech now seems to work on TensorFlow 2.x; how much work would the training side be? |
What would you like to have shared: only the CSV, or also the samples? (I think the problem is somewhere in the samples and not in the transcripts, but of course I could be wrong.) |
@reuben had a look at that, he knows better.
I think you would need to share audio + CSV.
Sure, but given the current workload, I really cannot promise having time to reproduce that: I am still lagging behind on a lot of other super-urgent matters, sadly (thank you covid-19). |
Lots |
OK, will do.
OK, I will do some more experiments then, try to pinpoint it some more. |
Got the results of my extended testing, based on a minimal dataset of 3x 32 samples; as I use a batch size of 32, that is 3 steps. I named the batches A, B and C, and as a whole they are ordered by wav_filesize. I have done runs with all sorts of combinations of these batches (concatenated in the order given by the name of the CSV file); if a batch name is appended with an "s", that batch in itself is still ordered by wav_filesize; if appended with an "r", that batch is randomly shuffled. The runs do 3 epochs. In the tar.gz file I included:
As a summary of the results:
My interpretation of these results:
But what is so special about the content of batch B that it blows up with CUDNN ... (before you ask, it is not only this batch B, there are multiple such batches in my large datasets, this is one example with the shortest samples) |
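For illustration, here is a minimal sketch of how such debug batch CSVs could be built. This is not from the thread; it assumes pandas and DeepSpeech-style CSVs with a wav_filesize column, and the file names mirror the naming scheme described above:

```python
# Minimal sketch: carve a wav_filesize-sorted training CSV into three
# debug batches A, B, C of 32 samples each, then write sorted ("s") and
# shuffled ("r") variants plus an example concatenated run file.
import pandas as pd

BATCH_SIZE = 32

df = pd.read_csv("train.csv").sort_values("wav_filesize")
batches = {name: df.iloc[i * BATCH_SIZE:(i + 1) * BATCH_SIZE]
           for i, name in enumerate("ABC")}

for name, batch in batches.items():
    # "s" variant: samples inside the batch stay ordered by wav_filesize
    batch.to_csv(f"train_debug_{name}s.csv", index=False)
    # "r" variant: same samples, randomly shuffled within the batch
    batch.sample(frac=1, random_state=42).to_csv(
        f"train_debug_{name}r.csv", index=False)

# Combinations are concatenated in the order given by the csv name, e.g.:
pd.concat([batches["A"], batches["B"], batches["C"]]).to_csv(
    "train_debug_As_Bs_Cs.csv", index=False)
```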
Nice @applied-machinelearning. Do you think you could reduce batch B even further, to a smaller set of files? Maybe if we know which file(s) trigger the behavior, it might be easier to check. |
I could try by reducing the training batch size and see if I can find even smaller batches that fail (from previous tests I think it will end at either 2 or 4, but not 1). I will give it a try tomorrow. |
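A plain bisection over the suspect batch could look like the sketch below. This is hypothetical, not from the thread: the file names are made up, the flags follow the DeepSpeech training entry point, and since the failure seems to depend on the combination of samples (it reportedly never reduces to a single file), the halving assumption can break and may need manual cross-checking:

```python
# Hypothetical bisection sketch: repeatedly halve the suspect batch and
# re-run a short training job until only a couple of samples remain.
import subprocess

import pandas as pd

def fails(csv_path: str) -> bool:
    """Return True if a short training run on csv_path crashes."""
    result = subprocess.run(
        ["python", "DeepSpeech.py",
         "--train_files", csv_path,
         "--train_batch_size", "2",
         "--epochs", "1"])
    return result.returncode != 0

suspect = pd.read_csv("train_debug_Bs.csv")
while len(suspect) > 2:
    half = len(suspect) // 2
    first = suspect.iloc[:half]
    first.to_csv("bisect.csv", index=False)
    # If the first half alone no longer fails, assume the failure lives in
    # the second half; with combination-dependent failures this assumption
    # is only a heuristic.
    suspect = first if fails("bisect.csv") else suspect.iloc[half:]

print(suspect["wav_filename"].tolist())
```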
So I see you are basically reusing the TensorFlow official Docker image and you got inspiration from https://github.com/Common-Voice/commonvoice-fr/blob/master/DeepSpeech/Dockerfile.train :) That's good, that should make it easier for us to try and reproduce locally. Can you share more details on your actual underlying system and hardware, in case it might be related? |
OK I have done some more runs: I ran train_debug_As_Bs_Cs.csv with batch sizes 1 and 2:
So I made some new csv files with:
And I made some variant of that:
The results of that:
My interpretation of this all:
So it is a bit odd; I'm starting to wonder if this is some edge case where we hit some math operation blowing up. I'm a bit lost now; you have more insight into how things get processed, so hopefully you have some more ideas based on that. CSVs and logs are attached (sample files from the previous post can be used). |
Host is an AMD Ryzen system with 32 GB of memory and a GTX 1070 with 8 GB of memory, running Debian. Thanks for looking into it! |
Thanks, running Sid as well here, so I'm on a similar setup, except I have 2x (faster, more memory) GPUs. I hope it will still allow me to repro.
I'm running buster on that machine. When I woke up this morning it dawned on me that I forgot to post the hyperparameters: run_deepspeech_var_batchsize.sh.tar.gz. I hope you can reproduce and spot something! |
Looks like |
I'm not even able to get CUDA working so far in the dockerfile :/ |
Seems to be the same old weird nvidia/cuda/docker bug, after
|
@applied-machinelearning Good news, I repro your issue. |
@applied-machinelearning Not only do I repro, but |
Several people report a similar issue with NVIDIA drivers above a certain version: tensorflow/tensorflow#35950 (comment), and 431.36 would be a working one. |
That's true; perhaps it's my Dutch heritage that policies are nice when they make sense ;)
By the way, I'm wondering: do you know how often we still use the cached version on your larger dataset test?
|
Well, even the fix for the ruy computation on the just-released r2.2 was not taken, and was only merged on master. |
Well, I can understand why they want that; I guess in their position I'd do the same. Looks like things are moving now. I hope this can go into a 1.15.4, or in the worst case we need a statement on the consequences of the flag. |
The fix landed upstream: tensorflow/tensorflow#41832 |
We still have no feedback on whether a 1.15.4 can be issued for that. |
Perhaps we should try to stage it as a multi-stage rocket:
|
(1) and (2) go together; it won't get picked onto r1.15 if they don't intend to ship 1.15.4.
What for? Supporting TensorFlow wheel builds is a huge task; we stopped doing that as soon as we could.
Same, that requires us to build and support a TensorFlow wheel, which is a lot of work. |
If I look at https://github.com/tensorflow/tensorflow/commits/r1.15 I do see some (non-direct bug fix) commits after 1.15.3 without an immediate release, and even some very recent commits.
Depends a bit on what you provide. For the 2.x branches I do agree, but there has been very little (relevant) movement on the 1.15 branch.
Then maybe they are considering a 1.15.4?
You are highly underestimating:
Just building r1.15 for the purpose of those debugging steps took several local hacks. Re-using TensorFlow's CI Docker stuff also required a non-trivial amount of work. |
I confirm that the flag addressed my issues and that I managed to train and get a fully functioning model. |
There has been quite a lot of activity on the r1.15 branch of TensorFlow; I think we can safely hope for a 1.15.4 that ships with the fix now (current upstream r1.15 has merged the fix). I'll close this issue when 1.15.4 ships. |
Fix #3088: Use TensorFlow 1.15.4 with CUDNN fix
Still not working for me with up to date master and newly created docker container. |
Can you triple-check that you are running 1.15.4?
Maybe there are some other bugs. As you can see, it was quite painful to investigate even with a small repro dataset. I'm unfortunately not in a position to have the time to investigate like that anymore for the foreseeable future. |
Running
No problem for me; the solution is easy, so I will just add the extra flag everywhere. Not sure this helps, but for me the error always gets thrown in the validation phase; the first training epoch finishes without errors. |
I think @applied-machinelearning mentioned something like that on the upstream issue? |
Yeah, it is still on my todo list, but I have also still seen the error at least once. I think the pattern for this is when you have the same sequence lengths etc. in both the train and dev sets. It should be easily testable (just use the same CSV, and keep the ordering the same, for both the train and dev datasets), but I haven't come around to actually doing it. I hope to get to testing this tomorrow or this weekend. I'm still wondering if the whole caching idea doesn't do more harm than good. Unfortunately there was no reaction from the NVIDIA guy; it seems a new report is needed, which I will open after testing. But perhaps it is still a good idea to implement setting the environment variable from the DeepSpeech training code anyway? |
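If setting it from training code is wanted, something like the sketch below could work. Note the assumption: the flag name is not preserved in this thread, so the sketch assumes it is TF_CUDNN_RESET_RND_GEN_STATE, the environment variable discussed in the linked upstream issue tensorflow/tensorflow#35950; adjust if the actual name differs:

```python
# Sketch of setting the workaround from DeepSpeech's training entry point.
# The variable must be set before TensorFlow initializes CUDA/cuDNN,
# hence before the tensorflow import.
import os

os.environ.setdefault("TF_CUDNN_RESET_RND_GEN_STATE", "1")

import tensorflow as tf  # noqa: E402  -- deliberately after the env var
```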
Hmm, unfortunately I can't reproduce with what I thought could trigger it (running training and validation on the same CSVs, sorted by wav_filesize). :( |
It was very interesting following this thread! Learned a lot! Added it; I didn't notice any significant loss in performance. Edit: I am using the |
You should not need those with TensorFlow 1.15.4. |
Reducing the batch size from 64 to 32 for training and from 32 to 16 for the test and dev data solved this issue. |
Unfortunately, I am seeing for myself that the fix does not cover all cases: |
I also witnessed this, and I found it's related to the memory usage of the Python process. |
Had a similar issue when training a small model for Romansh (<15h). Turns out lowering the batch size wasn't enough. Hope this can help someone stuck with this error. |
For support and discussions, please use our Discourse forums.
If you've found a bug, or have a feature request, then please create an issue with the following information:
- Have I written custom code: no
- OS Platform and Distribution: Linux Ubuntu 18.04
- TensorFlow installed from: pip
- TensorFlow version: 1.15
- Python version: 3.5
- CUDA/cuDNN version: 10.0
- GPU model and memory: 4x GTX 1080 Ti
I'm getting the following error when using my pt-BR 8 kHz dataset to train. I have tried downgrading and upgrading CUDA, CuDNN, the NVIDIA drivers, and Ubuntu (16 and 18), and the error persists. I have tried with datasets of two different characteristics: 6 s and 15 s in length. Both contain audio at 8 kHz.