
Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 31 , 12, 2048] #3088

Closed
andrenatal opened this issue Jun 18, 2020 · 155 comments
Labels
upstream-issue This bug is actually an upstream issue

@andrenatal
Contributor

For support and discussions, please use our Discourse forums.

If you've found a bug, or have a feature request, then please create an issue with the following information:

  • Have I written custom code (as opposed to running examples on an unmodified clone of the repository):
    no
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
    Linux Ubuntu 18.04
  • TensorFlow installed from (our builds, or upstream TensorFlow):
    pip
  • TensorFlow version (use command below):
    1.15
  • Python version:
    3.5
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version:
    10.0
  • GPU model and memory:
    4 gtx 1080 Ti
  • Exact command to reproduce:
andre@andrednn:~/projects/DeepSpeech$ more .compute_msprompts
#!/bin/bash

set -xe

#apt-get install -y python3-venv libopus0

#python3 -m venv /tmp/venv
#source /tmp/venv/bin/activate

#pip install -U setuptools wheel pip
#pip install .
#pip uninstall -y tensorflow
#pip install tensorflow-gpu==1.14

#mkdir -p ../keep/summaries

data="${SHARED_DIR}/data"
fis="${data}/LDC/fisher"
swb="${data}/LDC/LDC97S62/swb"
lbs="${data}/OpenSLR/LibriSpeech/librivox"
cv="${data}/mozilla/CommonVoice/en_1087h_2019-06-12/clips"
npr="${data}/NPR/WAMU/sets/v0.3"

python -u DeepSpeech.py \
  --train_files /home/andre/projects/corpora/20200404084521_msprompts_90_6s/deepspeech/treino_filtered_alphabet.csv \
  --dev_files /home/andre/projects/corpora/20200404084521_msprompts_90_6s/deepspeech/dev_filtered_alphabet.csv \
  --test_files /home/andre/projects/corpora/20200404084521_msprompts_90_6s/deepspeech/teste_filtered_alphabet.csv \
  --train_batch_size 12 \
  --dev_batch_size 24 \
  --test_batch_size 24 \
  --scorer ~/projects/corpora/deepspeech-pretrained-ptbr/kenlm.scorer \
  --alphabet_config_path ~/projects/corpora/deepspeech-pretrained-ptbr/alphabet.txt \
  --train_cudnn \
  --n_hidden 2048 \
  --learning_rate 0.0001 \
  --dropout_rate 0.40 \
  --epochs 150 \
  --noearly_stop \
  --audio_sample_rate 8000 \
  --save_checkpoint_dir ~/projects/corpora/deepspeech-fulltrain-ptbr  \
  --use_allow_growth \
  --log_level 0

I'm getting the following error when using my pt-BR 8 kHz dataset for training. I have tried downgrading and upgrading CUDA, cuDNN, the NVIDIA drivers, and Ubuntu (16.04 and 18.04), and the error persists. I have tried with two datasets of different characteristics, 6 s and 15 s in length; both contain 8 kHz audio.

andre@andrednn:~/projects/DeepSpeech$ bash .compute_msprompts
+ data=/data
+ fis=/data/LDC/fisher
+ swb=/data/LDC/LDC97S62/swb
+ lbs=/data/OpenSLR/LibriSpeech/librivox
+ cv=/data/mozilla/CommonVoice/en_1087h_2019-06-12/clips
+ npr=/data/NPR/WAMU/sets/v0.3
+ python -u DeepSpeech.py --train_files /home/andre/projects/corpora/20200404084521_msprompts_90_6s/deepspeech/treino_filtered_alphabet.csv --dev_files /home/andre/projects/corpora/20200404084521_msprompts_90_6s/deepspeech/dev_filtered_alphabet.csv --test_files /home/andre/projects/corpora/20200404084521_msprompts_90_6s/deepspeech/teste_filtered_alphabet.csv --train_batch_size 12 --dev_batch_size 24 --test_batch_size 24 --scorer /home/andre/projects/corpora/deepspeech-pretrained-ptbr/kenlm.scorer --alphabet_config_path /home/andre/projects/corpora/deepspeech-pretrained-ptbr/alphabet.txt --train_cudnn --n_hidden 2048 --learning_rate 0.0001 --dropout_rate 0.40 --epochs 150 --noearly_stop --audio_sample_rate 8000 --save_checkpoint_dir /home/andre/projects/corpora/deepspeech-fulltrain-ptbr --use_allow_growth --log_level 0
2020-06-18 12:30:07.508455: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-06-18 12:30:07.531012: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3597670000 Hz
2020-06-18 12:30:07.531588: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5178d70 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-06-18 12:30:07.531608: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-06-18 12:30:07.533960: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-06-18 12:30:09.563468: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5416390 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-06-18 12:30:09.563492: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
2020-06-18 12:30:09.563497: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (1): GeForce GTX 1080 Ti, Compute Capability 6.1
2020-06-18 12:30:09.563501: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (2): GeForce GTX 1080 Ti, Compute Capability 6.1
2020-06-18 12:30:09.563505: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (3): GeForce GTX 1080 Ti, Compute Capability 6.1
2020-06-18 12:30:09.570577: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:05:00.0
2020-06-18 12:30:09.571728: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:06:00.0
2020-06-18 12:30:09.572862: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 2 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:09:00.0
2020-06-18 12:30:09.573993: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 3 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:0a:00.0
2020-06-18 12:30:09.574226: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-06-18 12:30:09.575280: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-06-18 12:30:09.576167: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-06-18 12:30:09.576401: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-06-18 12:30:09.577541: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-06-18 12:30:09.578426: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-06-18 12:30:09.581112: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-18 12:30:09.589736: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0, 1, 2, 3
2020-06-18 12:30:09.589770: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-06-18 12:30:09.594742: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-18 12:30:09.594757: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186]      0 1 2 3
2020-06-18 12:30:09.594763: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 0:   N Y Y Y
2020-06-18 12:30:09.594767: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 1:   Y N Y Y
2020-06-18 12:30:09.594770: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 2:   Y Y N Y
2020-06-18 12:30:09.594774: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 3:   Y Y Y N
2020-06-18 12:30:09.600428: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/device:GPU:0 with 10478 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:05:00.0, compute capability: 6.1)
2020-06-18 12:30:09.602038: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/device:GPU:1 with 10481 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:06:00.0, compute capability: 6.1)
2020-06-18 12:30:09.603572: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/device:GPU:2 with 10481 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:09:00.0, compute capability: 6.1)
2020-06-18 12:30:09.605112: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/device:GPU:3 with 10481 MB memory) -> physical GPU (device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:0a:00.0, compute capability: 6.1)
swig/python detected a memory leak of type 'Alphabet *', no destructor found.
W WARNING: You specified different values for --load_checkpoint_dir and --save_checkpoint_dir, but you are running training and testing in a single invocation. The testing step will respect --load_checkpoint_dir, and thus WILL NOT TEST THE CHECKPOINT CREATED BY THE TRAINING STEP. Train and test in two separate invocations, specifying the correct --load_checkpoint_dir in both cases, or use the same location for loading and saving.
2020-06-18 12:30:10.102127: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:05:00.0
2020-06-18 12:30:10.103272: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:06:00.0
2020-06-18 12:30:10.104379: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 2 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:09:00.0
2020-06-18 12:30:10.105484: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 3 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:0a:00.0
2020-06-18 12:30:10.105521: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-06-18 12:30:10.105533: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-06-18 12:30:10.105562: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-06-18 12:30:10.105574: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-06-18 12:30:10.105586: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-06-18 12:30:10.105597: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-06-18 12:30:10.105610: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-18 12:30:10.114060: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0, 1, 2, 3
WARNING:tensorflow:From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/data/ops/iterator_ops.py:347: Iterator.output_types (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.data.get_output_types(iterator)`.
W0618 12:30:10.218584 139639980619584 deprecation.py:323] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/data/ops/iterator_ops.py:347: Iterator.output_types (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.data.get_output_types(iterator)`.
WARNING:tensorflow:From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/data/ops/iterator_ops.py:348: Iterator.output_shapes (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.data.get_output_shapes(iterator)`.
W0618 12:30:10.218781 139639980619584 deprecation.py:323] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/data/ops/iterator_ops.py:348: Iterator.output_shapes (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.data.get_output_shapes(iterator)`.
WARNING:tensorflow:From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/data/ops/iterator_ops.py:350: Iterator.output_classes (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.data.get_output_classes(iterator)`.
W0618 12:30:10.218892 139639980619584 deprecation.py:323] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/data/ops/iterator_ops.py:350: Iterator.output_classes (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.data.get_output_classes(iterator)`.
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

W0618 12:30:10.324707 139639980619584 lazy_loader.py:50]
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

WARNING:tensorflow:From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:342: calling GlorotUniform.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
W0618 12:30:10.326326 139639980619584 deprecation.py:506] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:342: calling GlorotUniform.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
WARNING:tensorflow:From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:345: calling Constant.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
W0618 12:30:10.326326 139639980619584 deprecation.py:506] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:342: calling GlorotUniform.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
WARNING:tensorflow:From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:345: calling Constant.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
W0618 12:30:10.326584 139639980619584 deprecation.py:506] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:345: calling Constant.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
WARNING:tensorflow:From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py:246: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
W0618 12:30:10.401312 139639980619584 deprecation.py:323] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py:246: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/training/slot_creator.py:193: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
W0618 12:30:11.297271 139639980619584 deprecation.py:323] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/training/slot_creator.py:193: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
2020-06-18 12:30:11.458650: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:05:00.0
2020-06-18 12:30:11.459790: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:06:00.0
2020-06-18 12:30:11.460897: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 2 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:09:00.0
2020-06-18 12:30:11.462003: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 3 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:0a:00.0
2020-06-18 12:30:11.462041: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-06-18 12:30:11.462071: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-06-18 12:30:11.462085: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-06-18 12:30:11.462097: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-06-18 12:30:11.462109: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-06-18 12:30:11.462121: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-06-18 12:30:11.462133: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-18 12:30:11.470539: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0, 1, 2, 3
2020-06-18 12:30:11.470679: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-18 12:30:11.470694: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186]      0 1 2 3
2020-06-18 12:30:11.470699: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 0:   N Y Y Y
2020-06-18 12:30:11.470703: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 1:   Y N Y Y
2020-06-18 12:30:11.470707: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 2:   Y Y N Y
2020-06-18 12:30:11.470710: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 3:   Y Y Y N
2020-06-18 12:30:11.476196: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10478 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:05:00.0, compute capability: 6.1)
2020-06-18 12:30:11.477355: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10481 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:06:00.0, compute capability: 6.1)
2020-06-18 12:30:11.478490: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10481 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:09:00.0, compute capability: 6.1)
2020-06-18 12:30:11.479608: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10481 MB memory) -> physical GPU (device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:0a:00.0, compute capability: 6.1)
D Session opened.
I Could not find best validating checkpoint.
I Could not find most recent checkpoint.
I Initializing all variables.
2020-06-18 12:30:12.233482: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
I STARTING Optimization
Epoch 0 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
2020-06-18 12:30:14.672316: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
Epoch 0 |   Training | Elapsed Time: 0:00:16 | Steps: 33 | Loss: 18.239303
2020-06-18 12:30:30.589204: E tensorflow/stream_executor/dnn.cc:588] CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1778): 'cudnnRNNForwardTrainingEx( cudnn.handle(), rnn_desc.handle(), input_desc.data_handle(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.data_handle(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_desc.handle(), output_c_data->opaque(), nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'
2020-06-18 12:30:30.589243: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at cudnn_rnn_ops.cc:1517 : Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 63, 12, 2048]
Traceback (most recent call last):
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 63, 12, 2048]
         [[{{node tower_0/cudnn_lstm/CudnnRNNV3_1}}]]
  (1) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 63, 12, 2048]
         [[{{node tower_0/cudnn_lstm/CudnnRNNV3_1}}]]
         [[tower_2/CTCLoss/_147]]
1 successful operations.
2 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 968, in run_script
    absl.app.run(main)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 940, in main
    train()
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 608, in train
    train_loss, _ = run_set('train', epoch, train_init_op)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 568, in run_set
    feed_dict=feed_dict)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 63, 12, 2048]
         [[node tower_0/cudnn_lstm/CudnnRNNV3_1 (defined at /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
  (1) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 63, 12, 2048]
         [[node tower_0/cudnn_lstm/CudnnRNNV3_1 (defined at /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
         [[tower_2/CTCLoss/_147]]
1 successful operations.
2 derived errors ignored.

Original stack trace for 'tower_0/cudnn_lstm/CudnnRNNV3_1':
  File "DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 968, in run_script
    absl.app.run(main)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 940, in main
    train()

  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 487, in train
    gradients, loss, non_finite_files = get_tower_results(iterator, optimizer, dropout_rates)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 313, in get_tower_results
    avg_loss, non_finite_files = calculate_mean_edit_distance_and_loss(iterator, dropout_rates, reuse=i > 0)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 240, in calculate_mean_edit_distance_and_loss
    logits, _ = create_model(batch_x, batch_seq_len, dropout, reuse=reuse, rnn_impl=rnn_impl)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 191, in create_model
    output, output_state = rnn_impl(layer_3, seq_length, previous_state, reuse)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 129, in rnn_impl_cudnn_rnn
    sequence_lengths=seq_length)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/layers/base.py", line 548, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 854, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 234, in wrapper
    return converted_call(f, options, args, kwargs)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 439, in converted_call
    return _call_unconverted(f, args, kwargs, options)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 330, in _call_unconverted
    return f(*args, **kwargs)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 440, in call
    training)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 518, in _forward
    seed=self._seed)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 1132, in _cudnn_rnn
    outputs, output_h, output_c, _, _ = gen_cudnn_rnn_ops.cudnn_rnnv3(**args)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_cudnn_rnn_ops.py", line 2051, in cudnn_rnnv3
    time_major=time_major, name=name)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()
@DanBmh
Contributor

DanBmh commented Jun 19, 2020

I have also been getting similar errors lately. In my case it often occurs at the end of an epoch.
Training works normally for a few epochs before I get the error. Mine shows different numbers from yours:

Internal: Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 1101, 30, 2048] 
	 [[{{node tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3}}]]
	 [[tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3/_81]]

Reducing the batch size pushed the error later into training for me, so this may be a workaround you can try.

@andrenatal
Contributor Author

andrenatal commented Jun 19, 2020 via email

@kdavis-mozilla
Contributor

@andrenatal What version of cuDNN are you using? TensorFlow 1.15 depends on CUDA 10.0 and cuDNN v7.6.

@andrenatal
Contributor Author

andrenatal commented Jun 23, 2020

I tried all versions that @reuben suggested, including CuDNN 7.6

@lissyx
Collaborator

lissyx commented Jun 30, 2020

@andrenatal I know you already tested a lot of things, but this forum entry is interesting: https://forums.developer.nvidia.com/t/gpu-crashes-when-running-machine-learning-models/108252

  • the error message is the same
  • it's on gtx 1080 ti
  • issue seems to be related to Python 3.6

Can you give it a spin with Python 3.7 ?

@Shilpil

Shilpil commented Jul 6, 2020

We tried running it with Python 3.7 but we faced the same error.

@lissyx
Collaborator

lissyx commented Jul 6, 2020

We tried running it with Python 3.7 but we faced the same error.

Then I'm sorry, but the only way to get something actionable is to bisect the dataset to identify the offending files and debug from there.
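
For reference, a minimal sketch of one such bisection step, assuming a DeepSpeech-style training CSV with a header row; the file names here are illustrative, not from this thread:

# Sketch: split a training CSV into two halves so each half can be trained on
# separately to narrow down the offending samples. Paths are illustrative.
import csv

def bisect_csv(src, dst_a, dst_b):
    with open(src, newline='') as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)
    mid = len(rows) // 2
    for dst, chunk in ((dst_a, rows[:mid]), (dst_b, rows[mid:])):
        with open(dst, 'w', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(chunk)

bisect_csv('train.csv', 'train_half_a.csv', 'train_half_b.csv')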

@applied-machinelearning

applied-machinelearning commented Jul 6, 2020

@lissyx
As I am also affected by this, I tried everything from Python versions, different Docker builds, and different host drivers to checking my dataset for evident errors; nothing had any effect.

But because, when it fails, it always fails consistently on the same step and thus the same batch, I tried to isolate things.
I now have a small subset of my large dataset that always fails at epoch 27 with batch size 32, so it's under 1500 samples and thus manageable in size.

I made some discoveries though:

  • Training on the CPU has succeeded in the past (not yet repeated on this small dataset).
  • Training on the GPU with batch size 1 has succeeded in the past (not yet repeated on this small dataset).
  • If it fails, it always consistently fails on the same step.
  • So I tried replacing the sorting in the sample loading with a random.shuffle(), and training with cuDNN no longer blows up, even with the whole dataset (about 280,000 samples); a sketch of that change is below.

So it seems that the combination (and probably the order) of certain samples in a batch consistently blows up with cuDNN
(and in any other combination or order, they don't).
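
A minimal sketch of the kind of change described above (the helper name and sample structure are assumptions for illustration, not the actual DeepSpeech v0.7.x code, where the sort lives in the training data-feeding module):

# Sketch: where the training samples would normally be sorted by wav_filesize,
# shuffle them instead.
import random

def order_samples(samples, shuffle=True):
    # samples: list of dicts with at least a 'wav_filesize' key
    if shuffle:
        random.shuffle(samples)                          # workaround: random order
    else:
        samples.sort(key=lambda s: s['wav_filesize'])    # default: sort by file size
    return samples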

I think the dataset subset is small enough to share with you (around 20 MB of samples), if that could help you determine why it actually blows up.
(I can also provide the Docker build script, run script, logging, and the patches I applied to the v0.7.4 tree: only the printing of files in the batches and replacing the sort with the shuffle.)

@lissyx
Collaborator

lissyx commented Jul 6, 2020

I think the dataset subset is small enough to provide you with (around 20mb of samples), if that could help you determine as to why it actually blows up.

If it's a bug in TensorFlow / cuDNN, it's hardly something we can help with. I'm already lacking time for a lot of other urgent matters, and it seems you have more background and knowledge on this issue than I do ...

@lissyx
Collaborator

lissyx commented Jul 6, 2020

* So I tried  with the sorting from the sample loading replaced with a random.shuffle(), and training with CUDNN now doesn't blow up. Even with the whole dataset (about 280000 samples).

It would still be interesting if you could share the order when it works, when it fails, and where it fails.

@applied-machinelearning

I think the dataset subset is small enough to provide you with (around 20mb of samples), if that could help you determine as to why it actually blows up.

If it's a bug in TensorFlow / CUDNN, it's hardly something we can help about. I'm already lacking time for a lot of other urgents matters, and it seems you have more background and knowledge on the issue than I do ...

I have merely reduced the problem space; I don't have more knowledge of the TensorFlow / DeepSpeech internals.
And it would be nice if other people could confirm (so it can be semi-worked around by not sorting).

Another question: I saw that the inference side of DeepSpeech now seems to work on TensorFlow 2.x; how much work would the training side be?
(I ask since the whole chain of CUDA 10, TensorFlow 1.15, etc. is probably no longer supported by NVIDIA, so we probably won't get any support from that side either, and several people are now reporting issues with training on current DeepSpeech in this thread ...)

@applied-machinelearning

applied-machinelearning commented Jul 6, 2020

* So I tried  with the sorting from the sample loading replaced with a random.shuffle(), and training with CUDNN now doesn't blow up. Even with the whole dataset (about 280000 samples).

It would still be interesting if you could share the order when it works, when it fails, and where it fails.

What would you like to have shared: only the CSV, or also the samples? (I think the problem is somewhere in the samples and not in the transcripts, but of course I could be wrong.)

@lissyx
Collaborator

lissyx commented Jul 6, 2020

Another question is, I saw the inference side of DeepSpeech seems to work now on tensorflow 2.x, how much work would the training side be ?

@reuben had a look at that; he knows better.

What would you like to have shared, only the csv or also the samples (as I think it would be somewhere in the samples and not the transcripts (but of course I could be wrong) ?

I think you would need to share the audio + CSV.

Merely reduced the problem-space, not of the tensorflow / deepspeech internals.
And it would be nice if people could confirm (so it can be semi-worked around by not sorting).

Sure, but given the current workload, I really cannot promise having time to reproduce that: I am still lagging behind on a lot of other super-urgent matters, sadly (thank you COVID-19).

@reuben
Contributor

reuben commented Jul 6, 2020

Another question is, I saw the inference side of DeepSpeech seems to work now on tensorflow 2.x, how much work would the training side be ?

Lots

@applied-machinelearning

applied-machinelearning commented Jul 6, 2020

Another question is, I saw the inference side of DeepSpeech seems to work now on tensorflow 2.x, how much work would the training side be ?

Lots
That is unfortunate.

Another question is, I saw the inference side of DeepSpeech seems to work now on tensorflow 2.x, how much work would the training side be ?

@reuben Had a look at that, he knows better.

What would you like to have shared, only the csv or also the samples (as I think it would be somewhere in the samples and not the transcripts (but of course I could be wrong) ?

I think you should need to share audio + csv

OK, will do.

Merely reduced the problem-space, not of the tensorflow / deepspeech internals.
And it would be nice if people could confirm (so it can be semi-worked around by not sorting).

Sure, but given the current workload, I really cannot promise having time to reproduce that: I am still lagging behind a lot of other super-urgents matters, sadly (thank you covid-19).

OK, I will do some more experiments then and try to pinpoint it further:
find out whether only the batch content matters, or also the state the graph/weights are in from the previous steps.
If only the batch content matters, I will test what happens if you only shuffle that.

@applied-machinelearning

applied-machinelearning commented Jul 7, 2020

@lissyx @reuben

Got the results of my extended testing, based on a minimal dataset of 3 × 32 samples; since I use a batch size of 32, that is 3 steps. I named the batches A, B and C, and as a whole they are ordered by wav_filesize.

I have done runs with all sorts of combinations of these batches (concatenated in the order given by the name of the CSV file). If a batch is appended with an "s", that batch itself is still ordered by wav_filesize; if appended with an "r", that batch is randomly shuffled. The runs do 3 epochs.
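
A minimal sketch of how such combined CSVs could be assembled (the file names and per-batch CSV layout are assumptions matching the naming used in this comment, not the actual scripts from the attachment):

# Sketch: concatenate per-batch CSVs (A, B, C) in a chosen order, optionally
# shuffling the rows within a batch ("r") or keeping them sorted ("s").
import csv
import random

def build_run_csv(batch_files, shuffled, out_path):
    # batch_files: e.g. ['A.csv', 'B.csv', 'C.csv']; shuffled: e.g. [False, True, False]
    header, all_rows = None, []
    for path, shuf in zip(batch_files, shuffled):
        with open(path, newline='') as f:
            reader = csv.reader(f)
            header = next(reader)
            rows = list(reader)
        if shuf:
            random.shuffle(rows)
        all_rows.extend(rows)
    with open(out_path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(all_rows)

# Example: As_Br_Cs -> A sorted, B shuffled, C sorted
build_run_csv(['A.csv', 'B.csv', 'C.csv'], [False, True, False], 'train_debug_As_Br_Cs.csv')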

In the tar.gz file I included:

  • data dir with all the csv's used and subdirs A, B and C with the corresponding wav files.
  • log dir with all the log files that came out of this test run.
  • docker dir with the build file for the Docker image (which is a slight adaptation of the one in the DeepSpeech repo).
  • patches dir with the two patches I applied on to v0.7.4, one to print out the filenames in a batch and the other one to keep the sorting of the train csv's as is.

As a summary of the results:

train_debug_Ar_Br_Cr.csv, blows up in step 1, which is batch B
train_debug_Ar_Br_Cs.csv, blows up in step 1, which is batch B
train_debug_Ar_Bs_Cs.csv, blows up in step 1, which is batch B
train_debug_As_Br_Cs.csv, blows up in step 1, which is batch B
train_debug_As_Bs_Cs.csv, blows up in step 1, which is batch B
train_debug_As_Cs_Bs.csv, blows up in step 2, which is batch B
train_debug_As_Cs.csv, OK
train_debug_Bs_Cs.csv, OK
train_debug_Cs_As_Bs.csv, blows up in step 2, which is batch B
train_debug_Cs_As.csv, OK
train_debug_Cs_Bs_As.csv, blows up in step 1, which is batch B
train_debug_Cs_Bs.csv, blows up in step 1, which is batch B
train_debug_interbatch_random: All variants: OK

My interpretation of these results:

  1. If it blows up, it is always at a step with batch B.
  2. It always blows up with the contents of this batch B, unless batch B is the very first step.
  3. The order of the files within batch B doesn't matter.
  4. It happens independently of the previous batches/steps (with the exception of B being the first batch).
  5. All inter-batch randomized variants run fine.

But what is so special about the content of batch B that it blows up with CUDNN ...

(Before you ask: it is not only this batch B; there are multiple such batches in my large datasets. This is one example with the shortest samples.)
deepspeech_v0.7.4_cudnn_debug.tar.gz

@lissyx
Collaborator

lissyx commented Jul 7, 2020

Nice @applied-machinelearning. Do you think you could reduce batch B even further, to a smaller set of files? If we knew which file(s) trigger the behavior, it might be easier to know about / check.

@applied-machinelearning

Nice @applied-machinelearning. Do you think you could even reduce batch B to a smaller set of files ? Maybe if we can know which file(s) triggers the behavior it might be easier to know about / check ?

I could try reducing the training batch size and see if I can find even smaller batches that fail (from previous tests I think it will end at either 2 or 4, but not 1). I will give it a try tomorrow.

@lissyx
Collaborator

lissyx commented Jul 8, 2020

As I am also effected by this, I tried everything from python versions, different dockerbuild, different host drivers, checking my dataset for evident errors, all had no effect.

So I see you are basically reusing the TensorFlow official Docker image and you got inspiration from https://github.com/Common-Voice/commonvoice-fr/blob/master/DeepSpeech/Dockerfile.train :)

That's good, that should make it easier for us to try and reproduce locally. Can you share more details on your actual underlying system and hardware, in case it might be related?

@applied-machinelearning

@lissyx @reuben

OK I have done some more runs:

I ran train_debug_As_Bs_Cs.csv with batch sizes 1 and 2:

Batch size 1 trains fine.
Batch size 2 blows up on the step with files:
B/98_2923_a387275540ba5f2159c37eaee3e4e9a0-651926517a6241fd9bb5942777b1f0ff.wav
B/154_4738_2f841fb1af523c579414e0358ab16295-6aea9aa95b1bdbfd80703754cd8a180c.wav

So I made some new csv files with:

batch A: two files from the original batch A
batch B: two files B/98_2923 and B/154_4738 from batch B
batch C: two files from the original batch C

And I made some variants of that:

train_debug_mini_As_Bs_Cs.csv
train_debug_mini_Bs_As_Cs.csv
train_debug_mini_Bs_As_Cs_B_swapped.csv
train_debug_mini_As_Bs_Cs_B_swapped.csv
train_debug_mini_As_Bs_Cs_B_mixed_A.csv
train_debug_mini_As_Bs_Cs_B_mixed_C.csv
train_debug_mini_As_Bs_Cs_B_mixed_C_2.csv
train_debug_mini_As_Bs_Cs_B_swapped_C_mixed.csv

The results of that:

With batch size 1, these all work out fine (as expected).
With batch size 2:
train_debug_mini_As_Bs_Cs.csv
    blows up in step 1, which is batch B.

train_debug_mini_As_Bs_Cs_B_swapped.csv
    blows up in step 1, which is batch B, so swapping the order within B doesn't make a difference.

train_debug_mini_Bs_As_Cs.csv
    works fine, B is the first step 0.
    as expected as the first step seems to be a special case.

train_debug_mini_Bs_As_Cs_B_swapped.csv
    works fine, B is the first step 0, so swapping the order in B doesn't make a difference.
    as expected as the first step seems to be a special case.

train_debug_mini_As_Bs_Cs_B_mixed_A.csv
    blows up in step 1, which is:
        A/155_4757
        B/154_4738

train_debug_mini_As_Bs_Cs_B_mixed_C.csv
    blows up in step 1, which is:
        B/98_2923
        C/169_5271

train_debug_mini_As_Bs_Cs_B_mixed_C_2.csv
    blows up in step 1, which is:
        C/169_5271
        B/98_2923

train_debug_mini_As_Bs_Cs_B_swapped_C_mixed.csv
    blows up in step 2, which is:
        B/98_2923
        C/169_5271

    while it did complete step 1, which is:
        B/154_4738
        C/175_5429

My interpretation of this all:

  • batch size 1 always works, so it is not completely file specific
  • with batch size 2 both B/98_2923 and B/154_4738 appear in blowups.
  • with batch size 2 B/154_4738 appears in both a blowup and a succeeded step.
  • from the previous experiments we know that when you mix batch B into a much larger pool of (more different) files, all works out well.

So it is a bit odd; I'm starting to wonder whether this is some edge case where we hit a math operation that blows up.
But both files from B have slightly different file sizes, and both blow up in combinations with other files of slightly different file sizes (from A and C).

So I'm a bit lost now; you have more insight into how things get processed, so hopefully you have some more ideas based on that.
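
A minimal sketch of one way to compare the basic audio properties of the suspect files (the paths are the B files named above, assumed to be relative to the shared data directory):

# Sketch: print file size, sample rate, channel count, frame count and duration
# for the two suspect wav files, to see whether anything stands out.
import os
import wave

suspects = [
    "B/98_2923_a387275540ba5f2159c37eaee3e4e9a0-651926517a6241fd9bb5942777b1f0ff.wav",
    "B/154_4738_2f841fb1af523c579414e0358ab16295-6aea9aa95b1bdbfd80703754cd8a180c.wav",
]

for path in suspects:
    with wave.open(path, "rb") as w:
        frames = w.getnframes()
        rate = w.getframerate()
        print(f"{path}: {os.path.getsize(path)} bytes, {rate} Hz, "
              f"{w.getnchannels()} ch, {frames} frames, {frames / rate:.3f} s")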

CSVs and logs are attached (the sample files from the previous post can be used).
train_debug_mini.tar.gz

@applied-machinelearning

As I am also effected by this, I tried everything from python versions, different dockerbuild, different host drivers, checking my dataset for evident errors, all had no effect.

So I see you are basically reusing the TensorFlow official Docker image and you got inspiration from https://github.com/Common-Voice/commonvoice-fr/blob/master/DeepSpeech/Dockerfile.train :)
I think it was an Italian DS/CV repo I drew inspiration from, but they probably took it from the French one ;).
Previously I also tried with a docker build with ubuntu18.04-cuda10 image as a base, with tensorflow-gpu 1.15.3.

That's good, that should make it easier for us to try and reproduce locally. Can you share more details on your actual underlying system and hardware, in case it might be related?

Host is an AMD Ryzen system with 32GB of mem and a GTX 1070 with 8GB of mem, running Debian.
Host Nvidia driver is now 440.100 (but I have tried several others, still the same problems).
If you need more specifics, please indicate what further info you need.

Thanks for looking into it !

@lissyx
Collaborator

lissyx commented Jul 8, 2020

As I am also effected by this, I tried everything from python versions, different dockerbuild, different host drivers, checking my dataset for evident errors, all had no effect.

So I see you are basically reusing the TensorFlow official Docker image and you got inspiration from https://github.com/Common-Voice/commonvoice-fr/blob/master/DeepSpeech/Dockerfile.train :)
I think it was an Italian DS/CV repo I drew inspiration from, but they probably took it from the French one ;).
Previously I also tried with a docker build with ubuntu18.04-cuda10 image as a base, with tensorflow-gpu 1.15.3.

That's good, that should make it easier for us to try and reproduce locally. Can you share more details on your actual underlying system and hardware, in case it might be related?

Host is an AMD Ryzen system with 32GB of mem and a GTX 1070 with 8GB of mem, running Debian.
Host Nvidia driver is now 440.100 (but I have tried several others, still the same problems).
If you need more specifics, please indicate what info you need more.

Thanks for looking into it !

Thanks; I'm running Sid here as well, so I'm on a similar setup, except I have 2x (faster, more memory) GPUs. I hope it will still allow me to repro.

@applied-machinelearning

applied-machinelearning commented Jul 9, 2020

Thanks, running Sid as well here, so I'm on similar setup, except I have 2x (faster, more memory) GPUs. I hope it will still allow me to repro.

I'm running buster on that machine. When I woke up this morning it dawned on me that I had forgotten to post the hyperparameter stuff.
So attached is the script I used in the Docker container to run the tests. The feature cache, checkpoint dir, etc. all get cleaned up before the run.

run_deepspeech_var_batchsize.sh.tar.gz

I hope you can reproduce and spot something !

@lissyx
Collaborator

lissyx commented Jul 9, 2020

Thanks, running Sid as well here, so I'm on similar setup, except I have 2x (faster, more memory) GPUs. I hope it will still allow me to repro.

I'm running buster on that machine, when I woke up this morning it dawned on me I forgot to post the hyperparameter stuff.
So attached is the script I used in the docker container to run the tests. Feature cache, checkpoint dir etc, all get cleaned up before the run.

run_deepspeech_var_batchsize.sh.tar.gz

I hope you can reproduce and spot something !

Looks like clean.sh is missing, and I also get FATAL Flags parsing error: flag --alphabet_config_path=./data/lm/plaintext_alpha.txt: The file pointed to by --alphabet_config_path must exist and be readable. I don't want to sound rude, but could you assemble a dump-proof Docker image or script that minimally reproduces the issue? There are already enough complexities and variables interacting, and I really need to be 1000% sure I am reproducing your exact steps to assert whether I can reproduce the issue :/

@lissyx
Collaborator

lissyx commented Jul 9, 2020

I'm not even able to get CUDA working so far in the dockerfile :/

@lissyx
Collaborator

lissyx commented Jul 9, 2020

I'm not even able to get CUDA working so far in the dockerfile :/

Seems to be the same old weird nvidia/cuda/docker bug, after ldconfig it works:

tf-docker ~ > sudo ldconfig
tf-docker ~ > nvidia-smi 
Thu Jul  9 10:16:44 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100      Driver Version: 440.100      CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:21:00.0 Off |                  N/A |
|  0%   34C    P8     1W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  On   | 00000000:4B:00.0 Off |                  N/A |
|  0%   35C    P8    20W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
tf-docker ~ > python -c "import tensorflow as tf; tf.test.is_gpu_available()"
2020-07-09 10:16:48.233166: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-07-09 10:16:48.264242: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2900325000 Hz
2020-07-09 10:16:48.271101: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5d55f00 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-07-09 10:16:48.271144: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-07-09 10:16:48.272884: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-07-09 10:16:54.029647: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-09 10:16:54.046529: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-09 10:16:54.047194: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5d58840 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-07-09 10:16:54.047218: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-07-09 10:16:54.047253: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (1): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-07-09 10:16:54.047656: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-09 10:16:54.048468: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties: 
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:21:00.0
2020-07-09 10:16:54.048551: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-09 10:16:54.049324: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 1 with properties: 
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:4b:00.0
2020-07-09 10:16:54.049585: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-07-09 10:16:54.057643: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-07-09 10:16:54.061562: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-07-09 10:16:54.066658: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-07-09 10:16:54.077684: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-07-09 10:16:54.081287: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-07-09 10:16:54.107985: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-07-09 10:16:54.108254: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-09 10:16:54.109206: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-09 10:16:54.110043: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-09 10:16:54.110885: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-09 10:16:54.111644: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0, 1
2020-07-09 10:16:54.111707: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-07-09 10:16:54.113783: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-09 10:16:54.113802: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186]      0 1 
2020-07-09 10:16:54.113811: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 0:   N N 
2020-07-09 10:16:54.113821: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 1:   N N 
2020-07-09 10:16:54.113979: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-09 10:16:54.114808: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-09 10:16:54.115627: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-09 10:16:54.116444: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/device:GPU:0 with 10311 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:21:00.0, compute capability: 7.5)
2020-07-09 10:16:54.117023: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-09 10:16:54.117508: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/device:GPU:1 with 10311 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:4b:00.0, compute capability: 7.5)

@lissyx
Copy link
Collaborator

lissyx commented Jul 9, 2020

@applied-machinelearning Good news, I repro your issue.

@lissyx
Copy link
Collaborator

lissyx commented Jul 9, 2020

@applied-machinelearning Not only do I repro it, but apt update && apt upgrade changes the issue: first it was exploding at epoch 1, now at epoch 2.

@lissyx
Copy link
Collaborator

lissyx commented Jul 9, 2020

Several people report a similar issue with NVIDIA drivers above a certain version: tensorflow/tensorflow#35950 (comment); 431.36 is reported as a working one.
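For reference, a quick way to check which driver a machine is actually running before comparing it against the versions discussed upstream. This is a minimal sketch, not part of DeepSpeech; it only assumes nvidia-smi is on the PATH:

# Minimal sketch: print the installed NVIDIA driver version via nvidia-smi.
# Assumes nvidia-smi is available on the PATH; works on Python 3.5+.
import subprocess

driver = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    universal_newlines=True,
).strip()
print("NVIDIA driver:", driver)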

@applied-machinelearning
Copy link

applied-machinelearning commented Aug 5, 2020

I think you are thinking about setting this environment var from the DS code as a workaround for not getting a TF 1.15.4 release?
At the least we should know whether it's a good idea to point people debugging this at that workaround, or whether we would be creating underlying issues.

(I also think it's not very wise to keep TF 1.15 that broken in the first place: it wastes a lot of resources everywhere, with people seeing their training go bust and then having to debug it again, for every project and person still using TF 1.15 with LSTMs, while the fix itself is straightforward and simple. A release would also be a nice "reward" for digging into this and fixing a bug that went uncaught for so many releases. But that is my not so humble opinion about it.)

Sure, but it's not in our hands nor in the hands of people who will review the PR, there's a policy and they might have their hands tied.

That's true; perhaps it's my Dutch heritage, but policies are nice when they make sense ;)
I'm also puzzled by how little help you get with implementing the requested test, which is essentially blocking the patch;
most open-source communities I have encountered so far are happy when you fix, or even just pinpoint, a long-standing bug.

Back to the environment var:
If I remember correctly from looking at the code, it influences some kind of "dropout" and, as a side effect, busts the cache (which is what makes things work for us), but I don't know what influence changing that specific dropout behavior has on training the model. It would be nice if the TF / NVIDIA folks could comment on that before we perhaps degrade DS training through missed side effects.

Exactly.

By the way, I'm wondering: do you know how often we still use the cached version in your larger dataset test?
The difference of 20 seconds is so small that either:

  • DS doesn't get to use a cached version very often on a real-life dataset
  • caching isn't very effective and costs about as much as not caching
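For anyone who wants to apply that workaround from the training code itself rather than from the shell, a minimal sketch (mine, not DeepSpeech code). It assumes the variable only needs to be present in the process environment before TensorFlow initializes its cuDNN RNN support, so it is set before the first tensorflow import:

import os

# Assumption: the variable must be in the process environment before TensorFlow
# initializes its cuDNN RNN support, so set it before importing tensorflow.
os.environ.setdefault("TF_CUDNN_RESET_RND_GEN_STATE", "1")

import tensorflow as tf  # imported only after the environment is prepared
print(tf.__version__)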

@lissyx
Copy link
Collaborator

lissyx commented Aug 5, 2020

That's true; perhaps it's my Dutch heritage, but policies are nice when they make sense ;)

Well, even fixing a ruy computation issue on the just-released r2.2 was not taken, and it was only merged on master.

@lissyx
Copy link
Collaborator

lissyx commented Aug 6, 2020

I'm also puzzled by how little help you get with implementing the requested test, which is essentially blocking the patch;
most open-source communities I have encountered so far are happy when you fix, or even just pinpoint, a long-standing bug.

Well, I can understand why they want that; I guess in their position I'd do the same. Things look like they are moving now, and I hope this can go into a 1.15.4; in the worst case, we need a statement on the consequences of the flag.

@lissyx
Copy link
Collaborator

lissyx commented Aug 7, 2020

The fix landed upstream: tensorflow/tensorflow#41832

@lissyx
Copy link
Collaborator

lissyx commented Aug 18, 2020

We still have no feedback on whether a 1.15.4 can be issued for that.

@applied-machinelearning

Perhaps we should try to stage it as a multi-stage rocket:

  1. First get the patch applied to the upstream tensorflow r1.15 branch; since the bug was filed against that branch this seems reasonable, and as a bonus the patch applies cleanly.
  2. Then try to get a release cut from that branch.
  3. If we don't get a release, we could try to get it applied to mozilla-tensorflow.
  4. And perhaps even provide a prebuilt Docker base image for training, based on the Dockerfile.build.tmpl file, and publish it on Docker Hub?

@lissyx
Copy link
Collaborator

lissyx commented Aug 24, 2020

  • First get the patch applied to the upstream tensorflow r1.15 branch; since the bug was filed against that branch this seems reasonable, and as a bonus the patch applies cleanly.

  • Then try to get a release cut from that branch.

(1) and (2) go together; it won't get picked up on r1.15 if they don't intend to ship a 1.15.4

If we don't get a release, we could try to get it applied to mozilla-tensorflow.

What for? Supporting tensorflow wheel builds is a huge task; we stopped doing that as soon as we could.

And perhaps even provide an prebuild docker base image for training based on the Dockerfile.build.tmpl file and publish that on docker hub ?

Same: that requires us to build and support a TensorFlow wheel, which is a lot of work.

@applied-machinelearning
  • First get the patch applied to the upstream tensorflow r1.15 branch; since the bug was filed against that branch this seems reasonable, and as a bonus the patch applies cleanly.
  • Then try to get a release cut from that branch.

(1) and (2) go together; it won't get picked up on r1.15 if they don't intend to ship a 1.15.4

If I look at https://github.com/tensorflow/tensorflow/commits/r1.15 I do see some commits (not direct bug fixes) after 1.15.3 without an immediate release, and even some very recent commits.

If we don't get a release, we could try to get it applied to mozilla-tensorflow.

What for? Supporting tensorflow wheel builds is a huge task; we stopped doing that as soon as we could.

And perhaps even provide an prebuild docker base image for training based on the Dockerfile.build.tmpl file and publish that on docker hub ?

Same: that requires us to build and support a TensorFlow wheel, which is a lot of work.

It depends a bit on what you provide. For the 2.x branches I do agree, but since there has been very little (relevant) movement on the 1.15 branch, it wouldn't require much (or even any) rebuilding, since nothing changes. The question is also whether you should build for every target: if it's only the most common one, x86 with just the Python version from the Ubuntu CUDA dev image, it is all fairly limited but covers the common training case.

@lissyx
Copy link
Collaborator

lissyx commented Aug 24, 2020

If I look at https://github.com/tensorflow/tensorflow/commits/r1.15 I do see some commits (not direct bug fixes) after 1.15.3 without an immediate release, and even some very recent commits.

Then maybe they are considering a 1.15.4 ?

It depends a bit on what you provide. For the 2.x branches I do agree, but since there has been very little (relevant) movement on the 1.15 branch, it wouldn't require much (or even any) rebuilding, since nothing changes. The question is also whether you should build for every target: if it's only the most common one, x86 with just the Python version from the Ubuntu CUDA dev image, it is all fairly limited but covers the common training case.

You are highly underestimating:

  • the amount of work it requires to ship a tensorflow release, especially since it requires quite a lot of CI-related changes
  • the amount of work we can realistically take on in the current context

Just building r1.15 for the purpose of those debugging steps took several local hacks. Re-using TensorFlow's CI Docker stuff also required a non-trivial amount of work.

@andrenatal
Copy link
Contributor Author

andrenatal commented Aug 30, 2020

I confirm that the flag addressed my issues and allowed me to train and end up with a fully functioning model.

@lissyx
Copy link
Collaborator

lissyx commented Sep 21, 2020

There has been quite a lot of activity on the r1.15 branch of TensorFlow; I think we can safely hope for a 1.15.4 that ships with the fix (current upstream r1.15 has merged it). I'll close this issue when 1.15.4 ships.

lissyx pushed a commit to lissyx/STT that referenced this issue Sep 25, 2020
@lissyx lissyx closed this as completed in 16165f3 Sep 25, 2020
lissyx added a commit that referenced this issue Sep 25, 2020
Fix #3088: Use TensorFlow 1.15.4 with CUDNN fix
lissyx added a commit that referenced this issue Sep 25, 2020
Fix #3088: Use TensorFlow 1.15.4 with CUDNN fix
@DanBmh
Copy link
Contributor

DanBmh commented Oct 1, 2020

Still not working for me with an up-to-date master and a newly created docker container.
But as mentioned somewhere above, running export TF_CUDNN_RESET_RND_GEN_STATE=1 solved my problem.

@lissyx
Copy link
Collaborator

lissyx commented Oct 1, 2020

Still not working for me with an up-to-date master and a newly created docker container.

Can you triple check if you run 1.15.4 ?

But as mentioned somewhere above, running export TF_CUDNN_RESET_RND_GEN_STATE=1 solved my problem.

Maybe there are some other bugs. As you can see, it was quite painful to investigate even with a small repro dataset. Unfortunately I'm not in a position to spend that kind of time investigating anymore for the foreseeable future.

@DanBmh
Copy link
Contributor

DanBmh commented Oct 1, 2020

Can you triple check if you run 1.15.4 ?

Running python3 -c 'import tensorflow as tf; print(tf.__version__)' gives me exactly 1.15.4.

Unfortunately I'm not in a position to spend that kind of time investigating anymore for the foreseeable future.

No problem for me; the solution is easy, so I will just add the extra flag everywhere.


Not sure if this helps, but for me the error always gets thrown in the validation phase; the first training epoch finishes without errors.
This also happens if I switch the train and dev datasets, so I don't think the problem lies in the dataset here.

@lissyx
Copy link
Collaborator

lissyx commented Oct 1, 2020

I think @applied-machinelearning mentioned something like that on the upstream issue?

@applied-machinelearning
Copy link

Yeah, it is still on my todo list, but I have also still seen the error at least once.
I think you can still get a cache hit while other fields in the descriptor differ (from memory, I thought rnn_mode was a likely candidate).

I think the pattern for this is when you have the same sequence lengths etc. in both the train and dev sets. It should be easy to test (just use the same csv, with the same ordering, for both the train and dev datasets), but I haven't gotten around to actually doing it. I hope to get to testing this tomorrow or this weekend.

I'm still wondering whether the whole caching idea doesn't do more harm than good.
It seems error-prone, and if you need to check every element, the cost of checking each time seems non-negligible (as your test seemed to indicate, where you didn't find much difference in training times with or without the TF_CUDNN_RESET_RND_GEN_STATE env var).

Unfortunately there was no reaction from the NVIDIA guy; it seems a new report is needed. I will open one after testing.

But perhaps it is still a good idea to implement setting the environment var from the deepspeech training code anyway?
I don't think there will be a TensorFlow (1.15.5) release any time soon, and most certainly not before a probable deepspeech 1.0 release.

@applied-machinelearning
Copy link

Hmm, unfortunately I can't reproduce it with what I thought could trigger it (running training and validation on the same CSVs, sorted by wav size). :(
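In case someone else wants to repeat that attempt, a small sketch of what "the same CSV, sorted by wav size, for both train and dev" could look like. It assumes the usual DeepSpeech CSV columns wav_filename, wav_filesize, transcript; the file names are placeholders:

# Sketch: sort a DeepSpeech-style CSV by wav_filesize and write it out, so the
# same ordered file can be passed to both --train_files and --dev_files.
import csv

def sort_by_wav_size(src_path, dst_path):
    with open(src_path, newline="") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames
        rows = list(reader)
    rows.sort(key=lambda row: int(row["wav_filesize"]))
    with open(dst_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

sort_by_wav_size("train.csv", "train_sorted_by_size.csv")  # placeholder file names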

@piraka9011
Copy link
Contributor

piraka9011 commented Oct 24, 2020

It was very interesting following this thread! Learned a lot!
Wanted to confirm that the suggested fix works:
System Specs: Ubuntu 18.04, Nvidia Driver 410.104, Cuda 10.0, CUDNN 7.6.5, Ryzen 3700x, Nvidia GTX 1080

Added export TF_CUDNN_RESET_RND_GEN_STATE=1 and made sure the number of training samples is divisible by my training batch size.

I didn't notice any significant loss in performance.

Edit: I am using the tensorflow/tensorflow:1.15.4-gpu-py3 Docker image as well
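If someone wants to follow the same approach, here is a throwaway helper (hypothetical, not part of DeepSpeech) that picks the largest batch size not above a preferred value that divides the sample count evenly, so no partial batch is left at the end:

# Hypothetical helper: choose a batch size that leaves no partial batch at the end.
def even_batch_size(num_samples, preferred):
    for size in range(preferred, 0, -1):
        if num_samples % size == 0:
            return size
    return 1

print(even_batch_size(11892, 12))  # -> 12, since 12 divides 11892 exactly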

@lissyx
Copy link
Collaborator

lissyx commented Oct 24, 2020

It was very interesting following this thread! Learned a lot!
Wanted to confirm that the suggested fix works:
System Specs: Ubuntu 18.04, Nvidia Driver 410.104, Cuda 10.0, CUDNN 7.6.5, Ryzen 3700x, Nvidia GTX 1080

Added export TF_CUDNN_RESET_RND_GEN_STATE=1 and made sure the number of training samples is divisible by my training batch size.

I didn't notice any significant loss in performance.

You should not need those with TensorFlow 1.15.4.

@gauravgund
Copy link

Reducing the batch size from 64 to 32 for training and from 32 to 16 for the test and dev data solved this issue.

@lissyx
Copy link
Collaborator

lissyx commented Apr 2, 2021

Unfortunately, I am seeing for myself that the fix does not cover all cases: TF_CUDNN_RESET_RND_GEN_STATE=1 is required to train a DeepSpeech v0.9.3 model (batch size 8, K40m GPU with 12 GB VRAM) on the Breton release of Common Voice v6.1. cc @ftyers

@cesmile
Copy link

cesmile commented Oct 20, 2021

I also witnessed this, and I found it is related to the memory usage of the Python process.
I have a 12 GB graphics card, and if Python uses more than 12 GB, the error occurs.
You can see the memory usage of Python in the Windows Task Manager.
So I reduced my batch size to lower the memory usage, and used TF_CUDNN_RESET_RND_GEN_STATE=1 to solve the problem.
Hope this can help to figure out the problem.
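For reference, the TF 1.x setting that DeepSpeech's --use_allow_growth flag corresponds to, as far as I understand it, looks roughly like the sketch below; it stops TensorFlow from grabbing all GPU memory up front, but lowering the batch size is still what reduces the peak usage:

# Rough TF 1.x sketch: let GPU memory grow on demand instead of pre-allocating it all.
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True

with tf.Session(config=config) as session:
    pass  # build and run the training graph here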

@wasertech
Copy link

Had a similar issue when training a small model for Romansh (<15 h). It turns out lowering the batch size wasn't enough (TF_CUDNN_RESET_RND_GEN_STATE=1 was already set, as I'm using a docker image).
My fix was to lower top_k for the LM so that it is smaller than the number of unique words (from 500 000 to 10 500 in my case 🥲).

Hope this can help someone stuck with this error.
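A quick way to see how many unique words a corpus actually contains before picking top_k (a minimal sketch; "corpus.txt" is a placeholder for the text file fed to the LM/scorer build):

# Sketch: count unique words in the LM corpus so top_k can be set below that number.
from collections import Counter

with open("corpus.txt", encoding="utf-8") as corpus:  # placeholder file name
    counts = Counter(word for line in corpus for word in line.split())

print("unique words:", len(counts))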
