Unable to train DeepSpeech on Tuda-De dataset. #3210

Closed
AASHISHAG opened this issue Aug 2, 2020 · 4 comments

Comments

@AASHISHAG

  • Have I written custom code (as opposed to running examples on an unmodified clone of the repository): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
  • TensorFlow installed from (our builds, or upstream TensorFlow): pip install tensorflow-gpu==1.15.2
  • TensorFlow version (use command below): v1.15.0-92-g5d80e1e 1.15.2
  • Python version: Python 3.6.10 :: Anaconda, Inc.
  • CUDA/cuDNN version: CUDA Version 10.0.130

Thank you for the great repo.

I am trying to train a German DeepSpeech model. I am using the pre-processing scripts from the bin folder and can train the model successfully on the Common Voice and Mailabs datasets. However, when I try to train the model on the Tuda-De dataset, I get the exceptions below.
Could you please help me fix the issue?

(deepspeech_v0.7.4) agarwal@LTLab.lan@wika:~/deepspeech_v0.7.4$ python DeepSpeech.py --train_files ../german-speech-corpus/tuda-de/data_prepared_mozilla_v0.7.4/tuda-v2-train.csv --dev_files ../german-speech-corpus/tuda-de/data_prepared_mozilla_v0.7.4/tuda-v2-dev.csv --test_files ../german-speech-corpus/tuda-de/data_prepared_mozilla_v0.7.4/tuda-v2-test.csv --alphabet_config_path ../dependencies_v0.7.4/swiss-german/alphabet.txt --scorer ../dependencies_v0.7.4/swiss-german/kenlm.scorer --test_batch_size 36 --train_batch_size 24 --dev_batch_size 36 --epochs 30 --learning_rate 0.0001 --dropout_rate 0.25 --early_stop True --es_epochs 5 --train_cudnn --checkpoint_dir checkpoints_experiments2/tmp/
I Could not find best validating checkpoint.
I Could not find most recent checkpoint.
I Initializing all variables.
I STARTING Optimization
Epoch 0 |   Training | Elapsed Time: 0:00:16 | Steps: 1 | Loss: 191.423950
Traceback (most recent call last):
  File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 61, 24, 2048]
         [[{{node tower_0/cudnn_lstm/CudnnRNNV3_2}}]]
  (1) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 61, 24, 2048]
         [[{{node tower_0/cudnn_lstm/CudnnRNNV3_2}}]]
         [[tower_0/CTCLoss/_115]]
0 successful operations.
2 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/media/data/LTLab.lan/agarwal/deepspeech_v0.7.4/training/deepspeech_training/train.py", line 955, in run_script
    absl.app.run(main)
  File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/media/data/LTLab.lan/agarwal/deepspeech_v0.7.4/training/deepspeech_training/train.py", line 927, in main
    train()
  File "/media/data/LTLab.lan/agarwal/deepspeech_v0.7.4/training/deepspeech_training/train.py", line 595, in train
    train_loss, _ = run_set('train', epoch, train_init_op)
  File "/media/data/LTLab.lan/agarwal/deepspeech_v0.7.4/training/deepspeech_training/train.py", line 560, in run_set
    feed_dict=feed_dict)
  File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 61, 24, 2048]
         [[node tower_0/cudnn_lstm/CudnnRNNV3_2 (defined at /home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
  (1) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 61, 24, 2048]
         [[node tower_0/cudnn_lstm/CudnnRNNV3_2 (defined at /home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
         [[tower_0/CTCLoss/_115]]
0 successful operations.
2 derived errors ignored.

Original stack trace for 'tower_0/cudnn_lstm/CudnnRNNV3_2':
  File "DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/media/data/LTLab.lan/agarwal/deepspeech_v0.7.4/training/deepspeech_training/train.py", line 955, in run_script
    absl.app.run(main)
  File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/media/data/LTLab.lan/agarwal/deepspeech_v0.7.4/training/deepspeech_training/train.py", line 927, in main
    train()
  File "/media/data/LTLab.lan/agarwal/deepspeech_v0.7.4/training/deepspeech_training/train.py", line 473, in train
    gradients, loss, non_finite_files = get_tower_results(iterator, optimizer, dropout_rates)
  File "/media/data/LTLab.lan/agarwal/deepspeech_v0.7.4/training/deepspeech_training/train.py", line 312, in get_tower_results
    avg_loss, non_finite_files = calculate_mean_edit_distance_and_loss(iterator, dropout_rates, reuse=i > 0)
  File "/media/data/LTLab.lan/agarwal/deepspeech_v0.7.4/training/deepspeech_training/train.py", line 239, in calculate_mean_edit_distance_and_loss
    logits, _ = create_model(batch_x, batch_seq_len, dropout, reuse=reuse, rnn_impl=rnn_impl)
  File "/media/data/LTLab.lan/agarwal/deepspeech_v0.7.4/training/deepspeech_training/train.py", line 190, in create_model
    output, output_state = rnn_impl(layer_3, seq_length, previous_state, reuse)
  File "/media/data/LTLab.lan/agarwal/deepspeech_v0.7.4/training/deepspeech_training/train.py", line 128, in rnn_impl_cudnn_rnn
    sequence_lengths=seq_length)
  File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/site-packages/tensorflow_core/python/layers/base.py", line 548, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 854, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
  File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 234, in wrapper
    return converted_call(f, options, args, kwargs)
  File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 439, in converted_call
    return _call_unconverted(f, args, kwargs, options)
  File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 330, in _call_unconverted
    return f(*args, **kwargs)
  File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 440, in call
    training)
  File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 518, in _forward
    seed=self._seed)
  File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 1132, in _cudnn_rnn
    outputs, output_h, output_c, _, _ = gen_cudnn_rnn_ops.cudnn_rnnv3(**args)
  File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_cudnn_rnn_ops.py", line 2051, in cudnn_rnnv3
    time_major=time_major, name=name)
  File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()
@lissyx
Collaborator

lissyx commented Aug 2, 2020

tensorflow/tensorflow#41630 (comment)

Can you verify whether it works with that environment variable set?

@lissyx
Collaborator

lissyx commented Aug 2, 2020

@AASHISHAG see above

@AASHISHAG
Author

@lissyx : Setting TF_CUDNN_RESET_RND_GEN_STATE=1 resolved the issue.

Thank you, I am closing the ticket.
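
For anyone hitting the same ThenRnnForward failure, here is a minimal sketch of one way to apply the workaround from a Python launcher. Only the variable name TF_CUDNN_RESET_RND_GEN_STATE and the script name DeepSpeech.py come from this thread; the helper names build_training_env and run_training are hypothetical and are not part of DeepSpeech or TensorFlow.

```python
import os
import subprocess
import sys

# Variable name taken from the comment above; "1" is the value reported to work.
WORKAROUND_VAR = "TF_CUDNN_RESET_RND_GEN_STATE"


def build_training_env(base_env=None):
    """Return a copy of the environment with the workaround variable set.

    Hypothetical helper for illustration. The input mapping is not modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env[WORKAROUND_VAR] = "1"
    return env


def run_training(train_args, script="DeepSpeech.py"):
    """Run the training script in a child process that sees the variable.

    Hypothetical helper; train_args is the list of flags you already pass
    on the command line. Returns the child's exit code.
    """
    command = [sys.executable, script] + list(train_args)
    return subprocess.run(command, env=build_training_env()).returncode
```

Calling run_training with the same flags as the original command starts the run with the variable set. In a POSIX shell, prefixing the original command with TF_CUDNN_RESET_RND_GEN_STATE=1 should have the same effect without any wrapper.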

@lissyx
Collaborator

lissyx commented Aug 2, 2020

Thanks. That suggests this issue has a much wider impact than we anticipated or experienced.
