Error when files with same size are in first batch #3327

Closed
shan18 opened this issue Sep 18, 2020 · 5 comments

shan18 commented Sep 18, 2020

  • Have I written custom code: No
  • OS Platform and Distribution: Linux Ubuntu 18.04
  • TensorFlow installed from: upstream TensorFlow ($ pip install tensorflow-gpu==1.15.2)
  • TensorFlow version: 1.15.2
  • Python version: 3.6.9
  • CUDA/cuDNN version: 10.0/7.6.3
  • GPU model and memory: Tesla V100 (16 GB)
  • Exact command to reproduce: python DeepSpeech.py --train_files data/train.csv --train_batch_size 2 --train_cudnn

While training the model with the command shown above, I get a strange error that appears only in certain cases.

The error message:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 187, 2, 2048] 
	 [[{{node tower_0/cudnn_lstm/CudnnRNNV3}}]]
	 [[tower_0/gradients/tower_0/BiasAdd_2_grad/BiasAddGrad/_87]]
  (1) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 187, 2, 2048] 
	 [[{{node tower_0/cudnn_lstm/CudnnRNNV3}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "DeepSpeech.py", line 1005, in <module>
    run_script()
  File "DeepSpeech.py", line 1002, in run_script
    absl.app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "DeepSpeech.py", line 974, in main
    train()
  File "DeepSpeech.py", line 642, in train
    train_loss, _ = run_set('train', epoch, train_init_op)
  File "DeepSpeech.py", line 607, in run_set
    feed_dict=feed_dict)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 187, 2, 2048] 
	 [[node tower_0/cudnn_lstm/CudnnRNNV3 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
	 [[tower_0/gradients/tower_0/BiasAdd_2_grad/BiasAddGrad/_87]]
  (1) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 187, 2, 2048] 
	 [[node tower_0/cudnn_lstm/CudnnRNNV3 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'tower_0/cudnn_lstm/CudnnRNNV3':
  File "DeepSpeech.py", line 1005, in <module>
    run_script()
  File "DeepSpeech.py", line 1002, in run_script
    absl.app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "DeepSpeech.py", line 974, in main
    train()
  File "DeepSpeech.py", line 520, in train
    gradients, loss, non_finite_files = get_tower_results(iterator, optimizer, dropout_rates)
  File "DeepSpeech.py", line 314, in get_tower_results
    avg_loss, non_finite_files = calculate_mean_edit_distance_and_loss(iterator, dropout_rates, reuse=i > 0)
  File "DeepSpeech.py", line 241, in calculate_mean_edit_distance_and_loss
    logits, _ = create_model(batch_x, batch_seq_len, dropout, reuse=reuse, rnn_impl=rnn_impl)
  File "DeepSpeech.py", line 192, in create_model
    output, output_state = rnn_impl(layer_3, seq_length, previous_state, reuse)
  File "DeepSpeech.py", line 130, in rnn_impl_cudnn_rnn
    sequence_lengths=seq_length)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/layers/base.py", line 548, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer.py", line 854, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py", line 234, in wrapper
    return converted_call(f, options, args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py", line 439, in converted_call
    return _call_unconverted(f, args, kwargs, options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py", line 330, in _call_unconverted
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 440, in call
    training)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 518, in _forward
    seed=self._seed)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 1132, in _cudnn_rnn
    outputs, output_h, output_c, _, _ = gen_cudnn_rnn_ops.cudnn_rnnv3(**args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_cudnn_rnn_ops.py", line 2051, in cudnn_rnnv3
    time_major=time_major, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

To reproduce the issue, I disabled the sorting of input samples so that I could pinpoint the exact batch causing it. Here are my findings:

Case 1

Contents of train.csv

wav_filename,wav_filesize,transcript
audio/1.wav,69198,he is becoming strong day by day
audio/2.wav,69198,she is a good woman
audio/3.wav,120398,i am working in sales and marketing
audio/4.wav,69198,what i do please suggest me
audio/5.wav,69198,my question is madam

The first batch contains files that are all the same size. Since the batch size is 2, the model receives 1.wav and 2.wav in the first step. This configuration of train.csv throws the error above.

Case 2

Now let's rearrange the contents of train.csv to:

wav_filename,wav_filesize,transcript
audio/3.wav,120398,i am working in sales and marketing
audio/1.wav,69198,he is becoming strong day by day
audio/2.wav,69198,she is a good woman
audio/4.wav,69198,what i do please suggest me
audio/5.wav,69198,my question is madam

The first batch now has samples of different sizes (3.wav and 1.wav). Surprisingly, this case does not throw an error; the model trains without any issues.

Can anyone help me understand what might be causing this issue? Or is this a bug?
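
For reference, here is a minimal Python sketch of how the reordering in Case 2 could be automated. The column names match the CSVs above, but the script itself is only an illustration and not part of DeepSpeech:

# reorder_train_csv.py -- illustrative only: move a row whose wav_filesize differs
# from the first row's size to the top, so the first batch (batch size 2) mixes sizes.
import csv

with open("data/train.csv", newline="") as f:
    rows = list(csv.DictReader(f))

first_size = rows[0]["wav_filesize"]
for i, row in enumerate(rows):
    if row["wav_filesize"] != first_size:
        rows.insert(0, rows.pop(i))  # bring the differently-sized file to the front
        break

with open("data/train.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["wav_filename", "wav_filesize", "transcript"])
    writer.writeheader()
    writer.writerows(rows)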

lissyx (Collaborator) commented Sep 18, 2020

Have you tried the documented workaround from #3088? It matches your exact error message and experiments.

shan18 (Author) commented Sep 18, 2020

I checked the thread. From what I could understand, the workaround is to experiment with different versions of TensorFlow + NVIDIA driver + CUDA + cuDNN and see what works best for your particular GPU, right? That is according to the experiments mentioned here: #3088 (comment)

lissyx (Collaborator) commented Sep 18, 2020

I checked the thread. From what I could understand, the workaround is to experiment with different versions of TensorFlow + NVIDIA driver + CUDA + cuDNN and see what works best for your particular GPU, right? That is according to the experiments mentioned here: #3088 (comment)

No, the workaround is an env variable until the proper fix is released by TensorFlow: #3088 (comment)
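
Just to illustrate how that kind of workaround is applied (the actual variable name is the one documented in the linked #3088 comment; WORKAROUND_ENV_VAR below is only a placeholder), you set it in the environment before launching training:

WORKAROUND_ENV_VAR=1 python DeepSpeech.py --train_files data/train.csv --train_batch_size 2 --train_cudnn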

shan18 (Author) commented Sep 18, 2020

I tried this and it worked. Thanks a lot 👍

shan18 closed this as completed Sep 18, 2020
lissyx (Collaborator) commented Sep 18, 2020

Thanks. The TensorFlow team merged the r1.15 PR a few hours ago, so hopefully they are preparing 1.15.4 and we will be able to bump our dependency.
