Fixing 1-length special tokens cut. #13862

Merged
merged 1 commit into from Oct 5, 2021

Narsil
Contributor

@Narsil Narsil commented Oct 4, 2021

What does this PR do?

Fixes an issue where special tokens of length 1 would not be cut.

The core of the issue is that we would check for a trie match only AFTER moving ahead 2 characters (1 character for the first match, and ANOTHER one in the regular branch).

The fix:

  • Checks for termination before moving ahead in the regular branch (cleaner).
  • Adds another termination check at the end of the string, because we might still have dangling states at that point.
  • Adds a few tests for those cases directly in the Trie tests.
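To make the two termination checks concrete, here is a minimal, simplified sketch of a trie-based splitter. It is not the library's actual `Trie` implementation: the class name `SketchTrie`, the `_cut_points` helper, and the "first completed match wins" policy are assumptions made for illustration only. It shows a termination check performed before the current character is consumed, plus a final check once the string is exhausted.

```python
class SketchTrie:
    """Illustrative trie splitter (hypothetical, not the transformers Trie)."""

    def __init__(self):
        self.data = {}

    def add(self, word):
        if not word:
            return
        node = self.data
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = 1  # empty-string key marks the end of a token

    def _cut_points(self, text):
        states = {}  # start index -> trie node reached so far
        offsets = [0]
        for current, ch in enumerate(text):
            # Termination check BEFORE consuming `ch`: a token that ended
            # on the previous character (even a 1-length one) is cut here.
            done = [start for start, node in states.items() if "" in node]
            if done:
                offsets += [min(done), current]
                states = {}
            # Advance the partial matches that can still consume `ch`.
            states = {s: node[ch] for s, node in states.items() if ch in node}
            # A new match may also start at the current character.
            if ch in self.data:
                states[current] = self.data[ch]
        # Final termination check: states may still be dangling at the end.
        done = [start for start, node in states.items() if "" in node]
        if done:
            offsets += [min(done), len(text)]
        offsets.append(len(text))
        return offsets

    def split(self, text):
        offsets = self._cut_points(text)
        return [text[a:b] for a, b in zip(offsets, offsets[1:]) if a != b]


trie = SketchTrie()
trie.add("A")
print(trie.split("ABC"))  # ['A', 'BC']
print(trie.split("BCA"))  # ['BC', 'A']
```

In this sketch, the second call only produces the trailing `A` because of the check after the loop; without it, the state opened on the last character would be silently dropped.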

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Collaborator

@sgugger sgugger left a comment

Thanks for fixing and adding a test!

@Narsil
Contributor Author

Narsil commented Oct 4, 2021

I am running the slow tokenizer tests on a box, just to be sure before merging.

@Narsil
Contributor Author

Narsil commented Oct 4, 2021

================================================================================================================================ short test summary info =================================================================================================================================
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_attention_outputs - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_beam_sample_generate - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_beam_sample_generate_dict_output - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_beam_search_generate - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_beam_search_generate_dict_output - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_beam_search_generate_dict_outputs_use_cache - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_correct_missing_keys - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_determinism - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_feed_forward_chunking - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_forward_signature - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_generate_with_head_masking - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_generate_without_input_ids - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_greedy_generate - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_greedy_generate_dict_outputs - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_greedy_generate_dict_outputs_use_cache - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_group_beam_search_generate - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_group_beam_search_generate_dict_output - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_headmasking - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_hidden_states_output - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_initialization - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_inputs_embeds - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_load_with_mismatched_shapes - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_model_common_attributes - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_model_outputs_equivalence - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_problem_types - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_resize_embeddings_untied - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_resize_tokens_embeddings - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_sample_generate - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_sample_generate_dict_output - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_save_load - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_save_load_fast_init_from_base - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_save_load_fast_init_to_base - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_save_load_keys_to_ignore_on_save - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_seq_classification_use_mems_train - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_tie_model_weights - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_torch_fx - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_torch_fx_output_loss - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_torchscript - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_torchscript_output_attentions - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_torchscript_output_hidden_state - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_training - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_training_gradient_checkpointing - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_xlnet_base_model - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_xlnet_base_model_use_mems - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_xlnet_base_model_with_att_output - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_xlnet_lm_head - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_xlnet_qa - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_xlnet_sequence_classif - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_xlnet_token_classif - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_deberta_v2.py::DebertaV2TokenizationTest::test_tf_encode_plus_sent_to_model - tensorflow.python.framework.errors_impl.ResourceExhaustedError: failed to allocate memory [Op:AddV2]
FAILED tests/test_tokenization_fnet.py::FNetTokenizationTest::test_tokenizer_integration - requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/api/models/google/fnet-base/revision/58e0d1f96af163dc8d0a84a2fddf4bd403e4e802
FAILED tests/test_tokenization_layoutlmv2.py::LayoutLMv2TokenizationTest::test_torch_encode_plus_sent_to_model - TypeError: 'NoneType' object is not iterable
FAILED tests/test_tokenization_layoutlmv2.py::LayoutLMv2TokenizationTest::test_torch_encode_plus_sent_to_model - ValueError: too many values to unpack (expected 2)
FAILED tests/test_tokenization_roformer.py::RoFormerTokenizationTest::test_saving_tokenizer_trainer - TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType

Results (6202.16s):
    4427 passed
      54 failed
         - tests/test_modeling_common.py:389 XLNetModelTest.test_attention_outputs
         - tests/test_generation_utils.py:842 XLNetModelTest.test_beam_sample_generate
         - tests/test_generation_utils.py:879 XLNetModelTest.test_beam_sample_generate_dict_output
         - tests/test_generation_utils.py:676 XLNetModelTest.test_beam_search_generate
         - tests/test_generation_utils.py:732 XLNetModelTest.test_beam_search_generate_dict_output
         - tests/test_generation_utils.py:788 XLNetModelTest.test_beam_search_generate_dict_outputs_use_cache
         - tests/test_modeling_common.py:1263 XLNetModelTest.test_correct_missing_keys
         - tests/test_modeling_common.py:310 XLNetModelTest.test_determinism
         - tests/test_modeling_common.py:1048 XLNetModelTest.test_feed_forward_chunking
         - tests/test_modeling_common.py:328 XLNetModelTest.test_forward_signature
         - tests/test_generation_utils.py:1076 XLNetModelTest.test_generate_with_head_masking
         - tests/test_generation_utils.py:938 XLNetModelTest.test_generate_without_input_ids
         - tests/test_generation_utils.py:517 XLNetModelTest.test_greedy_generate
         - tests/test_generation_utils.py:528 XLNetModelTest.test_greedy_generate_dict_outputs
         - tests/test_generation_utils.py:557 XLNetModelTest.test_greedy_generate_dict_outputs_use_cache
         - tests/test_generation_utils.py:957 XLNetModelTest.test_group_beam_search_generate
         - tests/test_generation_utils.py:1013 XLNetModelTest.test_group_beam_search_generate_dict_output
         - tests/test_modeling_common.py:712 XLNetModelTest.test_headmasking
         - tests/test_modeling_common.py:944 XLNetModelTest.test_hidden_states_output
         - tests/test_modeling_common.py:296 XLNetModelTest.test_initialization
         - tests/test_modeling_common.py:1395 XLNetModelTest.test_inputs_embeds
         - tests/test_modeling_common.py:1616 XLNetModelTest.test_load_with_mismatched_shapes
         - tests/test_modeling_common.py:1253 XLNetModelTest.test_model_common_attributes
         - tests/test_modeling_common.py:1327 XLNetModelTest.test_model_outputs_equivalence
         - tests/test_modeling_common.py:1573 XLNetModelTest.test_problem_types
         - tests/test_modeling_common.py:1202 XLNetModelTest.test_resize_embeddings_untied
         - tests/test_modeling_common.py:1150 XLNetModelTest.test_resize_tokens_embeddings
         - tests/test_generation_utils.py:585 XLNetModelTest.test_sample_generate
         - tests/test_generation_utils.py:630 XLNetModelTest.test_sample_generate_dict_output
         - tests/test_modeling_common.py:144 XLNetModelTest.test_save_load
         - tests/test_modeling_common.py:205 XLNetModelTest.test_save_load_fast_init_from_base
         - tests/test_modeling_common.py:250 XLNetModelTest.test_save_load_fast_init_to_base
         - tests/test_modeling_common.py:170 XLNetModelTest.test_save_load_keys_to_ignore_on_save
         - tests/test_modeling_xlnet.py:565 XLNetModelTest.test_seq_classification_use_mems_train
         - tests/test_modeling_common.py:1279 XLNetModelTest.test_tie_model_weights
         - tests/test_modeling_common.py:602 XLNetModelTest.test_torch_fx
         - tests/test_modeling_common.py:606 XLNetModelTest.test_torch_fx_output_loss
         - tests/test_modeling_common.py:504 XLNetModelTest.test_torchscript
         - tests/test_modeling_common.py:509 XLNetModelTest.test_torchscript_output_attentions
         - tests/test_modeling_common.py:515 XLNetModelTest.test_torchscript_output_hidden_state
         - tests/test_modeling_common.py:354 XLNetModelTest.test_training
         - tests/test_modeling_common.py:371 XLNetModelTest.test_training_gradient_checkpointing
         - tests/test_modeling_xlnet.py:554 XLNetModelTest.test_xlnet_base_model
         - tests/test_modeling_xlnet.py:559 XLNetModelTest.test_xlnet_base_model_use_mems
         - tests/test_modeling_xlnet.py:569 XLNetModelTest.test_xlnet_base_model_with_att_output
         - tests/test_modeling_xlnet.py:574 XLNetModelTest.test_xlnet_lm_head
         - tests/test_modeling_xlnet.py:589 XLNetModelTest.test_xlnet_qa
         - tests/test_modeling_xlnet.py:579 XLNetModelTest.test_xlnet_sequence_classif
         - tests/test_modeling_xlnet.py:584 XLNetModelTest.test_xlnet_token_classif
         - tests/test_tokenization_common.py:2067 DebertaV2TokenizationTest.test_tf_encode_plus_sent_to_model
         - tests/test_tokenization_fnet.py:433 FNetTokenizationTest.test_tokenizer_integration
         - tests/test_tokenization_layoutlmv2.py:1215 LayoutLMv2TokenizationTest.test_torch_encode_plus_sent_to_model
         - tests/test_tokenization_layoutlmv2.py:1215 LayoutLMv2TokenizationTest.test_torch_encode_plus_sent_to_model
         - tests/test_tokenization_common.py:3476 RoFormerTokenizationTest.test_saving_tokenizer_trainer

None of these seem to be linked to this change. @sgugger, are you OK with merging this?

@sgugger
Collaborator

sgugger commented Oct 4, 2021

Let's wait for @LysandreJik's review, if you don't mind.

Member

@LysandreJik LysandreJik left a comment

Looks good to me. The failing tests seem unrelated. We just released, so if we see any new failure in the daily tests, we'll be able to revert while we fix.

@Narsil Narsil merged commit 7079a99 into huggingface:master Oct 5, 2021
@Narsil Narsil deleted the fix_1length_special_tokens branch October 5, 2021 16:35
stas00 pushed a commit to stas00/transformers that referenced this pull request Oct 12, 2021
lapisfluvialis pushed a commit to lapisfluvialis/transformers that referenced this pull request Oct 27, 2021
Albertobegue pushed a commit to Albertobegue/transformers that referenced this pull request Jan 27, 2022