Fixing 1-length special tokens cut. #13862

Narsil · 2021-10-04T13:22:37Z

What does this PR do?

Fixes an issue where special tokens of length 1 would not be cut.

The core of the issue is that we checked for a trie match only AFTER moving 2 characters (1 character for the first match, and ANOTHER one in the regular branch).

The fix:

Checks for termination before moving ahead in the regular branch (cleaner).
Adds another termination check at the end of the string, because we might still have dangling states at that point.
Adds a few tests for those cases directly in the Trie tests.
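
To make the two termination checks concrete, here is a minimal sketch of a trie-based splitter. It is not the code from this PR: `SimpleTrie`, `TERMINAL`, and the leftmost-longest scanning strategy are assumptions made for illustration. What it shares with the fix is the ordering: the terminal check happens right after each consumed character (so a 1-character token is detected), and a match that ends exactly at the end of the string is still recorded.

```python
TERMINAL = ""  # hypothetical marker key meaning "a token ends at this node"


class SimpleTrie:
    """Illustrative trie for cutting special tokens out of text (not the PR's Trie)."""

    def __init__(self):
        self.root = {}

    def add(self, word):
        # Ignore empty tokens; they can never be cut out of a string.
        if not word:
            return
        node = self.root
        for char in word:
            node = node.setdefault(char, {})
        node[TERMINAL] = True

    def split(self, text):
        cuts = [0]
        i, n = 0, len(text)
        while i < n:
            node, j, end = self.root, i, None
            while j < n and text[j] in node:
                node = node[text[j]]
                j += 1
                # Check for termination immediately after consuming a character,
                # before looking at the next one. This is what lets a token of
                # length 1 match, and it also covers a match ending at len(text).
                if TERMINAL in node:
                    end = j
            if end is not None:
                cuts.extend([i, end])  # cut before and after the longest match
                i = end
            else:
                i += 1
        cuts.append(n)
        # Drop empty pieces produced by cuts at the same offset.
        return [text[a:b] for a, b in zip(cuts, cuts[1:]) if a != b]
```

With only `"A"` added, `split("ABC")` gives `["A", "BC"]` and `split("BCA")` gives `["BC", "A"]`, which are the 1-length and end-of-string cases described above.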

Fixes # (issue)

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

sgugger

Thanks for fixing and adding a test!

Narsil · 2021-10-04T15:06:21Z

I am running the slow tokenizer tests on a box, just to be sure, before merging.

Narsil · 2021-10-04T18:02:46Z

================================================================================================================================ short test summary info =================================================================================================================================
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_attention_outputs - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_beam_sample_generate - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_beam_sample_generate_dict_output - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_beam_search_generate - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_beam_search_generate_dict_output - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_beam_search_generate_dict_outputs_use_cache - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_correct_missing_keys - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_determinism - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_feed_forward_chunking - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_forward_signature - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_generate_with_head_masking - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_generate_without_input_ids - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_greedy_generate - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_greedy_generate_dict_outputs - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_greedy_generate_dict_outputs_use_cache - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_group_beam_search_generate - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_group_beam_search_generate_dict_output - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_headmasking - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_hidden_states_output - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_initialization - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_inputs_embeds - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_load_with_mismatched_shapes - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_model_common_attributes - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_model_outputs_equivalence - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_problem_types - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_resize_embeddings_untied - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_resize_tokens_embeddings - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_sample_generate - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_sample_generate_dict_output - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_save_load - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_save_load_fast_init_from_base - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_save_load_fast_init_to_base - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_save_load_keys_to_ignore_on_save - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_seq_classification_use_mems_train - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_tie_model_weights - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_torch_fx - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_torch_fx_output_loss - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_torchscript - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_torchscript_output_attentions - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_torchscript_output_hidden_state - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_training - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_training_gradient_checkpointing - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_xlnet_base_model - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_xlnet_base_model_use_mems - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_xlnet_base_model_with_att_output - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_xlnet_lm_head - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_xlnet_qa - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_xlnet_sequence_classif - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_cpm.py::XLNetModelTest::test_xlnet_token_classif - RuntimeError: CUDA error: out of memory
FAILED tests/test_tokenization_deberta_v2.py::DebertaV2TokenizationTest::test_tf_encode_plus_sent_to_model - tensorflow.python.framework.errors_impl.ResourceExhaustedError: failed to allocate memory [Op:AddV2]
FAILED tests/test_tokenization_fnet.py::FNetTokenizationTest::test_tokenizer_integration - requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/api/models/google/fnet-base/revision/58e0d1f96af163dc8d0a84a2fddf4bd403e4e802
FAILED tests/test_tokenization_layoutlmv2.py::LayoutLMv2TokenizationTest::test_torch_encode_plus_sent_to_model - TypeError: 'NoneType' object is not iterable
FAILED tests/test_tokenization_layoutlmv2.py::LayoutLMv2TokenizationTest::test_torch_encode_plus_sent_to_model - ValueError: too many values to unpack (expected 2)
FAILED tests/test_tokenization_roformer.py::RoFormerTokenizationTest::test_saving_tokenizer_trainer - TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType

Results (6202.16s):
    4427 passed
      54 failed
         - tests/test_modeling_common.py:389 XLNetModelTest.test_attention_outputs
         - tests/test_generation_utils.py:842 XLNetModelTest.test_beam_sample_generate
         - tests/test_generation_utils.py:879 XLNetModelTest.test_beam_sample_generate_dict_output
         - tests/test_generation_utils.py:676 XLNetModelTest.test_beam_search_generate
         - tests/test_generation_utils.py:732 XLNetModelTest.test_beam_search_generate_dict_output
         - tests/test_generation_utils.py:788 XLNetModelTest.test_beam_search_generate_dict_outputs_use_cache
         - tests/test_modeling_common.py:1263 XLNetModelTest.test_correct_missing_keys
         - tests/test_modeling_common.py:310 XLNetModelTest.test_determinism
         - tests/test_modeling_common.py:1048 XLNetModelTest.test_feed_forward_chunking
         - tests/test_modeling_common.py:328 XLNetModelTest.test_forward_signature
         - tests/test_generation_utils.py:1076 XLNetModelTest.test_generate_with_head_masking
         - tests/test_generation_utils.py:938 XLNetModelTest.test_generate_without_input_ids
         - tests/test_generation_utils.py:517 XLNetModelTest.test_greedy_generate
         - tests/test_generation_utils.py:528 XLNetModelTest.test_greedy_generate_dict_outputs
         - tests/test_generation_utils.py:557 XLNetModelTest.test_greedy_generate_dict_outputs_use_cache
         - tests/test_generation_utils.py:957 XLNetModelTest.test_group_beam_search_generate
         - tests/test_generation_utils.py:1013 XLNetModelTest.test_group_beam_search_generate_dict_output
         - tests/test_modeling_common.py:712 XLNetModelTest.test_headmasking
         - tests/test_modeling_common.py:944 XLNetModelTest.test_hidden_states_output
         - tests/test_modeling_common.py:296 XLNetModelTest.test_initialization
         - tests/test_modeling_common.py:1395 XLNetModelTest.test_inputs_embeds
         - tests/test_modeling_common.py:1616 XLNetModelTest.test_load_with_mismatched_shapes
         - tests/test_modeling_common.py:1253 XLNetModelTest.test_model_common_attributes
         - tests/test_modeling_common.py:1327 XLNetModelTest.test_model_outputs_equivalence
         - tests/test_modeling_common.py:1573 XLNetModelTest.test_problem_types
         - tests/test_modeling_common.py:1202 XLNetModelTest.test_resize_embeddings_untied
         - tests/test_modeling_common.py:1150 XLNetModelTest.test_resize_tokens_embeddings
         - tests/test_generation_utils.py:585 XLNetModelTest.test_sample_generate
         - tests/test_generation_utils.py:630 XLNetModelTest.test_sample_generate_dict_output
         - tests/test_modeling_common.py:144 XLNetModelTest.test_save_load
         - tests/test_modeling_common.py:205 XLNetModelTest.test_save_load_fast_init_from_base
         - tests/test_modeling_common.py:250 XLNetModelTest.test_save_load_fast_init_to_base
         - tests/test_modeling_common.py:170 XLNetModelTest.test_save_load_keys_to_ignore_on_save
         - tests/test_modeling_xlnet.py:565 XLNetModelTest.test_seq_classification_use_mems_train
         - tests/test_modeling_common.py:1279 XLNetModelTest.test_tie_model_weights
         - tests/test_modeling_common.py:602 XLNetModelTest.test_torch_fx
         - tests/test_modeling_common.py:606 XLNetModelTest.test_torch_fx_output_loss
         - tests/test_modeling_common.py:504 XLNetModelTest.test_torchscript
         - tests/test_modeling_common.py:509 XLNetModelTest.test_torchscript_output_attentions
         - tests/test_modeling_common.py:515 XLNetModelTest.test_torchscript_output_hidden_state
         - tests/test_modeling_common.py:354 XLNetModelTest.test_training
         - tests/test_modeling_common.py:371 XLNetModelTest.test_training_gradient_checkpointing
         - tests/test_modeling_xlnet.py:554 XLNetModelTest.test_xlnet_base_model
         - tests/test_modeling_xlnet.py:559 XLNetModelTest.test_xlnet_base_model_use_mems
         - tests/test_modeling_xlnet.py:569 XLNetModelTest.test_xlnet_base_model_with_att_output
         - tests/test_modeling_xlnet.py:574 XLNetModelTest.test_xlnet_lm_head
         - tests/test_modeling_xlnet.py:589 XLNetModelTest.test_xlnet_qa
         - tests/test_modeling_xlnet.py:579 XLNetModelTest.test_xlnet_sequence_classif
         - tests/test_modeling_xlnet.py:584 XLNetModelTest.test_xlnet_token_classif
         - tests/test_tokenization_common.py:2067 DebertaV2TokenizationTest.test_tf_encode_plus_sent_to_model
         - tests/test_tokenization_fnet.py:433 FNetTokenizationTest.test_tokenizer_integration
         - tests/test_tokenization_layoutlmv2.py:1215 LayoutLMv2TokenizationTest.test_torch_encode_plus_sent_to_model
         - tests/test_tokenization_layoutlmv2.py:1215 LayoutLMv2TokenizationTest.test_torch_encode_plus_sent_to_model
         - tests/test_tokenization_common.py:3476 RoFormerTokenizationTest.test_saving_tokenizer_trainer

None of these failures seem to be linked to this change. @sgugger, are you OK with merging this?

sgugger · 2021-10-04T18:15:04Z

Let's wait for @LysandreJik's review, if you don't mind.

LysandreJik

Looks good to me. The test failures seem unrelated. We just released, so if we see any new failures in the daily tests, we'll be able to revert while we fix.

Fixing 1-length special tokens cut.

0a668ea

Narsil requested review from sgugger and LysandreJik October 4, 2021 13:25

sgugger approved these changes Oct 4, 2021

LysandreJik approved these changes Oct 5, 2021

Narsil merged commit 7079a99 into huggingface:master Oct 5, 2021

Narsil deleted the fix_1length_special_tokens branch October 5, 2021 16:35
