word_tokenize() Failed to Split English Contractions When Followed by [\t\n\f\r] #3189

donglihe-hub · 2023-09-27T11:34:24Z

Hi Maintainers,

I found that nltk.word_tokenize() fails to split contractions such as "he's" and "book's" when they are followed by one of [\t\n\f\r]. The examples below illustrate the issue.

How to Reproduce

from nltk.tokenize import word_tokenize

sentence_1 = "he's a good boy."
word_tokenize(sentence_1)
# ['he', "'s", 'a', 'good', 'boy', '.']

sentence_2 = "he's\t a good boy."
word_tokenize(sentence_2)
# ["he's", 'a', 'good', 'boy', '.']

"he's" in sentence_2 is not split because it is followed by "\t" rather than a plain space. The same happens with other whitespace characters such as [\n\f\r].
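To make the symptom concrete, here is a minimal sketch using only `re`. It assumes the contraction rule behaves like a substitution that requires a literal space after the clitic. The patterns and the `split_clitics` helper are hypothetical illustrations modeled on the behavior reported above, not NLTK's actual code.

```python
import re

# Hypothetical rule modeled on the reported behavior (not copied from NLTK):
# it only fires when the clitic is followed by a literal space.
SPACE_ONLY = re.compile(r"([^' ])('[sS]|'[mM]|'[dD]) ")

# Variant that accepts any whitespace character, or end of string, after the clitic.
ANY_WHITESPACE = re.compile(r"([^'\s])('[sS]|'[mM]|'[dD])(?=\s|$)")


def split_clitics(pattern, text):
    """Put a space between a word and its clitic wherever `pattern` matches,
    then split on whitespace. Punctuation is deliberately left untouched."""
    return pattern.sub(r"\1 \2 ", text).split()


print(split_clitics(SPACE_ONLY, "he's a good boy."))
# ['he', "'s", 'a', 'good', 'boy.']
print(split_clitics(SPACE_ONLY, "he's\t a good boy."))
# ["he's", 'a', 'good', 'boy.']
print(split_clitics(ANY_WHITESPACE, "he's\t a good boy."))
# ['he', "'s", 'a', 'good', 'boy.']
```

If the real rule has this shape, matching on `\s` (or using a lookahead as above) instead of a literal space would be one possible direction for a fix.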

Here is another, seemingly odd, example. When the contraction is the last word of the input, it is split correctly whether or not a whitespace character follows it.

sentence_3 = "he's\f a good boy. he's\t"
word_tokenize(sentence_3)
# ["he's", 'a', 'good', 'boy', '.', 'he', "'s"]

Expected Behaviors

Contractions should be split correctly regardless of which whitespace character, if any, follows them.
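Until this is fixed, one possible workaround is to normalize whitespace before tokenizing. The sketch below is an assumption on my side rather than an official recommendation, and `normalize_whitespace` is a hypothetical helper name. Note that it also replaces newlines, which may change how the text is divided into sentences.

```python
import re

# Any run of whitespace: tabs, newlines, form feeds, carriage returns, spaces, ...
_WHITESPACE_RUN = re.compile(r"\s+")


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace characters into a single space."""
    return _WHITESPACE_RUN.sub(" ", text)


print(repr(normalize_whitespace("he's\t a good boy.")))
# "he's a good boy."
```

The normalized string is identical to sentence_1, so passing it to word_tokenize() should give the same tokens as the first example.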

Environments

Python: 3.7.12 and 3.10.12
nltk (installed via pip): 3.8.1

Higgs32584 · 2023-12-22T20:52:48Z

The tokenizer also fails to split a contraction when it is the second word:
['he', "'s", 'a', 'good', 'boy', '.']
["he's", "he's", 'yolo']
["he's", 'a', 'good', 'boy', '.', 'he', "'s"]

ekaf · 2024-01-21T08:35:11Z

word_tokenize also fails to split contractions followed by [\a\b\v].
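A detail that may matter for any fix: of these three characters, only \v counts as whitespace for Python's `\s` regex class, while \a and \b are control characters. The short check below uses a hypothetical helper, `is_regex_whitespace`, to show which of the characters mentioned in this thread a `\s`-based rule would cover.

```python
import re

# Characters mentioned in this thread, in the order they were reported.
REPORTED = ["\t", "\n", "\f", "\r", "\a", "\b", "\v"]


def is_regex_whitespace(ch: str) -> bool:
    """Return True if the single character `ch` is matched by the regex class \\s."""
    return re.fullmatch(r"\s", ch) is not None


for ch in REPORTED:
    print(repr(ch), is_regex_whitespace(ch))
# '\t' True
# '\n' True
# '\x0c' True
# '\r' True
# '\x07' False
# '\x08' False
# '\x0b' True
```

So a rule that matches on `\s` would handle \v but would still leave contractions followed by \a or \b unsplit.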

This was referenced Dec 27, 2023

first draft, plan to fix issue #3189 #3223 (Closed)

fix for word_tokenize() Failing to Split English Contractions When Followed by [\t\n\f\r] #3224 (Open)
