
word_tokenize() Failed to Split English Contractions When Followed by [\t\n\f\r] #3189

Open
donglihe-hub opened this issue Sep 27, 2023 · 2 comments


donglihe-hub commented Sep 27, 2023

Hi Maintainers,

I found that nltk.word_tokenize() fails to split contractions such as "he's" and "book's" when they are followed by one of [\t\n\f\r]. The examples below illustrate the issue.

How to Reproduce

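from nltk.tokenize import word_tokenize

# Contraction followed by a plain space: split as expected.
sentence_1 = "he's a good boy."
word_tokenize(sentence_1)
# ['he', "'s", 'a', 'good', 'boy', '.']

# Contraction followed by a tab: not split.
sentence_2 = "he's\t a good boy."
word_tokenize(sentence_2)
# ["he's", 'a', 'good', 'boy', '.']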

"he's" in sentence_2 is not split because it is followed by "\t" rather than a space character. The same happens with the other whitespace characters [\n\f\r].
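Until the tokenizer handles these characters, one possible workaround is to replace them with plain spaces before tokenizing. The following is only a minimal sketch: `normalize_whitespace` is a hypothetical helper name, not part of nltk, and it covers only the four characters named in this report.

```python
import re

# The whitespace characters named in this report: tab, newline,
# form feed, and carriage return.
_NON_SPACE_WS = re.compile(r"[\t\n\f\r]")


def normalize_whitespace(text: str) -> str:
    """Replace each tab, newline, form feed, or carriage return with a space.

    Hypothetical helper for illustration; not an nltk function.
    """
    return _NON_SPACE_WS.sub(" ", text)


normalized = normalize_whitespace("he's\t a good boy.")
print(repr(normalized))
```

Passing `normalized` to word_tokenize() should then behave like sentence_1, since the contraction is followed by a space. Note that this also flattens line breaks, which may matter if sentence splitting relies on them.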

Here is another, somewhat odd, example. When the contraction is the last word of the input, it is split correctly whether or not it is followed by a whitespace character.

sentence_3 = "he's\f a good boy. he's\t"
word_tokenize(sentence_3)
# ["he's", 'a', 'good', 'boy', '.', 'he', "'s"]

Expected Behavior

Contractions should be split correctly regardless of which whitespace character follows them, or whether any follows at all.
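To make the expected behavior concrete, here is a small standalone sketch that contrasts a contraction rule requiring a literal space after the suffix with one that accepts any whitespace character or the end of the input. Both patterns are hypothetical simplifications written for this report; they are not taken from nltk's source, and whether nltk's rule has this shape is an assumption based only on the behavior observed above.

```python
import re

# Hypothetical rule: fires only when the suffix is followed by a literal space.
SPACE_ONLY = re.compile(r"([^' ])('[sS]|'[mM]|'[dD]) ")

# Hypothetical variant: accepts any whitespace character, or the end of
# the input, after the suffix (checked with a lookahead, so nothing is consumed).
ANY_WS = re.compile(r"([^'\s])('[sS]|'[mM]|'[dD])(?=\s|$)")


def split_space_only(text):
    """Separate the contraction suffix, but only before a literal space."""
    return SPACE_ONLY.sub(r"\1 \2 ", text).split()


def split_any_ws(text):
    """Separate the contraction suffix before any whitespace or end of input."""
    return ANY_WS.sub(r"\1 \2", text).split()


print(split_space_only("he's\t a good boy"))
print(split_any_ws("he's\t a good boy"))
```

With the space-only rule the tab-separated contraction stays in one piece, while the whitespace-aware variant separates it, which is the behavior this report asks for. Punctuation handling is left out to keep the sketch short.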

Environment

Python: 3.7.12 and 3.10.12
nltk (installed via pip): 3.8.1

@Higgs32584

The tokenizer also fails to split the contraction when it is the second word:
['he', "'s", 'a', 'good', 'boy', '.']
["he's", "he's", 'yolo']
["he's", 'a', 'good', 'boy', '.', 'he', "'s"]

@ekaf
Contributor

ekaf commented Jan 21, 2024

word_tokenize also fails to split contractions followed by [\a\b\v].
