I found that nltk.word_tokenize() fails to split contractions such as "he's" and "book's" when they are followed by one of the whitespace characters [\t\n\f\r]. The examples below illustrate the issue.
How to Reproduce
from nltk import word_tokenize
sentence_1 = "he's a good boy."
word_tokenize(sentence_1)
# ['he', "'s", 'a', 'good', 'boy', '.']
sentence_2 = "he's\t a good boy."
word_tokenize(sentence_2)
# ["he's", 'a', 'good', 'boy', '.']
"he's" in sentence_2 is not split because it is followed by "\t" rather than a space. The same happens with the other whitespace characters [\n\f\r].
Here is another, seemingly odd, example. When the contraction is the last word in the text, it is split correctly whether or not a whitespace character follows it.
sentence_3 = "he's\f a good boy. he's\t"
word_tokenize(sentence_3)
# ["he's", 'a', 'good', 'boy', '.', 'he', "'s"]
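One possible explanation, which this report does not confirm, is a contraction rule whose regular expression requires a literal space after the contraction. The sketch below is a hypothetical stand-in, not NLTK's actual code: SPACE_ONLY, ANY_WHITESPACE, and both helper functions are names made up for illustration. It shows how such a rule would skip "he's\t", and how a whitespace lookahead would avoid that.

```python
import re

# Hypothetical rule: matches a contraction only when a literal space follows it.
SPACE_ONLY = re.compile(r"([^' ])('[sS]|'[mM]|'[dD]) ")

# Hypothetical variant: accepts any whitespace character after the contraction.
ANY_WHITESPACE = re.compile(r"([^' ])('[sS]|'[mM]|'[dD])(?=\s)")


def split_space_only(text):
    # Pad with spaces so a contraction at the very end still has a follower.
    padded = " " + text + " "
    return SPACE_ONLY.sub(r"\1 \2 ", padded).split()


def split_any_whitespace(text):
    padded = " " + text + " "
    return ANY_WHITESPACE.sub(r"\1 \2", padded).split()


print(split_space_only("he's a good boy"))        # contraction is split
print(split_space_only("he's\ta good boy"))       # contraction is kept whole
print(split_any_whitespace("he's\ta good boy"))   # contraction is split
```

If the real rule has this shape, replacing the trailing space with a lookahead such as (?=\s) would be one way to cover tabs, newlines, form feeds, and carriage returns.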
Expected Behaviors
Contractions should be split correctly regardless of which whitespace character follows them.
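Until that is the case, a possible workaround is to collapse every run of whitespace into a single space before tokenizing. This is only a sketch: normalize_whitespace is a hypothetical helper, not part of NLTK, and it discards the original character offsets and line breaks.

```python
import re


def normalize_whitespace(text):
    """Collapse each run of whitespace (space, \\t, \\n, \\f, \\r) into one space."""
    return re.sub(r"\s+", " ", text)


sentence_2 = "he's\t a good boy."
normalized = normalize_whitespace(sentence_2)
print(repr(normalized))
```

The normalized string is identical to sentence_1, so passing it to word_tokenize() should give the correctly split result reported for sentence_1.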
Environments
Python: 3.7.12 and 3.10.12
nltk (install via pip): 3.8.1
The tokenizer also fails to split the contraction when it is the second word:
['he', "'s", 'a', 'good', 'boy', '.']
["he's", "he's", 'yolo']
["he's", 'a', 'good', 'boy', '.', 'he', "'s"]