fix for word_tokenize() Failing to Split English Contractions When Followed by [\t\n\f\r] #3224
base: develop
Conversation
@stevenbird Hi, I was wondering what you thought about this. The issue that I hope to fix here is word_tokenize() failing to split English contractions when followed by [\t\n\f\r]. The solution I came up with was removing \t\n\f\r from the sentence. In terms of unintended side effects, I would not expect this to cause issues, since a space replaces the tab. I can add some tests to verify the fix, but first I was wondering whether you would be interested in merging it. Thank you!
@purificant can I get a review of this? I think it is solid and ready for merging. Thank you!
This doesn't solve the full problem: word_tokenize would still fail to split contractions followed by [\a\b\v]. So the first step would be trying to find the extent of the problem.
@ekaf thank you for figuring out the full problem scope. I was unaware of the other parts of this. Regarding the STARTING_QUOTE and ENDING_QUOTE part: I will have to look, but I believe the bug failed to be resolved unless I had the substitution present in both places. I would not add more lines of code than needed :)
@Higgs32584, I don't know the full problem scope, there could be more... Neither do I know the best place to do the substitution, but I have verified that it works when listed only under STARTING_QUOTE. Having it only under PUNCTUATION also works.
The cause of the problem is that the last two lines under ENDING_QUOTE handle contractions using a regular expression that requires the contraction to be followed by a plain space. So the match fails when the contraction is followed by some other escaped whitespace-like character.
This substitution can probably be applied anywhere before the last two substitutions in ENDING_QUOTE, and I suppose it wouldn't …
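The failure mode described here can be reproduced with a minimal standalone regex. This is a simplified, illustrative stand-in for one of the ENDING_QUOTE contraction rules, not the exact NLTK pattern:

```python
import re

# Illustrative simplification of an ENDING_QUOTE contraction rule:
# it requires a literal plain space after the "n't" contraction.
rule = re.compile(r"([^' ])(n't) ")

print(bool(rule.search("can't go")))   # True: contraction followed by a plain space
print(bool(rule.search("can't\tgo")))  # False: a tab breaks the match
```

Any whitespace character other than a plain space (tab, newline, form feed, carriage return) falls outside the literal `" "` in the pattern, so the contraction is never split off.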
It is redundant to apply the same substitution twice.
The following substitution handles more whitespace-like characters:
(re.compile(r"\s+"), " "),
It should be applied before the last two substitutions in ENDING_QUOTE, which are causing the problem.
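To make the interaction concrete, here is a minimal sketch assuming a simplified stand-in for the two contraction rules at the end of ENDING_QUOTE. The rule patterns and the `tokenize` function below are illustrative, not the actual NLTK source:

```python
import re

# Simplified stand-ins for the two contraction rules at the end of
# ENDING_QUOTE: both require a plain space after the contraction.
CONTRACTION_RULES = [
    (re.compile(r"([^' ])('[sS]|'[mM]|'[dD]|') "), r"\1 \2 "),
    (re.compile(r"([^' ])('ll|'LL|'re|'RE|'ve|'VE|n't|N'T) "), r"\1 \2 "),
]

def tokenize(text, normalize_whitespace=True):
    if normalize_whitespace:
        # The suggested fix: collapse tabs, newlines, form feeds, etc.
        # into a single plain space *before* the contraction rules run.
        text = re.sub(r"\s+", " ", text)
    for pattern, repl in CONTRACTION_RULES:
        text = pattern.sub(repl, text)
    return text.split()

print(tokenize("I can't\tgo", normalize_whitespace=False))  # ['I', "can't", 'go']
print(tokenize("I can't\tgo"))                              # ['I', 'ca', "n't", 'go']
```

Without the normalization step the tab-separated contraction stays glued together; with the `\s+` substitution applied first, it splits exactly as it would after a plain space.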
@ekaf Thank you for the suggestions. Do you need some test cases as well? I know some repos want test cases added with every bug fix.
Thanks @Higgs32584, this looks good. Test cases are always much appreciated everywhere.
I think this solves #3189.
With this pull request, I hope to fix issue #3189: word_tokenize() failing to split English contractions when followed by [\t\n\f\r].
Obviously this is a first draft, but I would like some initial review and comments. Thank you!