Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix for word_tokenize() Failing to Split English Contractions When Followed by [\t\n\f\r] #3224

Open
wants to merge 8 commits into
base: develop
Choose a base branch
from

Conversation

Higgs32584
Copy link

With this pull request, I hope to fix the issue of #3189, or the word_tokenize() Failed to Split English Contractions When Followed by [\t\n\f\r].

Obviously this is the first draft, but I would like some initial review and comments. Thank you

@Higgs32584
Copy link
Author

#3189

@Higgs32584 Higgs32584 changed the title Naissue 3189 fix for word_tokenize() Failing to Split English Contractions When Followed by [\t\n\f\r] Dec 29, 2023
@Higgs32584
Copy link
Author

@stevenbird Hi, I was wondering what you thought about this? The issue that I hope to fix here is the problem with word_tokenize() Failing to Split English Contractions When Followed by [\t\n\f\r] The solution I came up with was removing \t\n\f\r from the sentence.

In terms of unintended side effects, I would not expect this to cause issue due to the fact that a space would replace the tab.

I can add some tests to verify the fix, although first, I was wondering if you would be interested in merging it? Thank you?

@Higgs32584
Copy link
Author

@purificant can I get a review of this? I think it is solid and ready for merging. Thank you

@ekaf
Copy link
Contributor

ekaf commented Jan 21, 2024

This doesn't solve the full problem: word_tokenize would still fail to split contractions followed by [\a\b\v]. So the first step would be trying to find the extent of the problem.
Concerning the implementation, there is no point in making the same substitution in both STARTING_QUOTE and END_QUOTE:
after the substitution has been applied once, there is nothing left to substitute in the second round.
The ideal reviewers for this PR would be the authors of the destructive.py file (@alvations and @tomaarsen ). But there is no hurry, as long as this work is far from mergeable in its present state.

@Higgs32584
Copy link
Author

@ekaf thank you for figuring out the full problem scope. I was unaware of the other parts of this.

Regarding the STARTING QUOTE and the END QUOTE part. I will have to look, but I believe the bug failed to be resolved unless I had the substitution present in both places. I would not add more lines of code than needed :)

@ekaf
Copy link
Contributor

ekaf commented Jan 21, 2024

@Higgs32584, I don't know the full problem scope, there could be more... Neither do I know the best place to do the substitution, but I have verified that it works when listing it only under STARTING_QUOTE. Having it only under PUNCTUATION also works.

@ekaf
Copy link
Contributor

ekaf commented Feb 4, 2024

The cause of the problem is that the two last lines under ENDING_QUOTE are handling contractions, using a regular expression that requires the contraction to be followed by a plain space. So the match fails when the contraction is followed by some other escaped whitespace-like character.
The problem can be solved by converting all whitespace escape sequences to a single plain space, for ex. with the following substitution pair:

(re.compile(r"\s+"), " "),

This substitution can probably be applied anywhere before the two last substitutions in ENDING_QUOTE, and I suppose it wouldn't
hurt to apply it early.

Copy link
Contributor

@ekaf ekaf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is redundant to apply the same substitution twice.

The following substitution handles more white-space like characters:
(re.compile(r"\s+"), " "),
It should be applied before the two last substitutions in ENDING_QUOTE, which are causing the problem.

@Higgs32584
Copy link
Author

@ekaf Thank you for the suggestions. Do you need some test cases as well? I know some repos want test cases added with every bug fix.

@ekaf
Copy link
Contributor

ekaf commented Feb 4, 2024

Thanks @Higgs32584, this looks good. Test cases are always much appreciated everywhere.

Copy link
Contributor

@ekaf ekaf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this solves #3189.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants