fix for word_tokenize() Failing to Split English Contractions When Followed by [\t\n\f\r] #3224
base: develop
Conversation
@stevenbird Hi, I was wondering what you thought about this. The issue that I hope to fix here is word_tokenize() failing to split English contractions when followed by [\t\n\f\r]. The solution I came up with was removing \t\n\f\r from the sentence. In terms of unintended side effects, I would not expect this to cause issues, since a space replaces the tab. I can add some tests to verify the fix, but first I was wondering whether you would be interested in merging it. Thank you!
@purificant can I get a review of this? I think it is solid and ready for merging. Thank you!
This doesn't solve the full problem: word_tokenize would still fail to split contractions followed by [\a\b\v]. So the first step would be trying to find the extent of the problem.
@ekaf thank you for figuring out the full problem scope. I was unaware of the other parts of this. Regarding the STARTING_QUOTE and ENDING_QUOTE part: I will have to look, but I believe the bug failed to be resolved unless I had the substitution present in both places. I would not add more lines of code than needed :)
@Higgs32584, I don't know the full problem scope, there could be more... Neither do I know the best place to do the substitution, but I have verified that it works when listed only under STARTING_QUOTE. Having it only under PUNCTUATION also works.
The cause of the problem is that the last two lines under ENDING_QUOTE handle contractions using a regular expression that requires the contraction to be followed by a plain space. So the match fails when the contraction is followed by some other escaped whitespace-like character.
This substitution can probably be applied anywhere before the last two substitutions in ENDING_QUOTE, and I suppose it wouldn't …
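The failure mode described here can be reproduced with a minimal standalone regex. This is a simplified, illustrative stand-in for one of the ENDING_QUOTE contraction rules, not the exact NLTK pattern:

```python
import re

# Illustrative simplification of an ENDING_QUOTE contraction rule:
# it requires a literal plain space after the "n't" contraction.
rule = re.compile(r"([^' ])(n't) ")

print(bool(rule.search("can't go")))   # True: contraction followed by a plain space
print(bool(rule.search("can't\tgo")))  # False: a tab breaks the match
```

Any whitespace character other than a plain space (tab, newline, form feed, carriage return) falls outside the literal `" "` in the pattern, so the contraction is never split off.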
It is redundant to apply the same substitution twice.
The following substitution handles more whitespace-like characters:
(re.compile(r"\s+"), " "),
It should be applied before the last two substitutions in ENDING_QUOTE, which are causing the problem.
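To make the interaction concrete, here is a minimal sketch assuming a simplified stand-in for the two contraction rules at the end of ENDING_QUOTE. The rule patterns and the `tokenize` function below are illustrative, not the actual NLTK source:

```python
import re

# Simplified stand-ins for the two contraction rules at the end of
# ENDING_QUOTE: both require a plain space after the contraction.
CONTRACTION_RULES = [
    (re.compile(r"([^' ])('[sS]|'[mM]|'[dD]|') "), r"\1 \2 "),
    (re.compile(r"([^' ])('ll|'LL|'re|'RE|'ve|'VE|n't|N'T) "), r"\1 \2 "),
]

def tokenize(text, normalize_whitespace=True):
    if normalize_whitespace:
        # The suggested fix: collapse tabs, newlines, form feeds, etc.
        # into a single plain space *before* the contraction rules run.
        text = re.sub(r"\s+", " ", text)
    for pattern, repl in CONTRACTION_RULES:
        text = pattern.sub(repl, text)
    return text.split()

print(tokenize("I can't\tgo", normalize_whitespace=False))  # ['I', "can't", 'go']
print(tokenize("I can't\tgo"))                              # ['I', 'ca', "n't", 'go']
```

Without the normalization step the tab-separated contraction stays glued together; with the `\s+` substitution applied first, it splits exactly as it would after a plain space.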
@ekaf Thank you for the suggestions. Do you need some test cases as well? I know some repos want test cases added with every bug fix.
Thanks @Higgs32584, this looks good. Test cases are always much appreciated everywhere.
I think this solves #3189.
With this pull request, I hope to fix issue #3189: word_tokenize() failing to split English contractions when followed by [\t\n\f\r].
Obviously this is a first draft, but I would like some initial review and comments. Thank you!