Fixed several TreebankWordTokenizer and NLTKWordTokenizer bugs #2877
Resolves #1750, resolves #2076, resolves #2876
Hello!
Pull request overview
- Fixed span_tokenize failing for texts that use '' (two single quotes) as starting quotes.
- Fixed the 'wanna' rule in MacIntyreContractions for TreebankWordTokenizer and NLTKWordTokenizer.
- Added span_tokenize to NLTKWordTokenizer, just like in TreebankWordTokenizer.
- Improved accessibility of TreebankWordDetokenizer, allowing from nltk.tokenize import TreebankWordDetokenizer.

Bug 1
Relevant issues: #1750 and #2076
Reproduction
Calling span_tokenize on text that uses '' as starting quotes fails, as reported in the linked issues.
Details & The fix
The true cause here is that the span_tokenize method expects tokens with quotes to be exclusively '', ", or `` (nltk/nltk/tokenize/treebank.py, lines 183 to 186 in ec1d49d).
This is because span_tokenize assumes that the tokenized output will always split off these quotes into separate tokens. However, this is currently not the case: for example, when '' opens a quotation, the quote stays attached to the word that follows it. This is the true underlying bug, and it exists in both TreebankWordTokenizer and NLTKWordTokenizer!
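To make the mismatch concrete, here is a simplified sketch of the quote realignment that span_tokenize performs (the token list below is illustrative, not actual tokenizer output):

```python
import re

# span_tokenize collects every quote occurrence in the raw text and pops one
# for each standalone quote token. A fused token such as ''Hello breaks this
# one-to-one mapping.
text = "''Hello'' he said"
matched = [m.group() for m in re.finditer(r"``|'{2}|\"", text)]
print(matched)  # ["''", "''"] - two quote occurrences in the text

# Illustrative (buggy) tokenizer output: the opening quote stays fused.
tokens = ["''Hello", "''", "he", "said"]
quote_tokens = [t for t in tokens if t in ('"', "``", "''")]
print(len(quote_tokens))  # only 1 standalone quote token: the counts no longer match
```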
The cause is the following ENDING_QUOTES rule (nltk/nltk/tokenize/treebank.py, line 90 in ec1d49d):

(\S)(\'\')

It requires a non-whitespace character before the ''. Only then will it separate the quote from what precedes it, and, crucially, only then will it place a space on the right-hand side of the ''. I don't believe this non-whitespace requirement matters, so we can remove it. Upon making this change, the previously mentioned example tokenizes correctly.
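As a sketch of the change (the exact replacement rule in the PR may differ slightly), dropping the \S requirement behaves like this:

```python
import re

# Old ENDING_QUOTES rule: '' is only split off when a non-whitespace
# character precedes it.
old = (re.compile(r"(\S)(\'\')"), r"\1 \2 ")
# Sketch of the fix: pad every '' with spaces, no preceding \S required.
new = (re.compile(r"(\'\')"), r" \1 ")

text = "''Hello'' he said"
print(old[0].sub(old[1], text))  # the opening '' stays fused to Hello
print(new[0].sub(new[1], text))  # both quotes are separated
```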
And the issues reported for span_tokenize disappear, as expected.

Undesired consequences of the fix
This PR improves tokenization by separating these quotes in a better way, but it does have an undesired consequence: TreebankWordDetokenizer doesn't work as well as a result. Or rather, TreebankWordDetokenizer works identically, but the output of the tokenizer has changed somewhat, so a tokenize-then-detokenize round trip no longer always reconstructs the original quotes.
This is because the Detokenizer doesn't consider '' as the start of a quotation, but only as the end. After all, if we let it detokenize ["one", "''", "two"], how would it know whether `one" two` or `one "two` is correct? It defaults to `one" two`, as " is generally an ending quote, with "``" as a beginning quote.

Notes for Bug 1
This PR modifies both NLTKWordTokenizer and TreebankWordTokenizer. I recognise that it might be preferable to keep the latter true to the original (as much as possible). If we want to do that, then I can simply revert the changes there, but do note that the span_tokenize issues will then remain unsolved.

These issues can also be solved in different ways, though: we could instead direct users to the (new) span_tokenize of NLTKWordTokenizer which this PR adds, but I'm not a great fan of that.

Bug 2
Relevant issue: #2876
Reproduction
Detokenizing a token sequence in which wan, na is followed by another word glues wanna onto that next word.
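A sketch of what goes wrong, simulating the detokenizer's contraction reversal with plain re (the input string is illustrative):

```python
import re

# As I understand it, the detokenizer substitutes the (?#X) comment in
# r"(?i)\b(wan)(?#X)(na)\s" with \s and rejoins matches with r"\1\2".
# The trailing \s is part of the match, so the space before the next
# word is swallowed along with it.
detok_rule = re.compile(r"(?i)\b(wan)\s(na)\s")
print(detok_rule.sub(r"\1\2", "I wan na watch this"))  # I wannawatch this
```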
Details & The fix
This issue originates from the last line of this set of CONTRACTION regexes (nltk/nltk/tokenize/destructive.py, lines 19 to 28 in ec1d49d).
The rule requires the phrase to end in whitespace, matching it with \s (one character wide) rather than \b (a word boundary, zero characters wide). The issue is that the Detokenizer considers the whitespace (e.g. a space) to be part of the regex match; it is then replaced, while it shouldn't be.

A simple solution is to just use \b. However, this fails for the edge case of wanna-be, where wanna should not be split.

So, another solution is to still check for \s, but not include it in the match. This can be done with a positive lookahead:

r"(?i)\b(wan)(?#X)(na)(?=\s)",
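The difference between \b and the lookahead can be sketched with plain re (splitting the way the tokenizer does; the replacement string is a simplification):

```python
import re

# With \b alone, "wanna-be" would wrongly be split as well, since the
# boundary between "na" and "-" satisfies \b.
bound = re.compile(r"(?i)\b(wan)(na)\b")
# The lookahead still demands trailing whitespace, but leaves it unconsumed.
fixed = re.compile(r"(?i)\b(wan)(na)(?=\s)")

print(bound.sub(r"\1 \2 ", "a wanna-be star"))  # wrongly split
print(fixed.sub(r"\1 \2 ", "a wanna-be star"))  # left intact
print(fixed.sub(r"\1 \2 ", "I wanna watch"))    # split as intended
```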
After this fix, both wanna watch and wanna-be work properly. There is a test for both of these cases.

Added span_tokenize to NLTKWordTokenizer
As NLTKWordTokenizer is merely NLTK's improved version of TreebankWordTokenizer, it somewhat surprised me that NLTKWordTokenizer didn't already provide all the methods that TreebankWordTokenizer has. Because the two classes are so similar, the two span_tokenize methods are identical.
However, if we choose not to update TreebankWordTokenizer, then span_tokenize for TreebankWordTokenizer will remain broken, while the one for NLTKWordTokenizer should still work. I took the doctests from TreebankWordTokenizer's span_tokenize and added them to tokenize.doctest too.

Improved accessibility of TreebankWordDetokenizer
Currently, TreebankWordDetokenizer can only be imported with:

from nltk.tokenize.treebank import TreebankWordDetokenizer

I've added TreebankWordDetokenizer to nltk/tokenize/__init__.py, allowing users to import it like so:

from nltk.tokenize import TreebankWordDetokenizer

Future changes
- Create some common interface for NLTKWordTokenizer and TreebankWordTokenizer. There is a good bit of duplicated code between them, especially the new span_tokenize.