Fixed several TreebankWordTokenizer and NLTKWordTokenizer bugs #2877

Merged (7 commits, Nov 6, 2021)

Conversation

@tomaarsen (Member) commented on Nov 4, 2021

Resolves #1750, resolves #2076, resolves #2876

Hello!

Pull request overview

  • (Bug 1) Correct tokenisation error when using '' (two single quotes) as starting quotes.
  • (Bug 2) Resolved issue with 'wanna' in MacIntyreContractions for TreebankWordTokenizer and NLTKWordTokenizer.
  • Implement span_tokenize in NLTKWordTokenizer, just like in TreebankWordTokenizer.
  • Allow TreebankWordDetokenizer to be imported with from nltk.tokenize import TreebankWordDetokenizer.

Bug 1

Relevant issues: #1750 and #2076

Reproduction

from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
text = "''Hello'\""
print(list(tokenizer.span_tokenize(text)))

produces

Traceback (most recent call last):
  File "[sic]\nltk\tokenize\util.py", line 290, in align_tokens
    start = sentence.index(token, point)
ValueError: substring not found

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "[sic]\nltk_1750.py", line 42, in <module>
    print(list(tokenizer.span_tokenize(text)))
  File "[sic]\nltk\tokenize\treebank.py", line 190, in span_tokenize
    yield from align_tokens(tokens, text)
  File "[sic]\nltk\tokenize\util.py", line 292, in align_tokens
    raise ValueError(f'substring "{token}" not found in "{sentence}"') from e
ValueError: substring "''" not found in "''Hello'""

Details & The fix

The true cause is that the span_tokenize method expects quote tokens to be exactly '', " or ``:

tokens = [
    matched.pop(0) if tok in ['"', "``", "''"] else tok
    for tok in raw_tokens
]

This is because span_tokenize assumes that the tokenized output will always split off these quotes into separate tokens. However, this is not the case currently. For example:

from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
text = "''Hello''"
print(tokenizer.tokenize(text))

produces

["''Hello", "''"]

This is the true underlying bug. This bug exists in both TreebankWordTokenizer and NLTKWordTokenizer!
The cause is the following ENDING_QUOTES rule:

(re.compile(r"(\S)(\'\')"), r"\1 \2 "),

(\S)(\'\') requires a non-whitespace character before the ''. Only then will it split the quote from what precedes it, and, crucially, only then will it place a space on the right-hand side of the ''. I don't believe this non-whitespace requirement matters, so we can remove it and use:

    (re.compile(r"''"), " '' "),
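To see what this changes in isolation, here is a minimal sketch of just this one substitution (the real tokenizer applies many more rules before the final split):

import re

text = "''Hello''"

# Old rule: the leading '' has no non-whitespace character in front of it,
# so it never gets split off or padded with spaces.
print(re.sub(r"(\S)(\'\')", r"\1 \2 ", text).split())
# ["''Hello", "''"]

# New rule: every '' is padded with spaces, regardless of what precedes it.
print(re.sub(r"''", " '' ", text).split())
# ["''", 'Hello', "''"]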

Upon making this change, the tokenize example above now outputs:

["''", "Hello", "''"]

With that, the issues reported for span_tokenize disappear as expected.
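To make the connection explicit: once the quotes are separate tokens, align_tokens (the helper seen in the traceback above) can locate every token in the original string. A minimal sketch:

from nltk.tokenize.util import align_tokens

# Every token now occurs verbatim in the original string,
# so sentence.index() succeeds and no ValueError is raised.
print(align_tokens(["''", "Hello", "''"], "''Hello''"))
# [(0, 2), (2, 7), (7, 9)]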

Undesired consequences of the fix

This PR improves tokenization by splitting off these quotes more consistently, but it has an undesired consequence: TreebankWordDetokenizer no longer round-trips the output as well. To be precise, TreebankWordDetokenizer itself behaves identically, but the tokenizer output it receives has changed. For example:

from nltk.tokenize.treebank import TreebankWordDetokenizer
detokenizer = TreebankWordDetokenizer()
print(detokenizer.detokenize(["''", "Hello", "''"]))

produces

" Hello"

This is because the detokenizer does not treat '' as the start of a quotation, only as the end. After all, if we ask it to detokenize ["one", "''", "two"], how would it know whether one" two or one "two is correct? It defaults to one" two, since " is generally an ending quote, with `` as the beginning quote.
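For illustration, the ambiguous case from above (the output shown is the default behaviour just described):

from nltk.tokenize.treebank import TreebankWordDetokenizer

# '' is treated as a closing quote, so it attaches to the preceding token.
print(TreebankWordDetokenizer().detokenize(["one", "''", "two"]))
# one" two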

Notes for Bug 1

This PR modifies both NLTKWordTokenizer and TreebankWordTokenizer. I recognise that it might be preferable to keep the latter as true to the original as possible. If we want that, I can simply revert the changes there, but note that the span_tokenize issues would then remain unsolved for TreebankWordTokenizer.
These issues could also be solved in other ways, e.g. by directing users to the (new) NLTKWordTokenizer span_tokenize that this PR adds, but I'm not a great fan of that.


Bug 2

Relevant issue: #2876

Reproduction

from nltk.tokenize.treebank import TreebankWordTokenizer, TreebankWordDetokenizer

tokens = TreebankWordTokenizer().tokenize("I wanna watch something")
print(tokens)
sentence = TreebankWordDetokenizer().detokenize(tokens)
print(sentence)

produces

I wannawatch something

Details & The fix

This issue originates from the last line of the CONTRACTIONS2 regexes in MacIntyreContractions:

CONTRACTIONS2 = [
    r"(?i)\b(can)(?#X)(not)\b",
    r"(?i)\b(d)(?#X)('ye)\b",
    r"(?i)\b(gim)(?#X)(me)\b",
    r"(?i)\b(gon)(?#X)(na)\b",
    r"(?i)\b(got)(?#X)(ta)\b",
    r"(?i)\b(lem)(?#X)(me)\b",
    r"(?i)\b(more)(?#X)('n)\b",
    r"(?i)\b(wan)(?#X)(na)\s",
]

The rule requires the phrase to end in whitespace, matched with \s (which consumes one character), rather than \b (a zero-width word boundary). As a result, the detokenizer considers the whitespace (e.g. a space) to be part of the match, and it is replaced along with the contraction when it shouldn't be.
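Roughly what happens inside the detokenizer for this rule, as far as I can tell (a simplified sketch of the mechanism, not the actual detokenizer code): the (?#X) marker is swapped for \s so the split contraction can be matched, and the whole match, trailing space included, is replaced by the two groups glued together.

import re

# Sketch: a "wan na "-style match is collapsed back to "wanna",
# and because the pattern ends in \s, the following space is swallowed too.
pattern = re.compile(r"(?i)\b(wan)\s(na)\s")
print(pattern.sub(r"\1\2", "I wan na watch something"))
# I wannawatch something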

A simple solution is to just use \b. However, that fails for the edge case of wanna-be: \b also matches before the hyphen, so wanna would be split even though it should not be.

So, another solution is to still require the \s, but not include it in the match. This can be done with a positive lookahead:

    r"(?i)\b(wan)(?#X)(na)(?=\s)", 

After this fix, both wanna watch and wanna-be work properly. There is a test for both of these cases.
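To compare the three variants in isolation (a sketch of just the regex behaviour, outside the tokenizer):

import re

OLD = r"(?i)\b(wan)(?#X)(na)\s"      # consumes the trailing space
WB = r"(?i)\b(wan)(?#X)(na)\b"       # would also match inside "wanna-be"
NEW = r"(?i)\b(wan)(?#X)(na)(?=\s)"  # requires a following space but does not consume it

print(re.search(OLD, "I wanna watch").group(0))   # 'wanna ' (space included)
print(re.search(NEW, "I wanna watch").group(0))   # 'wanna'  (space left alone)
print(re.search(WB, "a wanna-be star").group(0))  # 'wanna'  -> 'wanna-be' would be split
print(re.search(NEW, "a wanna-be star"))          # None     -> 'wanna-be' stays intact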


Added span_tokenize to NLTKWordTokenizer

As NLTKWordTokenizer is essentially NLTK's improved version of TreebankWordTokenizer, it kind of surprised me that it didn't already provide all of the methods that TreebankWordTokenizer has. Because the two classes are so similar, the two span_tokenize implementations are identical.
However, if we choose not to update TreebankWordTokenizer, then its span_tokenize will remain broken, while the one for NLTKWordTokenizer will work. I took the doctests from TreebankWordTokenizer's span_tokenize and added them to tokenize.doctest as well.
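For completeness, the new method is used just like the existing one on TreebankWordTokenizer (a sketch; the exact spans naturally depend on the tokenizer rules):

from nltk.tokenize.destructive import NLTKWordTokenizer

s = "Good muffins cost $3.88 in New York."
# span_tokenize yields (start, end) character offsets into s,
# so s[start:end] recovers the text of each token.
for start, end in NLTKWordTokenizer().span_tokenize(s):
    print((start, end), repr(s[start:end]))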

Improved accessibility of TreebankWordDetokenizer

Currently, TreebankWordDetokenizer can only be imported with:

from nltk.tokenize.treebank import TreebankWordDetokenizer 

I've added TreebankWordDetokenizer to nltk/tokenize/__init__.py, allowing users to import it like so:

from nltk.tokenize import TreebankWordDetokenizer 
# or even
from nltk import TreebankWordDetokenizer 

Future changes

Create some common interface for NLTKWordTokenizer and TreebankWordTokenizer. There is a good bit of duplicated code between them, especially the new span_tokenize.
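One possible shape for such a shared interface (purely a hypothetical sketch, not part of this PR; the real span_tokenize would also need the quote handling discussed under Bug 1):

from nltk.tokenize.api import TokenizerI
from nltk.tokenize.util import align_tokens

class PunctTokenizerBase(TokenizerI):
    """Hypothetical shared base: subclasses supply tokenize(),
    and span_tokenize() is derived from it via align_tokens()."""

    def tokenize(self, text):
        raise NotImplementedError

    def span_tokenize(self, text):
        # Simplified: assumes every token occurs verbatim in `text`,
        # which is exactly why the quote fix from Bug 1 matters.
        yield from align_tokens(self.tokenize(text), text)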

  • Tom Aarsen

@stevenbird (Member) commented:

@tomaarsen thanks for the analysis... I like your compromise which favours the correctness of tokenisation over detokenisation (which is less often required)

@tomaarsen deleted the bugfix/treebank-quotes branch on November 7, 2021.
hosseinkhaledi added a commit to arushadev/piraye that referenced this pull request on Dec 15, 2022:
* Update nltk version
nltk/nltk#2877

* Remove hazm requirements