Fixed several TreebankWordTokenizer and NLTKWordTokenizer bugs (#2877)
* Fixed issue with quote-tokenization, small regression for detokenization
* Updated double-quote to single quote in doctest output
* Resolved issue with 'wanna' absorbing a space too much in (de)tokenization
* Allow importing TreebankWordDetokenizer from nltk.tokenize
* Added additional test for span_tokenize
* Add span_tokenize to NLTKWordTokenizer, like in TreebankWordTokenizer
* Added credits for modifications
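To illustrate what a span_tokenize-style method has to produce, here is a minimal, standard-library sketch that maps a list of tokens back to character offsets in the original string. It is not the code from this commit: the helper name `align_tokens_sketch` is hypothetical, the token list is written by hand (assuming a Treebank-style split of "wanna" into "wan" and "na"), and it assumes every token appears verbatim in the text, so it does not handle the quote rewriting that the real tokenizers perform.

```python
from typing import List, Tuple


def align_tokens_sketch(tokens: List[str], text: str) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets for each token, scanning left to right.

    Hypothetical helper for illustration only; assumes each token occurs
    verbatim in `text`, in order.
    """
    spans = []
    cursor = 0  # never search before the end of the previous token
    for token in tokens:
        start = text.find(token, cursor)
        if start == -1:
            raise ValueError(f"token {token!r} not found after offset {cursor}")
        end = start + len(token)
        spans.append((start, end))
        cursor = end
    return spans


text = "I wanna go home."
# Hand-written tokens, assuming "wanna" is split into "wan" + "na".
tokens = ["I", "wan", "na", "go", "home", "."]
spans = align_tokens_sketch(tokens, text)
print(spans)  # [(0, 1), (2, 5), (5, 7), (8, 10), (11, 15), (15, 16)]
```

The two halves of "wanna" get adjacent spans, (2, 5) and (5, 7), with no space absorbed between them. Slicing the text with each span gives back the original token, which is the property a span_tokenize test can check.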
Showing 4 changed files with 91 additions and 7 deletions.