Improved Tokenize documentation + added TokenizerI as superclass for TweetTokenizer #2878

tomaarsen · 2021-11-07T22:11:23Z

Hello!

Pull request overview

- Improved tokenize documentation:
  - Added docstrings for span_tokenize and tokenize for NLTKWordTokenizer and TreebankWordTokenizer.
  - Added Python 3.5+ typing to several methods.
- Made TokenizerI the superclass for TweetTokenizer, which adds the tokenize_sents method for free.

The documentation changes speak for themselves, so I'll only briefly discuss the other changes.

New superclass for `TweetTokenizer`

TweetTokenizer was (I believe) the only tokenizer that did not yet subclass TokenizerI. Subclassing TokenizerI gives it tokenize_sents automatically and makes it easier to implement span_tokenize in the future.
I've added a test to show that this works correctly.
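
To illustrate why subclassing is enough, here is a minimal sketch of the pattern. It does not use NLTK's actual code: `TokenizerBase` and `ToyTweetTokenizer` are hypothetical stand-ins for TokenizerI and TweetTokenizer, and the whitespace splitting is only a placeholder for the real tokenization logic. The point is that the base class defines tokenize_sents once in terms of tokenize, so any subclass that implements tokenize inherits it.

```python
from abc import ABC, abstractmethod
from typing import List


class TokenizerBase(ABC):
    """Hypothetical stand-in for the TokenizerI interface."""

    @abstractmethod
    def tokenize(self, s: str) -> List[str]:
        """Split a single string into tokens."""

    def tokenize_sents(self, strings: List[str]) -> List[List[str]]:
        # Defined once on the base class, purely in terms of tokenize().
        return [self.tokenize(s) for s in strings]


class ToyTweetTokenizer(TokenizerBase):
    """Hypothetical whitespace tokenizer; not the real TweetTokenizer."""

    def tokenize(self, s: str) -> List[str]:
        return s.split()


tokenizer = ToyTweetTokenizer()
# tokenize_sents is inherited; the subclass only implemented tokenize().
result = tokenizer.tokenize_sents(["@nltk is great", "so is #python"])
print(result)
```

The subclass never mentions tokenize_sents, which is the "for free" part of the change described above.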

Oh, also, I made a small mistake with git, so there's one unrelated commit which I didn't see until just now. It doesn't affect the PR, and it shouldn't matter if we just squash and merge anyway, but it can be removed from the squash commit message.

Tom Aarsen

tomaarsen · 2021-11-20T21:25:39Z

Resolved merge conflicts with return_str deprecations.

stevenbird · 2021-11-21T07:38:49Z

Thanks @tomaarsen!

tomaarsen · 2021-11-21T11:13:04Z

Gladly!

tomaarsen added 5 commits November 4, 2021 21:10

Add span_tokenize to NLTKWordTokenizer, just like in TreebankWordTokenizer

2c57d1c

Merge branch 'develop' of https://github.com/nltk/nltk into develop

e232fe8

Added documentation for core tokenization modules

f49107b

Added tokenize_sents method to TweetTokenizer

6cf7c0a

By subclassing it with TokenizerI

Resolved documentation indentation issue in tokenize/casual.py

24af735

tomaarsen added documentation tokenizer labels Nov 7, 2021

tomaarsen added 2 commits November 7, 2021 23:16

Fixed copy-paste issue in tokenize docstring

2218952

Merge branch 'develop' of https://github.com/nltk/nltk into enhancement/tokenize_documentation

1de3f72

stevenbird merged commit b30b6ac into nltk:develop Nov 21, 2021

tomaarsen deleted the enhancement/tokenize_documentation branch November 21, 2021 11:13
