Deprecate `return_str` parameter in `NLTKWordTokenizer` and `TreebankWordTokenizer` #2879

tomaarsen · 2021-11-07T22:25:57Z

Hello!

I'd like to discuss a potential enhancement of NLTKWordTokenizer and TreebankWordTokenizer. For those unaware, the former is the tokenizer that is most frequently used, and is used in the word_tokenize function. It's also based on the latter class: TreebankWordTokenizer.

An example of its usage, as found in the documentation:

>>> from nltk.tokenize import NLTKWordTokenizer
>>> s = '''Good muffins cost $3.88 (roughly 3,36 euros)\nin New York.  Please buy me\ntwo of them.\nThanks.'''
>>> NLTKWordTokenizer().tokenize(s)
['Good', 'muffins', 'cost', '$', '3.88', '(', 'roughly', '3,36',
'euros', ')', 'in', 'New', 'York.', 'Please', 'buy', 'me', 'two',
'of', 'them.', 'Thanks', '.']
>>> NLTKWordTokenizer().tokenize(s, convert_parentheses=True)
['Good', 'muffins', 'cost', '$', '3.88', '-LRB-', 'roughly', '3,36',
'euros', '-RRB-', 'in', 'New', 'York.', 'Please', 'buy', 'me', 'two',
'of', 'them.', 'Thanks', '.']
>>> NLTKWordTokenizer().tokenize(s, return_str=True)
' Good muffins cost  $ 3.88  ( roughly 3,36 euros ) \nin New York.  Please buy me\ntwo of them.\nThanks  .  '

The enhancement

As you can see from the example, if return_str is True, then the tokenize method returns a space-separated string. However, the number of spaces is very inconsistent. Perhaps we would be better off stripping spaces on the ends, and replacing all sequences of multiple spaces with just one space.

E.g.

>>> NLTKWordTokenizer().tokenize(s, return_str=True)
'Good muffins cost $ 3.88 ( roughly 3,36 euros ) \nin New York. Please buy me\ntwo of them.\nThanks .'

instead of

>>> NLTKWordTokenizer().tokenize(s, return_str=True)
' Good muffins cost  $ 3.88  ( roughly 3,36 euros ) \nin New York.  Please buy me\ntwo of them.\nThanks  .  '

I figured I would create an issue for this to find out if others agree with this idea, before I put in the time to make this change for no reason. So, I'd like to hear your thoughts.

Tom Aarsen


stevenbird · 2021-11-07T23:19:05Z

Could we also consider deprecating this usage... I wonder why we did this!

tomaarsen · 2021-11-08T09:56:50Z

Yeah, sounds good. I can't really imagine a use case for this parameter to begin with.

adamjhawley · 2021-11-09T21:18:52Z

Hi, I would like to pick this up as my first issue if possible. Does NLTK have a standard way of deprecating parameters already? I have seen the @Deprecated decorator but I am not sure how I could use it to deprecate the parameter only.

tomaarsen · 2021-11-10T09:30:33Z

I don't believe so, no. The decorators are only for deprecating classes, methods and functions, iirc.

There are at least three options here:

1. Raising a warning when return_str is True, but keeping the functionality the same. In the future, we would remove this functionality, and for now we just warn users. This would be like a slow deprecation.
2. Raising a warning when return_str is True, and updating the functionality to ignore the return_str parameter. This would be like a fast deprecation.
3. Simply deleting all references to return_str. This would break implementations that currently use return_str without giving feedback. This would be like a sudden deprecation.

I really don't see the use of return_str, so I don't have issues with going for option 2; that way we don't need to revisit this in a few months' time. Perhaps other people disagree with me though.
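
For illustration, the "warn and ignore" approach could look roughly like the sketch below. This is not the code from any NLTK pull request: `SketchTokenizer` is a hypothetical stand-in class, the whitespace split is a placeholder for the real tokenization rules, and the warning message and category are assumptions.

```python
import warnings


class SketchTokenizer:
    """Hypothetical stand-in for a tokenizer; not NLTK's implementation."""

    def tokenize(self, text, convert_parentheses=False, return_str=False):
        # Warn callers who still pass return_str, then ignore the parameter.
        if return_str:
            warnings.warn(
                "Parameter 'return_str' is deprecated and will be ignored.",
                category=DeprecationWarning,
                stacklevel=2,  # attribute the warning to the caller's line
            )
        # Placeholder for the real tokenization rules: always return a list.
        return text.split()


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure the warning is not filtered out
    tokens = SketchTokenizer().tokenize("Good muffins", return_str=True)

print(tokens)
print([str(w.message) for w in caught])
```

With this shape, existing calls that pass return_str=True keep running but get a list back along with a warning, rather than failing silently or raising.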

adamjhawley · 2021-11-10T19:22:13Z

As can be seen in the PR above (#2883), I have opted for option 2 in agreement with @tomaarsen's comment but am happy to rework if another solution is preferred.

tomaarsen added enhancement tokenizer labels Nov 7, 2021

tomaarsen added deprecation good first issue and removed enhancement labels Nov 8, 2021

tomaarsen changed the title ~~NLTKWordTokenizer and TreebankWordTokenizer enhancement with return_str~~ Deprecate return_str parameter in NLTKWordTokenizer and TreebankWordTokenizer Nov 8, 2021

adamjhawley mentioned this issue Nov 10, 2021

Deprecate 'return_str' parameter in NLTKWordTokenizer and TreebankWordTokenizer #2883

Merged

stevenbird closed this as completed in #2883 Nov 18, 2021
