Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deprecate return_str parameter in NLTKWordTokenizer and TreebankWordTokenizer #2879

Closed
tomaarsen opened this issue Nov 7, 2021 · 5 comments · Fixed by #2883
Closed

Deprecate return_str parameter in NLTKWordTokenizer and TreebankWordTokenizer #2879

tomaarsen opened this issue Nov 7, 2021 · 5 comments · Fixed by #2883

Comments

@tomaarsen
Copy link
Member

Hello!

I'd like to discuss a potential enhancements of NLTKWordTokenizer and TreebankWordTokenizer. For those unaware, the former is the tokenizer that is most frequently used, and is used in the word_tokenize function. It's also based on the latter class: TreebankWordTokenizer.

An example usage as can be found in the documentation:

>>> from nltk.tokenize import NLTKWordTokenizer
>>> s = '''Good muffins cost $3.88 (roughly 3,36 euros)\nin New York.  Please buy me\ntwo of them.\nThanks.'''
>>> NLTKWordTokenizer().tokenize(s)
['Good', 'muffins', 'cost', '$', '3.88', '(', 'roughly', '3,36',
'euros', ')', 'in', 'New', 'York.', 'Please', 'buy', 'me', 'two',
'of', 'them.', 'Thanks', '.']
>>> NLTKWordTokenizer().tokenize(s, convert_parentheses=True)
['Good', 'muffins', 'cost', '$', '3.88', '-LRB-', 'roughly', '3,36',
'euros', '-RRB-', 'in', 'New', 'York.', 'Please', 'buy', 'me', 'two',
'of', 'them.', 'Thanks', '.']
>>> NLTKWordTokenizer().tokenize(s, return_str=True)
' Good muffins cost  $ 3.88  ( roughly 3,36 euros ) \nin New York.  Please buy me\ntwo of them.\nThanks  .  '

The enhancement

As you can see from the example, if return_str is True, then the tokenize method returns a space-separated string. However, the number of spaces is very inconsistent. Perhaps we would be better off stripping spaces on the ends, and replacing all sequences of multiple spaces with just one space.

E.g.

>>> NLTKWordTokenizer().tokenize(s, return_str=True)
'Good muffins cost $ 3.88 ( roughly 3,36 euros ) \nin New York. Please buy me\ntwo of them.\nThanks .'

instead of

>>> NLTKWordTokenizer().tokenize(s, return_str=True)
' Good muffins cost  $ 3.88  ( roughly 3,36 euros ) \nin New York.  Please buy me\ntwo of them.\nThanks  .  '

I figured I would create an issue for this to find out if others agree with this idea, before I put in the time to make this change for no reason. So, I'd like to hear your thoughts.

  • Tom Aarsen
@stevenbird
Copy link
Member

Could we also consider deprecating this usage... I wonder why we did this!

@tomaarsen
Copy link
Member Author

Yeah, sounds good. I can't really imagine a use case for this parameter to begin with.

@tomaarsen tomaarsen changed the title NLTKWordTokenizer and TreebankWordTokenizer enhancement with return_str Deprecate return_str parameter in NLTKWordTokenizer and TreebankWordTokenizer Nov 8, 2021
@adamjhawley
Copy link
Contributor

Hi, I would like to pick this up as my first issue if possible. Does NLTK have a standard way of deprecating parameters already? I have seen the @Deprecated decorator but I am not sure how I could use it to deprecate the parameter only.

@tomaarsen
Copy link
Member Author

I don't believe so, no. The decorator are only for deprecating classes, methods and functions, iirc.

There's at least three options here:

  1. Raising a warning when return_str is True, but having the functionality the same. In the future, we would remove this functionality, and for now we just warn users. This would be like a slow deprecation.
  2. Raising a warning when return_str is True, and updating the functionality to ignore the return_str parameter. This would be like a fast deprecation.
  3. Simply deleting all references of return_str. This would break implementations that currently use return_str without giving feedback. This would be like a sudden deprecation.

I really don't see the use of return_str, so I don't have issues with going for option 2, so we don't need to revisit this in a few months time. Perhaps other people disagree with me though.

@adamjhawley
Copy link
Contributor

adamjhawley commented Nov 10, 2021

As can be seen in the PR above (#2883), I have opted for option 2 in agreement with @tomaarsen's comment but am happy to rework if another solution is preferred.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants