Deprecate return_str parameter in NLTKWordTokenizer and TreebankWordTokenizer #2879
Comments
Could we also consider deprecating this usage... I wonder why we did this!
Yeah, sounds good. I can't really imagine a use case for this parameter to begin with.
Hi, I would like to pick this up as my first issue if possible. Does NLTK have a standard way of deprecating parameters already? I have seen the |
I don't believe so, no. The decorators are only for deprecating classes, methods and functions, iirc. There are at least three options here:
I really don't see the use of |
As can be seen in the PR above (#2883), I have opted for option 2 in agreement with @tomaarsen's comment but am happy to rework if another solution is preferred.
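For readers wondering what deprecating a single parameter (rather than a whole function or class) can look like, here is a minimal sketch using only the standard library. It is not the code from #2883, and it is not necessarily the "option 2" mentioned above: `SketchTokenizer` is a hypothetical stand-in class, its whitespace-splitting `tokenize` is a placeholder, and the `None` sentinel default is an assumption made for illustration.

```python
import warnings
from typing import List, Optional


class SketchTokenizer:
    """Hypothetical stand-in for illustration; this is not NLTK's tokenizer."""

    def tokenize(self, text: str, return_str: Optional[bool] = None) -> List[str]:
        # Assumption: None means "the caller did not pass return_str".
        # Any explicit value triggers the deprecation warning.
        if return_str is not None:
            warnings.warn(
                "Parameter 'return_str' is deprecated and will be ignored.",
                category=DeprecationWarning,
                stacklevel=2,  # attribute the warning to the caller's line
            )
        # Placeholder tokenization: split on whitespace, always return a list.
        return text.split()


# Usage: record warnings so the deprecation can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    tokens = SketchTokenizer().tokenize("a b  c", return_str=True)

print(tokens)
print([str(w.message) for w in caught])
```

Keeping the parameter in the signature means existing callers keep working while being told to stop passing it; the parameter could then be removed in a later release.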
Hello!
I'd like to discuss a potential enhancement of NLTKWordTokenizer and TreebankWordTokenizer. For those unaware, the former is the tokenizer that is most frequently used, and is used in the word_tokenize function. It's also based on the latter class: TreebankWordTokenizer.
An example usage, as can be found in the documentation:
The enhancement
As you can see from the example, if return_str is True, then the tokenize method returns a space-separated string. However, the number of spaces is very inconsistent. Perhaps we would be better off stripping spaces on the ends, and replacing all sequences of multiple spaces with just one space.
E.g.
instead of
I figured I would create an issue for this to find out if others agree with this idea, before I put in the time to make this change for no reason. So, I'd like to hear your thoughts.
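The normalization proposed above (strip spaces at both ends, collapse runs of spaces into one) could be sketched roughly as follows. `normalize_spaces` is a hypothetical helper name, and the input string is made up to mimic uneven spacing; it is not actual tokenizer output.

```python
import re


def normalize_spaces(tokenized: str) -> str:
    """Collapse runs of two or more spaces into one, then strip the ends."""
    return re.sub(r" {2,}", " ", tokenized).strip(" ")


# Made-up input with inconsistent spacing, for illustration only.
raw = " Good muffins cost  $ 3.88  in New York .  "
print(normalize_spaces(raw))  # Good muffins cost $ 3.88 in New York .
```

This only touches the space character, which matches the wording of the proposal; tabs or newlines would need a broader pattern.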