Avoid recursive suffix stripping in wordnet morphy #3225
Fix #2567: the implementation of WordNet's _morphy lemmatizer in NLTK is buggy because it adds a recursive step that is not part of the morphy program, either as implemented in the "morph.c" source file included with the original Princeton WordNet or as described in the morphy manual.
Currently, when one pass over the possible morphological substitutions does not yield any known lemma, additional passes may be applied to the results. For example:
After this PR, the "s" suffix is only stripped once, leading to the rejection of the bad "catss" form:
This last result agrees with the official Princeton output, as well as with the implementation in Wn by @goodmami.
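The difference between the two behaviours can be sketched with a toy model. This is not the NLTK code: the lexicon, the three suffix rules, and the function names below are assumptions chosen only to illustrate how repeated passes can accept a form like "catss" while a single pass rejects it.

```python
# Toy model, NOT the NLTK implementation: a hypothetical lexicon and a
# small, assumed subset of noun suffix rules of the kind morphy uses.
LEXICON = {"cat", "church"}
NOUN_RULES = [("s", ""), ("ses", "s"), ("ches", "ch")]


def apply_rules(forms):
    """Apply every matching suffix substitution once to each form."""
    return [
        form[: -len(old)] + new
        for form in forms
        for old, new in NOUN_RULES
        if form.endswith(old)
    ]


def filter_known(forms):
    """Keep only forms found in the lexicon, dropping duplicates."""
    return list(dict.fromkeys(f for f in forms if f in LEXICON))


def morphy_single_pass(word):
    """One pass over the substitutions (the behaviour this PR aims for)."""
    return filter_known([word] + apply_rules([word]))


def morphy_recursive(word):
    """Re-apply the rules to intermediate results until a lemma turns up."""
    forms = apply_rules([word])
    results = filter_known([word] + forms)
    # Every rule shortens the form, so this loop always terminates.
    while forms and not results:
        forms = apply_rules(forms)
        results = filter_known(forms)
    return results


print(morphy_recursive("catss"))    # strips "s" twice
print(morphy_single_pass("catss"))  # strips "s" once, finds nothing
```

In this sketch the recursive variant reaches "cat" by way of the unattested intermediate form "cats" -> "cat", whereas the single-pass variant returns an empty list for "catss" and still handles regular plurals such as "cats" and "churches".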
Similarly, the following errors do not occur after this PR:
On the other hand, this PR does not change the results of @tomaarsen's plurals test: WordNetLemmatizer still performs better 470 times, morphy 32 times, and there are 62 ties.
However, this PR proposes to add a few comments in WordNetLemmatizer to make it clearer that its lemmatize() function picks the shortest lemma among the candidates returned by _morphy, and that this behaviour is not a bug but a feature.
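The shortest-lemma selection described above can be sketched as follows. The helper name `pick_shortest` and the candidate lists are hypothetical; the sketch only illustrates the selection rule and assumes that the input word is returned unchanged when there are no candidates.

```python
def pick_shortest(word, candidates):
    """Return the shortest candidate lemma, or the word itself if none exist.

    `candidates` stands in for the list returned by _morphy; when several
    candidates share the minimum length, the first one wins.
    """
    return min(candidates, key=len) if candidates else word


# Hypothetical candidate lists, for illustration only.
print(pick_shortest("leaves", ["leave", "leaf"]))
print(pick_shortest("catss", []))
```

With these inputs, the first call selects "leaf" because it is shorter than "leave", and the second falls back to the input word because the candidate list is empty.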