Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid recursive suffix stripping in wordnet morphy #3225

Open
wants to merge 7 commits into
base: develop
Choose a base branch
from

Conversation

ekaf
Copy link
Contributor

@ekaf ekaf commented Jan 5, 2024

Fix #2567: the implementation of Wordnet's __morphy lemmatizer in NLTK is buggy, because it adds a recursive step, which is not a part of the morphy program, as implemented in the "morph.c" source file that was included with the original Princeton WordNet, and described in the morphy manual.

So currently, when one pass over the possible morphological substitutions does not yield any known lemma, additional passes may be applied on the results. For ex.:

from nltk.corpus import wordnet as wn

for w in ['cats', 'catss']:
    print(f"{w} -> {wn._morphy(w, pos='n')}")

cats -> ['cat']
catss -> ['cat']

After this PR, the "s" suffix is only stripped once, leading to the rejection of the bad "catss" form:

cats -> ['cat']
catss -> []

This last result agrees with the official Princeton output, as well as with the implementation in Wn by @goodmami.

Similarly, the following errors do not occur after this PR:

Call Before After
_morphy('possesses', 'n') ['posse'] []
_morphy('ramesses', 'v') ['ram'] []
_morphy('iss', 'n') ['i'] []
_morphy('anchoresses', 'v') ['anchor'] []
_morphy('askeses', 'v') ['ask'] []
_morphy('bibless', 'n') ['bible'] []
_morphy('bowses', 'n') ['bow'] []
_morphy('carses', 'n') ['car'] []
_morphy('cateresses', 'v') ['cater'] []
_morphy('chowses', 'n') ['chow'] []
_morphy('hydrases', 'n') ['hydra'] []
_morphy('idlesses', 'n') ['idle'] []
_morphy('idlesses', 'v') ['idle'] []
_morphy('marses', 'v') ['mar'] []
_morphy('pareses', 'v') ['pare'', ' 'par'] []
_morphy('replicases', 'n') ['replica'] []
_morphy('semises', 'n') ['semi'] []
_morphy('tootses', 'n') ['toot'] []
_morphy('tootses', 'v') ['toot'] []
_morphy('torqueses', 'n') ['torque'] []

On the other hand. this PR does not change the results of @tomaarsen's plurals test: WordNetLemmatizer still performs better 470 times, morphy 32 times, and there are 62 ties.

However, this PR proposes to add a few comments in WordNetLemmatizer, to make it more clear that its lemmatize() function picks the shortest lemma among the candidates returned by _morphy, and that this behaviour is not a bug but a feature.

@ekaf ekaf marked this pull request as draft January 13, 2024 09:33
@ekaf
Copy link
Contributor Author

ekaf commented Jan 13, 2024

Converted to draft, as it seems possible to handle more of the issues related to WordNetLemmatizer,.

@ekaf ekaf marked this pull request as ready for review January 14, 2024 08:47
@ekaf
Copy link
Contributor Author

ekaf commented Jan 14, 2024

The historical WordNet lemmatizer is Morphy, so many users would intuitively expect a more standard behaviour from WordNetLemmatizer.lemmatize(). But instead, that wrapper is defined by non-standard features: it defaults to nouns, picks the shortest lemma, and eventually accepts any word not included in WordNet.
So this PR proposes to also fix #1978 and #3227, by adding an alias to wordnet's _morphy() and morphy() functions, for those users of the WordNetLemmatizer class who want access to a more standard
WordNet lemmatizer.
On the other hand, this PR leaves the lemmatize() function unchanged, for the many users who have been accustomed to its non-standard behaviour for a long time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

WordNetLemmatizer not properly lemmatizing some words
1 participant