Avoid recursive suffix stripping in wordnet morphy #3225

ekaf · 2024-01-05T08:43:30Z

Fix #2567: NLTK's implementation of WordNet's _morphy lemmatizer is buggy because it adds a recursive step that is not part of the morphy program, either as implemented in the "morph.c" source file that was included with the original Princeton WordNet or as described in the morphy manual.

Currently, when one pass over the possible morphological substitutions does not yield any known lemma, additional passes may be applied to the results of the previous pass. For example:

from nltk.corpus import wordnet as wn

for w in ['cats', 'catss']:
    print(f"{w} -> {wn._morphy(w, pos='n')}")

cats -> ['cat']
catss -> ['cat']

After this PR, the "s" suffix is only stripped once, leading to the rejection of the bad "catss" form:

cats -> ['cat']
catss -> []

This last result agrees with the official Princeton output, as well as with the implementation in Wn by @goodmami.
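The difference between the two behaviours can be sketched with a toy model. This is not the NLTK code: the rule list is a small subset chosen for illustration, LEXICON is a hypothetical stand-in for WordNet's noun index, and the function names are made up for this example. The exception lists and the check of the unmodified input form are left out.

```python
# Toy model only: a few noun suffix rules (suffix, replacement) and a
# hypothetical three-word lexicon standing in for WordNet.
NOUN_RULES = [("s", ""), ("ses", "s"), ("ies", "y")]
LEXICON = {"cat", "posse", "bible"}


def apply_rules(form, rules=NOUN_RULES):
    """Return every form obtained by applying one matching rule once."""
    return [form[: -len(old)] + new for old, new in rules if form.endswith(old)]


def single_pass(form, lexicon=LEXICON):
    """Strip a suffix at most once (the behaviour this PR aims for)."""
    candidates = apply_rules(form)
    # dict.fromkeys removes duplicates while keeping order
    return [c for c in dict.fromkeys(candidates) if c in lexicon]


def recursive(form, lexicon=LEXICON):
    """Keep re-applying the rules until something matches (the old behaviour)."""
    forms = apply_rules(form)
    while forms:
        results = [f for f in dict.fromkeys(forms) if f in lexicon]
        if results:
            return results
        # No known lemma yet: strip suffixes again from the previous results
        forms = [g for f in forms for g in apply_rules(f)]
    return []


for w in ["cats", "catss", "possesses"]:
    print(f"{w}: recursive={recursive(w)} single_pass={single_pass(w)}")
```

In this toy model, the recursive variant reduces "catss" to "cat" and "possesses" to "posse", while the single-pass variant rejects both.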

Similarly, the following errors do not occur after this PR:

Call	Before	After
_morphy('possesses', 'n')	['posse']	[]
_morphy('ramesses', 'v')	['ram']	[]
_morphy('iss', 'n')	['i']	[]
_morphy('anchoresses', 'v')	['anchor']	[]
_morphy('askeses', 'v')	['ask']	[]
_morphy('bibless', 'n')	['bible']	[]
_morphy('bowses', 'n')	['bow']	[]
_morphy('carses', 'n')	['car']	[]
_morphy('cateresses', 'v')	['cater']	[]
_morphy('chowses', 'n')	['chow']	[]
_morphy('hydrases', 'n')	['hydra']	[]
_morphy('idlesses', 'n')	['idle']	[]
_morphy('idlesses', 'v')	['idle']	[]
_morphy('marses', 'v')	['mar']	[]
_morphy('pareses', 'v')	['pare', 'par']	[]
_morphy('replicases', 'n')	['replica']	[]
_morphy('semises', 'n')	['semi']	[]
_morphy('tootses', 'n')	['toot']	[]
_morphy('tootses', 'v')	['toot']	[]
_morphy('torqueses', 'n')	['torque']	[]

On the other hand, this PR does not change the results of @tomaarsen's plurals test: WordNetLemmatizer still performs better 470 times, morphy 32 times, and there are 62 ties.

However, this PR proposes to add a few comments in WordNetLemmatizer to make it clearer that its lemmatize() function picks the shortest lemma among the candidates returned by _morphy, and that this behaviour is a feature, not a bug.
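The selection step described above can be sketched as follows. This is a minimal illustration, not the NLTK source: pick_lemma is a hypothetical helper name, the candidate lists are passed in by hand rather than computed by _morphy, and the fallback to the input word is an assumption about how unknown words are handled.

```python
def pick_lemma(word, candidates):
    """Return the shortest candidate lemma, or the word itself if there is none.

    Hypothetical helper: `candidates` stands in for the list that
    _morphy would return for `word`.
    """
    # min() keeps the first candidate when several share the minimum length
    return min(candidates, key=len) if candidates else word


print(pick_lemma("pareses", ["pare", "par"]))  # shortest candidate wins
print(pick_lemma("catss", []))                 # no candidates: word unchanged
```

Picking by length means that, given the candidates "pare" and "par", the result is "par" regardless of the order in which they are returned.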

ekaf · 2024-01-13T09:39:23Z

Converted to draft, as it seems possible to handle more of the issues related to WordNetLemmatizer.

ekaf · 2024-01-14T09:01:14Z

The historical WordNet lemmatizer is Morphy, so many users would intuitively expect a more standard behaviour from WordNetLemmatizer.lemmatize(). Instead, that wrapper is defined by non-standard features: it defaults to nouns, picks the shortest lemma, and ultimately accepts any word, even one not included in WordNet.
So this PR proposes to also fix #1978 and #3227, by adding an alias to wordnet's _morphy() and morphy() functions, for those users of the WordNetLemmatizer class who want access to a more standard WordNet lemmatizer.
On the other hand, this PR leaves the lemmatize() function unchanged, for the many users who have been accustomed to its non-standard behaviour for a long time.

ekaf added 3 commits January 4, 2024 06:44

Avoid recursively stripping suffixes in _morphy

557dda3

Make clear that WordNetLemmatizer picks the shortest possible lemma

27ef06a

Edit WordNetLemmatizer docstring

27a618f

github-actions bot added corpus stem/lemma labels Jan 5, 2024

ekaf requested a review from tomaarsen January 5, 2024 08:54

ekaf added 2 commits January 12, 2024 09:21

Shorten _morphy code

36fa928

Optimize morphy()

db66575

ekaf marked this pull request as draft January 13, 2024 09:33

ekaf added 2 commits January 13, 2024 19:50

Add morphy modes to WordNetLemmatizer

2fa4924

Further simplify morphy()

b553ace

ekaf marked this pull request as ready for review January 14, 2024 08:47

This was referenced Jan 14, 2024

Wordnet synsets query problem #2441

Open

Consistent pos argument between wn.synsets() and WordNetLemmatizer.lemmatize() #1978

Open
