WordNetLemmatizer not properly lemmatizing some words #2567

gorj-tessella · 2020-06-29T19:20:21Z

Some words are lemmatized incorrectly because WordNetLemmatizer picks the shortest candidate lemma:

from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('dose', 'n') # returns "dose"
lemmatizer.lemmatize('doses', 'n') # returns "dos"
wordnet._morphy('doses', 'n') # returns ["dose", "dos"]
wordnet.morphy('doses', 'n') # returns "dose"

tomaarsen · 2020-10-23T16:17:10Z

@gorj-tessella
I've written the following program to quickly get an overview of how WordNetLemmatizer and morphy compare.

from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet as wn
import csv

# Number of times the two lemmatizers produced the same lemma
identical = 0
# Total number of accepted test cases
total = 0
# Plurals for which WordNetLemmatizer or morphy, respectively, had the better result
wordnet_wins = set()
morphy_wins = set()

wnl = WordNetLemmatizer()
with open("noun.csv", "r", errors='replace') as f:
    reader = csv.reader(f)
    for line in reader:
        singular = line[0]
        plurals = line[1:] # There might be multiple plurals
        
        for plural in plurals:
            # Lemmatize according to WordNetLemmatizer and morphy
            wn_l = wnl.lemmatize(plural, pos="n")
            m_l = wn.morphy(plural, pos="n")
            
            # Ignore if morphy is unable to lemmatize
            if m_l is not None:
                if wn_l != m_l:
                    # If wordnet is right and morphy is not:
                    if wn_l == singular:
                        wordnet_wins.add(plural)
                    # If morphy is right and wordnet is not:
                    if m_l == singular:
                        morphy_wins.add(plural)
                
                else:
                    identical += 1
                total += 1

# In case there are alternate spellings, add them to "tie"
# and remove them from the individual ones
tie = wordnet_wins.intersection(morphy_wins)
wordnet_wins -= tie
morphy_wins -= tie

breakpoint()

(Requires Python 3.7+, or Python 3.5+ if you remove the breakpoint() call.)

This program will go through a file "noun.csv", which I downloaded from https://github.com/djstrong/nouns-with-plurals/blob/master/noun.csv.
Each of the ~142000 plurals was lemmatized by both WordNetLemmatizer and morphy. All cases where morphy was unable to find a lemma were discarded. For the remaining cases, the program checks whether the two results differ. If they do, the plural is added to the set of whichever method produced the correct lemma, if any (i.e. wordnet_wins or morphy_wins). If they do not, identical is incremented. Lastly, total is incremented for each non-discarded plural.
Afterward, all plurals that are in both wordnet_wins and morphy_wins (e.g. when a plural has multiple valid lemmas and each method produces one of them) are added to tie and removed from wordnet_wins and morphy_wins.

Results

Out of more than 142000 plurals, morphy was only able to lemmatize 31613, as it requires the lemma to be a known word according to WordNet, and many of the plurals in the list are multiple words (snow ploughs) or simply not real words (σ-finite measures). Out of these 31613 test cases, the two methods resulted in the same lemma 30987 times, or ~98.02% of the time.

In the remaining cases the two methods disagreed. The interesting part here is finding out which one was correct more often.

For each of these plurals, WordNetLemmatizer is right and morphy is wrong.

['abs', 'acoustics', 'acres', 'aesthetics', 'affairs', 'aides', 'aids', 'allies', 'aloes', 'alps', 'ambages', 'amenities', 'anagrams', 'anas', 'ancients', 'anklets', 'ans', 'antipodes', 'antitrades', 'aras', 'archives', 'ares', 'arms', 'as', 'ascomycetes', 'assets', 'assizes', 'baas', 'balusters', 'banks', 'baptists', 'barrels', 'bars', 'basics', 'basidiomycetes', 'bbs', 'beads', 'beatniks', 'beats', 'bellows', 'bends', 'billings', 'bitters', 'bleachers', 'blinks', 'bloomers', 'blues', 'boards', 'bounds', 'bowels', 'bowls', 'boxcars', 'boxers', 'braces', 'brakes', 'breakers', 'brethren', 'bridges', 'briefs', 'brits', 'brooks', 'bunches', 'buns', 'burnouses', 'burns', 'buttocks', 'callas', 'canaries', 'candlepins', 'canticles', 'cascades', 'ceres', 'chains', 'chambers', 'channels', 'charades', 'checkers', 'cheviots', 'chilblains', 'chips', 'chives', 'circumstances', 'clams', 'clappers', 'classics', 'cleaners', 'cleats', 'cleavers', 'clews', 'clocks', 'clutches', 'cobblers', 'cocos', 'coevals', 'collards', 'comforts', 'comics', 'commons', 'communications', 'compliments', 'conditions', 'congratulations', 'conserves', 'contemporaries', 'contents', 'contras', 'conveniences', 'cords', 'corduroys', 'corrections', 'cos', 'costs', 'cows', 'crabs', 'craps', 'credentials', 'creeps', 'creepy-crawlies', 'crossroads', 'cs', 'customs', 'cyclopes', 'darts', 'das', 'davys', 'days', 'debs', 'deeds', 'descendants', 'deserts', 'devices', 'dialectics', 'diggings', 'digs', 'dippers', 'dominoes', 'dominos', 'dos', 'doubles', 
'dozens', 'draughts', 'drawers', 'dregs', 'duckpins', 'duds', 'dumas', 'dumplings', 'dumps', 'dunkers', 'dynamics', 'eas', 'eggs', 'elements', 'elves', 'ethics', 'eyeglasses', 'eyes', 'falls', 'fas', 'fatigues', 'feelings', 'fesses', 'fields', 'fifties', 'finances', 'findings', 'fives', 'fixings', 'flaps', 'flats', 'flies', 'folks', 'follies', 'followers', 'forties', 'fries', 'fumes', 'fundamentals', 'funds', 'funnies', 'gasteromycetes', 'gates', 'gens', 'giblets', 'glassworks', 'goldfields', 'gospels', 'graphics', 'greaves', 'greens', 'gripes', 'grits', 'groats', 'grotesqueries', 'groves', 'guts', 'gyps', 'hackles', 'hands', 'hanks', 'harmonics', 'hays', 'heaps', 'heavens', 'heaves', 'highlands', 'hindquarters', 'hippies', 'hipsters', 'hives', 'hooks', 
'hoops', 'hops', 'horseshoes', 'hours', 'huaraches', 'humans', 'hurdles', 'hymeneals', 'hysterics', 'indris', 'ins', 'ironsides', 'isometrics', 'jacks', 'jackstraws', 'jaspers', 'jimmies', 'jitters', 'johns', 'judges', 'junkers', 'khakis', 'knobkerries', 'knuckles', 'ks', 'lamentations', 'lancers', 'lashings', 'lats', 'laurels', 'laws', 'leaders', 'lees', 'leftovers', 'legs', 'leotards', 'letters', 'liabilities', 'limbers', 'loads', 'lodgings', 'logos', 'loins', 'loos', 'losses', 'lots', 'lowlands', 'lyons', 'majors', 'manes', 'manners', 'marbles', 'marches', 'marines', 'marks', 'mars', 'marshals', 'masses', 'masters', 'mates', 'maths', 'maulers', 'mays', 'means', 'mechanics', 'medlars', 'megrims', 'methodists', 'metrics', 'mills', 'minors', 'minutes', 'mnemonics', 'mods', 'monas', 'mopes', 'morals', 'mores', 'myxomycetes', 'najas', 'names', 'nerves', 'ninepins', 'nones', 'nothings', 'numbers', 'occasions', 'oddments', 'operations', 'optics', 'organs', 'outskirts', 'oxen', 'pants', 'parks', 'parsons', 'parts', 'peanuts', 'pickings', 'piles', 'pinches', 'plaudits', 'pliers', 'plyers', 'polemics', 'polls', 'pooves', 'porcupines', 
'prelims', 'premises', 'primates', 'privates', 'proceedings', 'profits', 'propaedeutics', 'prophets', 'props', 'proverbs', 'provisions', 'psalms', 'pyrites', 'quadratics', 'queens', 'quoits', 'raffles', 'rafts', 'rails', 'rastas', 'rates', 'receipts', 'relations', 'reserves', 'rings', 'roads', 'rockers', 'rooms', 'roots', 'rounders', 'ruddles', 'rudiments', 'sales', 'sauternes', 
'saxes', 'scads', 'scopes', 'scores', 'scots', 'scraps', 'scruples', 'seats', 'sellers', 'sens', 'services', 'sessions', 'settlings', 'shakers', 'shambles', 'shears', 'shekels', 'shirtsleeves', 'shoes', 'shorts', 'shucks', 'silks', 'sills', 'silversides', 'singles', 'sis', 'skinheads', 'skittles', 'skivvies', 'slews', 'slops', 'snips', 'snuffers', 'sops', 'sos', 'soviets', 'spareribs', 'specs', 'spectacles', 'spillikins', 'spirits', 'splinters', 'spots', 'sprinkles', 'sprites', 'stacks', 'staggers', 'stairs', 'stakes', 'stalls', 'stamina', 'stations', 'stays', 'steps', 'stigmata', 'stockholdings', 'stops', 'straits', 'stripes', 'sweetbreads', 'tabernacles', 'tactics', 'tails', 'talks', 'taps', 'taxis', 'tears', 'teens', 'tenpins', 'terms', 'testudines', 
'things', 'thirties', 'threads', 'throes', 'tigers', 'tons', 'tours', 'transactions', 'trappings', 'trembles', 'trimmings', 'troglodytes', 'troops', 'tropics', 'trumpets', 'trunks', 'tums', 'twenties', 'twins', 'values', 'vapors', 'velours', 'vespers', 'viands', 'vibes', 'victuals', 'viewers', 'waders', 'wads', 'wages', 'wales', 'watts', 'ways', 'weeds', 'wells', 'whiskers', 'windows', 'wings', 'winnings', 'wits', 'woods', 'words', 'workings', 'yaws', 'years', 'yips']

In total, WordNetLemmatizer performs better than morphy on 470 plurals.

For each of these plurals, morphy is right and WordNetLemmatizer is wrong.

['beanies', 'colors', 'corps', 'corpses', 'coss', 'cruses', 'dies', 'doses', 'gendarmeries', 'grazes', 'grounds', 'kurus', 'morses', 'motives', 'muses', 'paraleipses', 'ploughmen', 'raves', 'reeves', 'reverses', 'roomies', 'ruses', 'senses', 'serves', 'sharpies', 'species', 'stogies', 'taeniae', 'touracos', 'uses', 'vases', 'weirdies']

In total, morphy performs better than WordNetLemmatizer on 32 plurals, including your doses.

And for each of these plurals both methods return a different but valid alternative spelling.

['adzes', 'aeries', 'alexanders', 'annexes', 'aunties', 'battle-axes', 'bennies', 'blintzes', 'bogies', 'bolshies', 'booties', 'broadaxes', 'caddies', 'cartouches', 'cookies', 'coolies', 'cowries', 'crosses', 'darkies', 'dearies', 'dickies', 'doggies', 'dogies', 'eyries', 'faeries', 'floozies', 'goonies', 'grannies', 'hankies', 'hoagies', 'honkies', 'innings', 'junkies', 'kelpies', 'lenses', 'links', 'loonies', 'marquises', 'meanies', 'mews', 'mollies', 'oraches', 'organdies', 'panties', 'pas', 'pavises', 'pickaxes', 'pinkies', 'pixies', 'poleaxes', 'punkies', 'quickies', 'reveries', 'scrubs', 'shingles', 'sises', 'smoothies', 'softies', 'statistics', 'stymies', 'thrips', 'townies']

In total, there are 62 occurrences of both methods returning a different but valid alternative spelling.

In conclusion, though choosing the shortest lemma (as WordNetLemmatizer does) sometimes produces problems (e.g. for "doses"), in the vast majority of cases it is better than simply taking the first option from _morphy, as morphy does. This can be seen for e.g. "abs", where _morphy returns ["abs", "ab"].

So, this is not something that should be changed. WordNetLemmatizer should keep picking the shortest lemma.
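
As a rough illustration of the two selection strategies compared above, the sketch below applies them to the candidate lists quoted in this thread. The helper names pick_shortest and pick_first are made up for this example and are not NLTK functions. The sketch assumes only what the thread describes: WordNetLemmatizer takes the shortest candidate and returns the input word when there are none, while morphy takes the first candidate or returns None.

```python
def pick_shortest(candidates, word):
    """Hypothetical stand-in for WordNetLemmatizer's choice: shortest candidate,
    or the input word itself when no candidate was found."""
    return min(candidates, key=len) if candidates else word


def pick_first(candidates):
    """Hypothetical stand-in for morphy's choice: first candidate, or None."""
    return candidates[0] if candidates else None


# Candidate lists as reported for _morphy earlier in this thread.
reported = {
    "doses": ["dose", "dos"],
    "abs": ["abs", "ab"],
}

for word, candidates in reported.items():
    print(word, "shortest:", pick_shortest(candidates, word),
          "first:", pick_first(candidates))
```

Neither strategy is right for both words: the shortest candidate is wrong for "doses" and the first candidate is wrong for "abs", which is why the comparison above had to be settled by counting.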

Tom Aarsen

caiw · 2021-01-15T13:12:12Z

Also where WordNetLemmatizer is wrong:
"possesses" -> "posse"
"ramesses" -> "ram"
"james" -> "jam"
"iss" -> "i"

Acervans · 2022-09-06T10:12:26Z

Also "riding" and "rides" -> "rid"

novalis · 2022-09-13T15:01:54Z

anchoresses -> anchor
siped -> sip
trapes -> trap
askeses -> ask
bibless -> bible
bowses -> bow
carses -> car
cates -> cat
cateresses -> cater (pos=v -- it should return nothing, since cateresses is not a verb)
chowses -> chow
hydrases -> hydra
idlesses -> idle
marses -> mars (or mar, with pos=v)
pareses -> par
replicases -> replica
semises -> semi
tared -> tar
taring -> tar
tootses -> toot
torqueses -> torque

ekaf · 2023-12-29T08:29:53Z

Like most linguistic tools, lemmatizers are not perfect. Their accuracy is an open research problem, and more research is needed in order to achieve perfect lemmatization. This suggests drawing a distinction between open research problems and software issues.

(Edited) So, as long as we think that NLTK correctly implements the best known algorithms, there might not be a real software issue. But thanks to @nezda's comment below, we now know that NLTK's implementation cannot be correct.

However, it is also well known that Princeton morphy is overly permissive (it "over-generates"). This is usually considered a convenient feature for analysis, and not a bug. For example, the Princeton wn program recognizes cates as cat, and siped as sip. It is a known feature of the original morphy algorithm, and fixing it would mean implementing a different algorithm.

This suggests fixing only those errors in NLTK's morphy implementation that do not stem from the original Princeton algorithm.

novalis · 2023-12-29T14:44:23Z

A lookup table would be 100% accurate (for cases that are included), and would not require new research. It would require some sweat, but we have the start of such a table right here in this GitHub issue.
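
A minimal sketch of what such a table could look like, assuming it is consulted before any rule-based lemmatizer. The names LEMMA_OVERRIDES, lemmatize_with_overrides and naive_fallback are hypothetical, and the three entries are only illustrative picks based on words mentioned in this issue.

```python
# Illustrative entries only; a real table would have to be curated by hand.
LEMMA_OVERRIDES = {
    ("doses", "n"): "dose",
    ("rides", "v"): "ride",
    ("us", "n"): "us",
}


def lemmatize_with_overrides(word, pos, fallback):
    """Return the table entry if there is one, else defer to fallback(word, pos)."""
    try:
        return LEMMA_OVERRIDES[(word, pos)]
    except KeyError:
        return fallback(word, pos)


def naive_fallback(word, pos):
    """Stand-in for a real lemmatizer: strips a single trailing "s"."""
    return word[:-1] if word.endswith("s") else word


print(lemmatize_with_overrides("doses", "n", naive_fallback))  # table hit
print(lemmatize_with_overrides("cats", "n", naive_fallback))   # falls through
```

In practice the fallback could be something like WordNetLemmatizer().lemmatize, which also takes a word and a part of speech. Words missing from the table still get whatever the fallback produces.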

…

On December 29, 2023 3:30:04 AM EST, Eric Kafe wrote:
Like most linguistic tools, lemmatizers are not perfect, their accuracy is an open research problem, and more research is needed in order to achieve perfect lemmatization. This suggests to draw a distinction between open research problems vs. software issues. For ex. here, where NLTK correctly implements the best known algorithms, there does not seem to be a software issue.

nezda · 2023-12-29T20:41:45Z

It seems the original WordNet Morphy implementation doesn't have many of these bugs, so maybe they are in fact bugs in this implementation? For example:

http://wordnetweb.princeton.edu/perl/webwn?s=possesses&sub=Search+WordNet&o2=&o0=1&o8=1&o1=1&o7=&o5=&o9=&o6=&o3=&o4=&h= - no "possesses" → "posse"
http://wordnetweb.princeton.edu/perl/webwn?s=ramesses&sub=Search+WordNet&o2=&o0=1&o8=1&o1=1&o7=&o5=&o9=&o6=&o3=&o4=&h=0000000000000000 - no "ramesses" → "ram"
http://wordnetweb.princeton.edu/perl/webwn?s=james&sub=Search+WordNet&o2=&o0=1&o8=1&o1=1&o7=&o5=&o9=&o6=&o3=&o4=&h=0 - no "james" → "jam"
http://wordnetweb.princeton.edu/perl/webwn?s=iss&sub=Search+WordNet&o2=&o0=1&o8=1&o1=1&o7=&o5=&o9=&o6=&o3=&o4=&h=0000000000 - no "iss" → "i"

ekaf · 2023-12-30T09:30:41Z

Yes @novalis, a lexicon of inflected forms would be nice to have. There exists a good one for French, The Lefff, a Freely Available and Large-coverage Morphological and Syntactic Lexicon for French. However, such resources can only handle words within the limits of the vocabulary that they know.

ekaf · 2023-12-30T09:50:46Z

Thanks @caiw and @nezda: indeed, the original Morphy from Princeton does not have these bugs, and neither does my Prolog implementation, wn_morphy.pl.
So there must be something wrong with the implementation of morphy in NLTK's wordnet.py module, and we need to worry about this issue.

ekaf · 2024-01-03T12:21:41Z

WordNetLemmatizer and morphy are only simple wrappers that pick a single lemma from the list of candidates found by _morphy. The strategies chosen (shortest vs. first found lemma) may not always be adequate, and it is especially problematic that WordNetLemmatizer returns any garbage input as its own lemma. However, I don't think these wrappers need fixing. Instead, alternative stemmers are available, and more can be developed.

But we can, and should, fix the main _morphy function in wordnet.py, which returns a list of lemma candidates. It is mostly a faithful implementation of Princeton Morphy, except for one major difference: whereas the original only does one pass over the possible morphological substitutions, the NLTK implementation can recursively keep removing suffixes. This practice dates back to the original pywordnet by @osteele, but it is not part of the original Princeton Morphy algorithm.

For example, without this spurious recursion step, no lemma would be found for "iss" using morphy, because the "s" ending would only be stripped once.
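
The difference can be sketched as follows, assuming a toy lexicon and only two noun suffix rules. TOY_RULES, TOY_LEXICON and the function names are hypothetical. This is a simplified illustration, not the code in wordnet.py, and it leaves out the exception lists.

```python
# Toy data for illustration only; not taken from WordNet or wordnet.py.
TOY_RULES = [("s", ""), ("ses", "s")]  # (suffix to strip, replacement)
TOY_LEXICON = {"i", "dose", "dos", "posse"}


def strip_once(form):
    """Apply every matching rule exactly once to a single form."""
    return [form[: len(form) - len(old)] + new
            for old, new in TOY_RULES if form.endswith(old)]


def single_pass(word):
    """One round of substitutions, keeping the forms found in the lexicon."""
    return [f for f in [word] + strip_once(word) if f in TOY_LEXICON]


def recursive(word):
    """Like single_pass, but keeps stripping suffixes until something matches."""
    found = single_pass(word)
    forms = strip_once(word)
    while forms and not found:
        forms = [g for f in forms for g in strip_once(f)]
        found = [f for f in forms if f in TOY_LEXICON]
    return found


for word in ("doses", "iss", "possesses"):
    print(word, "single pass:", single_pass(word), "recursive:", recursive(word))
```

With this toy data, both variants agree on "doses", but only the recursive one reduces "iss" to "i" and "possesses" to "posse". The single pass finds nothing for those two inputs.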

nezda · 2024-01-03T17:06:04Z

now that's what i call a deep dive - nice one @ekaf

tomaarsen mentioned this issue Jan 19, 2022

[question] "us" lemmatizes into "u"? #2930

Closed

ekaf self-assigned this Jan 3, 2024

ekaf linked a pull request Jan 5, 2024 that will close this issue

Avoid recursive suffix stripping in wordnet morphy #3225

Open
