WordNetLemmatizer not properly lemmatizing some words #2567

Open
gorj-tessella opened this issue Jun 29, 2020 · 11 comments · May be fixed by #3225

@gorj-tessella

Some words are lemmatized improperly, due to picking the smallest possible lemma:

from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('dose', 'n')  # returns "dose"
lemmatizer.lemmatize('doses', 'n')  # returns "dos"
wordnet._morphy('doses', 'n')  # returns ["dose", "dos"]
wordnet.morphy('doses', 'n')  # returns "dose"
@tomaarsen
Member

@gorj-tessella
I've written the following program to quickly get an overview of how WordNetLemmatizer and morphy compare.

from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet as wn
import csv

# Number of times the two lemmatizers produced the same lemma
identical = 0
# Total number of accepted test cases
total = 0
# Plurals for which WordNetLemmatizer or morphy, respectively, had the better result
wordnet_wins = set()
morphy_wins = set()

wnl = WordNetLemmatizer()
with open("noun.csv", "r", errors='replace') as f:
    reader = csv.reader(f)
    for line in reader:
        singular = line[0]
        plurals = line[1:] # There might be multiple plurals
        
        for plural in plurals:
            # Lemmatize according to WordNetLemmatizer and morphy
            wn_l = wnl.lemmatize(plural, pos="n")
            m_l = wn.morphy(plural, pos="n")
            
            # Ignore if morphy is unable to lemmatize
            if m_l is not None:
                if wn_l != m_l:
                    # If wordnet is right and morphy is not:
                    if wn_l == singular:
                        wordnet_wins.add(plural)
                    # If morphy is right and wordnet is not:
                    if m_l == singular:
                        morphy_wins.add(plural)
                
                else:
                    identical += 1
                total += 1

# In case there are alternate spellings, add them to "tie"
# and remove them from the individual ones
tie = wordnet_wins.intersection(morphy_wins)
wordnet_wins -= tie
morphy_wins -= tie

breakpoint()

(Requires Python 3.7+, or Python 3.5+ if you remove the breakpoint() call.)

This program will go through a file "noun.csv", which I downloaded from https://github.com/djstrong/nouns-with-plurals/blob/master/noun.csv.
Each of the ~142000 plurals was lemmatized by both WordNetLemmatizer and morphy. All cases where morphy was unable to find a lemma were discarded. For the remaining cases, the program checks whether the two results differ. If they do, the plural is added to the set of whichever method produced the correct lemma, if any (i.e. wordnet_wins or morphy_wins). If they do not, identical is incremented. Lastly, total is incremented for each non-discarded case.
Afterward, all plurals that are in both wordnet_wins and morphy_wins (e.g. when a plural has multiple possible lemmas and each method produces one of those valid lemmas) are added to tie and removed from wordnet_wins and morphy_wins.
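
The tallying logic described above can be sketched without NLTK by passing the two lemmatizers in as plain functions. This is only a minimal sketch, not the program that produced the results below: `compare_lemmatizers` is a hypothetical helper, and the two stub lookup tables are made-up stand-ins for WordNetLemmatizer and morphy.

```python
def compare_lemmatizers(rows, lemmatize_a, lemmatize_b):
    """Tally agreement between two lemmatizers.

    rows is an iterable of (singular, [plural, ...]) pairs.
    Cases where lemmatize_b returns None are skipped entirely.
    """
    identical = 0
    total = 0
    a_wins, b_wins = set(), set()
    for singular, plurals in rows:
        for plural in plurals:
            a = lemmatize_a(plural)
            b = lemmatize_b(plural)
            if b is None:
                continue  # discarded: the second lemmatizer found nothing
            total += 1
            if a == b:
                identical += 1
                continue
            if a == singular:
                a_wins.add(plural)
            if b == singular:
                b_wins.add(plural)
    # A plural listed under several singulars can be "won" by both sides.
    tie = a_wins & b_wins
    return {
        "identical": identical,
        "total": total,
        "a_wins": a_wins - tie,
        "b_wins": b_wins - tie,
        "tie": tie,
    }


# Made-up stand-ins for the two real lemmatizers (illustration only).
SHORTEST = {"doses": "dos", "abs": "ab", "cats": "cat", "pickaxes": "pickax"}
FIRST = {"doses": "dose", "abs": "abs", "cats": "cat", "pickaxes": "pickaxe"}

rows = [
    ("dose", ["doses"]),
    ("ab", ["abs"]),
    ("cat", ["cats"]),
    ("pickaxe", ["pickaxes"]),
    ("pickax", ["pickaxes"]),
    ("foo", ["foos"]),  # unknown to the second stub, so it is discarded
]

result = compare_lemmatizers(
    rows,
    lambda w: SHORTEST.get(w, w),  # falls back to the input, like the issue describes
    lambda w: FIRST.get(w),        # returns None when nothing is found
)
print(result)
```

Injecting the lemmatizers as callables keeps the bookkeeping testable on its own; the real functions can be swapped in where NLTK and its WordNet data are installed.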


Results

Out of more than 142000 plurals, morphy was only able to lemmatize 31613, as it requires the lemma to be a known word according to WordNet, and many of the plurals in the list are multi-word expressions (snow ploughs) or simply not in WordNet (σ-finite measures). Out of these 31613 test cases, the two methods resulted in the same lemma 30987 times, or ~97.59% of the time.

In the remaining cases the results differed. The interesting part here is finding out which method was correct more often.


For each of these plurals, WordNetLemmatizer is right and morphy is wrong.

['abs', 'acoustics', 'acres', 'aesthetics', 'affairs', 'aides', 'aids', 'allies', 'aloes', 'alps', 'ambages', 'amenities', 'anagrams', 'anas', 'ancients', 'anklets', 'ans', 'antipodes', 'antitrades', 'aras', 'archives', 'ares', 'arms', 'as', 'ascomycetes', 'assets', 'assizes', 'baas', 'balusters', 'banks', 'baptists', 'barrels', 'bars', 'basics', 'basidiomycetes', 'bbs', 'beads', 'beatniks', 'beats', 'bellows', 'bends', 'billings', 'bitters', 'bleachers', 'blinks', 'bloomers', 'blues', 'boards', 'bounds', 'bowels', 'bowls', 'boxcars', 'boxers', 'braces', 'brakes', 'breakers', 'brethren', 'bridges', 'briefs', 'brits', 'brooks', 'bunches', 'buns', 'burnouses', 'burns', 'buttocks', 'callas', 'canaries', 'candlepins', 'canticles', 'cascades', 'ceres', 'chains', 'chambers', 'channels', 'charades', 'checkers', 'cheviots', 'chilblains', 'chips', 'chives', 'circumstances', 'clams', 'clappers', 'classics', 'cleaners', 'cleats', 'cleavers', 'clews', 'clocks', 'clutches', 'cobblers', 'cocos', 'coevals', 'collards', 'comforts', 'comics', 'commons', 'communications', 'compliments', 'conditions', 'congratulations', 'conserves', 'contemporaries', 'contents', 'contras', 'conveniences', 'cords', 'corduroys', 'corrections', 'cos', 'costs', 'cows', 'crabs', 'craps', 'credentials', 'creeps', 'creepy-crawlies', 'crossroads', 'cs', 'customs', 'cyclopes', 'darts', 'das', 'davys', 'days', 'debs', 'deeds', 'descendants', 'deserts', 'devices', 'dialectics', 'diggings', 'digs', 'dippers', 'dominoes', 'dominos', 'dos', 'doubles', 
'dozens', 'draughts', 'drawers', 'dregs', 'duckpins', 'duds', 'dumas', 'dumplings', 'dumps', 'dunkers', 'dynamics', 'eas', 'eggs', 'elements', 'elves', 'ethics', 'eyeglasses', 'eyes', 'falls', 'fas', 'fatigues', 'feelings', 'fesses', 'fields', 'fifties', 'finances', 'findings', 'fives', 'fixings', 'flaps', 'flats', 'flies', 'folks', 'follies', 'followers', 'forties', 'fries', 'fumes', 'fundamentals', 'funds', 'funnies', 'gasteromycetes', 'gates', 'gens', 'giblets', 'glassworks', 'goldfields', 'gospels', 'graphics', 'greaves', 'greens', 'gripes', 'grits', 'groats', 'grotesqueries', 'groves', 'guts', 'gyps', 'hackles', 'hands', 'hanks', 'harmonics', 'hays', 'heaps', 'heavens', 'heaves', 'highlands', 'hindquarters', 'hippies', 'hipsters', 'hives', 'hooks', 
'hoops', 'hops', 'horseshoes', 'hours', 'huaraches', 'humans', 'hurdles', 'hymeneals', 'hysterics', 'indris', 'ins', 'ironsides', 'isometrics', 'jacks', 'jackstraws', 'jaspers', 'jimmies', 'jitters', 'johns', 'judges', 'junkers', 'khakis', 'knobkerries', 'knuckles', 'ks', 'lamentations', 'lancers', 'lashings', 'lats', 'laurels', 'laws', 'leaders', 'lees', 'leftovers', 'legs', 'leotards', 'letters', 'liabilities', 'limbers', 'loads', 'lodgings', 'logos', 'loins', 'loos', 'losses', 'lots', 'lowlands', 'lyons', 'majors', 'manes', 'manners', 'marbles', 'marches', 'marines', 'marks', 'mars', 'marshals', 'masses', 'masters', 'mates', 'maths', 'maulers', 'mays', 'means', 'mechanics', 'medlars', 'megrims', 'methodists', 'metrics', 'mills', 'minors', 'minutes', 'mnemonics', 'mods', 'monas', 'mopes', 'morals', 'mores', 'myxomycetes', 'najas', 'names', 'nerves', 'ninepins', 'nones', 'nothings', 'numbers', 'occasions', 'oddments', 'operations', 'optics', 'organs', 'outskirts', 'oxen', 'pants', 'parks', 'parsons', 'parts', 'peanuts', 'pickings', 'piles', 'pinches', 'plaudits', 'pliers', 'plyers', 'polemics', 'polls', 'pooves', 'porcupines', 
'prelims', 'premises', 'primates', 'privates', 'proceedings', 'profits', 'propaedeutics', 'prophets', 'props', 'proverbs', 'provisions', 'psalms', 'pyrites', 'quadratics', 'queens', 'quoits', 'raffles', 'rafts', 'rails', 'rastas', 'rates', 'receipts', 'relations', 'reserves', 'rings', 'roads', 'rockers', 'rooms', 'roots', 'rounders', 'ruddles', 'rudiments', 'sales', 'sauternes', 
'saxes', 'scads', 'scopes', 'scores', 'scots', 'scraps', 'scruples', 'seats', 'sellers', 'sens', 'services', 'sessions', 'settlings', 'shakers', 'shambles', 'shears', 'shekels', 'shirtsleeves', 'shoes', 'shorts', 'shucks', 'silks', 'sills', 'silversides', 'singles', 'sis', 'skinheads', 'skittles', 'skivvies', 'slews', 'slops', 'snips', 'snuffers', 'sops', 'sos', 'soviets', 'spareribs', 'specs', 'spectacles', 'spillikins', 'spirits', 'splinters', 'spots', 'sprinkles', 'sprites', 'stacks', 'staggers', 'stairs', 'stakes', 'stalls', 'stamina', 'stations', 'stays', 'steps', 'stigmata', 'stockholdings', 'stops', 'straits', 'stripes', 'sweetbreads', 'tabernacles', 'tactics', 'tails', 'talks', 'taps', 'taxis', 'tears', 'teens', 'tenpins', 'terms', 'testudines', 
'things', 'thirties', 'threads', 'throes', 'tigers', 'tons', 'tours', 'transactions', 'trappings', 'trembles', 'trimmings', 'troglodytes', 'troops', 'tropics', 'trumpets', 'trunks', 'tums', 'twenties', 'twins', 'values', 'vapors', 'velours', 'vespers', 'viands', 'vibes', 'victuals', 'viewers', 'waders', 'wads', 'wages', 'wales', 'watts', 'ways', 'weeds', 'wells', 'whiskers', 'windows', 'wings', 'winnings', 'wits', 'woods', 'words', 'workings', 'yaws', 'years', 'yips']

In total, WordNetLemmatizer performs better than morphy on 470 plurals.


For each of these plurals, morphy is right and WordNetLemmatizer is wrong.

['beanies', 'colors', 'corps', 'corpses', 'coss', 'cruses', 'dies', 'doses', 'gendarmeries', 'grazes', 'grounds', 'kurus', 'morses', 'motives', 'muses', 'paraleipses', 'ploughmen', 'raves', 'reeves', 'reverses', 'roomies', 'ruses', 'senses', 'serves', 'sharpies', 'species', 'stogies', 'taeniae', 'touracos', 'uses', 'vases', 'weirdies']

In total, morphy performs better than WordNetLemmatizer on 32 plurals, including your doses.


And for each of these plurals both methods return a different but valid alternative spelling.

['adzes', 'aeries', 'alexanders', 'annexes', 'aunties', 'battle-axes', 'bennies', 'blintzes', 'bogies', 'bolshies', 'booties', 'broadaxes', 'caddies', 'cartouches', 'cookies', 'coolies', 'cowries', 'crosses', 'darkies', 'dearies', 'dickies', 'doggies', 'dogies', 'eyries', 'faeries', 'floozies', 'goonies', 'grannies', 'hankies', 'hoagies', 'honkies', 'innings', 'junkies', 'kelpies', 'lenses', 'links', 'loonies', 'marquises', 'meanies', 'mews', 'mollies', 'oraches', 'organdies', 'panties', 'pas', 'pavises', 'pickaxes', 'pinkies', 'pixies', 'poleaxes', 'punkies', 'quickies', 'reveries', 'scrubs', 'shingles', 'sises', 'smoothies', 'softies', 'statistics', 'stymies', 'thrips', 'townies']

In total, there are 62 occurrences of both methods returning a different but valid alternative spelling.


In conclusion, although choosing the shortest lemma (as WordNetLemmatizer does) sometimes produces problems (e.g. for "doses"), in the vast majority of cases it is better than simply taking the first option from _morphy, as morphy does. This can be seen for e.g. "abs", where _morphy returns ["abs", "ab"].

So, this is not something that should be changed. WordNetLemmatizer should pick the shortest lemma.
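
As a small illustration of the two selection strategies, here is a sketch that works only on the candidate lists quoted in this thread (["dose", "dos"] for "doses" and ["abs", "ab"] for "abs"). The helper names `pick_shortest` and `pick_first` are hypothetical, and the code does not call NLTK.

```python
def pick_shortest(candidates):
    """Strategy attributed to WordNetLemmatizer in this thread: shortest candidate."""
    return min(candidates, key=len)


def pick_first(candidates):
    """Strategy attributed to morphy in this thread: first candidate."""
    return candidates[0]


# Candidate lists as reported in this thread for _morphy(word, 'n').
reported_candidates = {
    "doses": ["dose", "dos"],
    "abs": ["abs", "ab"],
}

choices = {
    word: (pick_shortest(cands), pick_first(cands))
    for word, cands in reported_candidates.items()
}
for word, (shortest, first) in choices.items():
    print(word, "shortest:", shortest, "first:", first)
# doses shortest: dos first: dose
# abs shortest: ab first: abs
```

Neither strategy is right for both words, which is the trade-off the counts above quantify.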

  • Tom Aarsen

@caiw

caiw commented Jan 15, 2021

Also where WordNetLemmatizer is wrong:
"possesses" -> "posse"
"ramesses" -> "ram"
"james" -> "jam"
"iss" -> "i"

@Acervans

Acervans commented Sep 6, 2022

Also "riding" and "rides" -> "rid"

@novalis

novalis commented Sep 13, 2022

anchoresses -> anchor
siped -> sip
trapes -> trap
askeses -> ask
bibless -> bible
bowses -> bow
carses -> car
cates -> cat
cateresses -> cater (pos=v -- it should return nothing, since cateresses is not a verb)
chowses -> chow
hydrases -> hydra
idlesses -> idle
marses -> mars (or mar, with pos=v)
pareses -> par
replicases -> replica
semises -> semi
tared -> tar
taring -> tar
tootses -> toot
torqueses -> torque

@ekaf
Contributor

ekaf commented Dec 29, 2023

Like most linguistic tools, lemmatizers are not perfect. Their accuracy is an open research problem, and more research is needed to achieve perfect lemmatization. This suggests drawing a distinction between open research problems and software issues.

(Edited) So, as long as we think that NLTK correctly implements the best known algorithms, there might not be a real software issue. But thanks to @nezda's comment below, we now know that NLTK's implementation cannot be correct.

However, it is also well known that Princeton morphy is overly permissive (it "over-generates"). This is usually considered a convenient feature for analysis, not a bug. For example, the Princeton wn program recognizes cates as cat, and siped as sip. This is a known feature of the original morphy algorithm, and fixing it would mean implementing a different algorithm.

This suggests fixing only those errors in NLTK's morphy implementation that do not stem from the original Princeton algorithm.

@novalis

novalis commented Dec 29, 2023 via email

@ekaf
Contributor

ekaf commented Dec 30, 2023

Yes @novalis, a lexicon of inflected forms would be nice to have. A good one exists for French: the Lefff, a Freely Available and Large-coverage Morphological and Syntactic Lexicon for French. However, such resources can only handle words within the limits of the vocabulary that they know.
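
To make the vocabulary limit concrete, here is a minimal sketch of a lexicon-based lookup. The table `INFLECTED_FORMS` and the function `lookup_lemma` are hypothetical, and the handful of entries is hand-written for illustration rather than taken from any real resource.

```python
# Hypothetical, hand-written lexicon: (inflected form, part of speech) -> lemma.
INFLECTED_FORMS = {
    ("doses", "n"): "dose",
    ("possesses", "v"): "possess",
    ("rides", "v"): "ride",
    ("riding", "v"): "ride",
}


def lookup_lemma(form, pos, lexicon=INFLECTED_FORMS):
    """Return the listed lemma, or None when the form is outside the lexicon."""
    return lexicon.get((form.lower(), pos))


known = lookup_lemma("possesses", "v")
unknown = lookup_lemma("σ-finite measures", "n")
print(known)    # possess
print(unknown)  # None
```

A lookup like this never over-generates, but it returns nothing for any form that was not listed in advance.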

@ekaf
Contributor

ekaf commented Dec 30, 2023

Thanks @caiw and @nezda: indeed, the original Morphy from Princeton does not have these bugs, and neither does my Prolog implementation, wn_morphy.pl.
So there must be something wrong with the implementation of morphy in NLTK's wordnet.py module, and we need to worry about this issue.

@ekaf
Contributor

ekaf commented Jan 3, 2024

WordNetLemmatizer and morphy are simple wrappers that each pick a single lemma from the list of candidates found by _morphy. The strategies chosen (shortest vs. first-found lemma) may not always be adequate, and it is especially problematic that WordNetLemmatizer returns any garbage input unchanged as its own lemma. However, I don't think these wrappers need fixing. Instead, alternative stemmers are available, and more can be developed.
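
For callers who want to tell "no lemma found" apart from "the lemma equals the input", one option is a thin wrapper of their own around the candidate list. The sketch below is an assumption about how such a wrapper could look, not NLTK code: `lemmatize_with_status` is a hypothetical name, and the candidate lists are passed in by hand instead of coming from _morphy.

```python
def lemmatize_with_status(word, candidates):
    """Return (lemma, found).

    Keeps the shortest-candidate strategy, but reports explicitly whether any
    candidate existed instead of silently echoing the input back.
    """
    if not candidates:
        return word, False
    return min(candidates, key=len), True


examples = [
    ("doses", ["dose", "dos"]),  # candidate list as reported in this thread
    ("dose", ["dose"]),          # lemma happens to equal the input
    ("asdfgh", []),              # garbage input: no candidates at all
]
statuses = [lemmatize_with_status(word, cands) for word, cands in examples]
print(statuses)
# [('dos', True), ('dose', True), ('asdfgh', False)]
```

The flag lets downstream code drop or flag unknown tokens without changing the wrappers themselves.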

But we can, and should, fix the main _morphy function in wordnet.py, which returns a list of lemma candidates. It is mostly a faithful implementation of Princeton Morphy, except for one major difference: whereas the original makes only one pass over the possible morphological substitutions, the NLTK implementation can recursively keep removing suffixes. This practice dates back to the original pywordnet by @osteele, but it is not part of the original Princeton Morphy algorithm.

For example, without this spurious recursion step, no lemma would be found for "iss" using morphy, because the "s" ending would only be stripped once.
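
The difference between a single pass and repeated suffix stripping can be sketched on a toy scale. Everything below is an illustrative assumption: `TOY_NOUNS` is a tiny made-up lexicon, `TOY_RULES` is a small subset of noun suffix substitutions, the exception-list lookup is omitted, and `candidates` is a hypothetical function rather than NLTK's _morphy.

```python
TOY_NOUNS = {"i", "dose", "dos", "posse"}          # made-up mini lexicon
TOY_RULES = [("s", ""), ("ses", "s"), ("ies", "y")]  # (old suffix, new suffix)


def apply_rules(forms, rules):
    """Apply every matching suffix substitution once to every form."""
    return [
        form[: -len(old)] + new
        for form in forms
        for old, new in rules
        if form.endswith(old)
    ]


def filter_known(forms, lexicon):
    """Keep forms that are in the lexicon, preserving order, without duplicates."""
    kept = []
    for form in forms:
        if form in lexicon and form not in kept:
            kept.append(form)
    return kept


def candidates(form, lexicon=TOY_NOUNS, rules=TOY_RULES, repeat=False):
    """Return lemma candidates after one pass, or after repeated passes if repeat=True."""
    forms = apply_rules([form], rules)
    results = filter_known([form] + forms, lexicon)
    if results or not repeat:
        return results
    # Repeated stripping: keep applying the rules until something matches.
    while forms:
        forms = apply_rules(forms, rules)
        results = filter_known(forms, lexicon)
        if results:
            return results
    return []


for word in ["doses", "iss", "possesses"]:
    print(word, candidates(word), candidates(word, repeat=True))
# doses ['dose', 'dos'] ['dose', 'dos']
# iss [] ['i']
# possesses [] ['posse']
```

With a single pass, "iss" and "possesses" yield no noun candidates in this toy setup, while repeated stripping eventually reaches "i" and "posse", matching the kind of results reported earlier in the thread.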

@ekaf ekaf self-assigned this Jan 3, 2024
@nezda

nezda commented Jan 3, 2024

now that's what i call a deep dive - nice one @ekaf

@ekaf ekaf linked a pull request Jan 5, 2024 that will close this issue