WordNetLemmatizer not properly lemmatizing some words #2567
Comments
@gorj-tessella

from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet as wn
import csv

# Amount of times the two lemmatizers resulted in the same lemma
identical = 0
# Total amount of accepted test cases
total = 0
# The times wordnet or morphy had the better result respectively
wordnet_wins = set()
morphy_wins = set()

wnl = WordNetLemmatizer()

with open("noun.csv", "r", errors='replace') as f:
    reader = csv.reader(f)
    for line in reader:
        singular = line[0]
        plurals = line[1:]  # There might be multiple plurals
        for plural in plurals:
            # Lemmatize according to WordNetLemmatizer and morphy
            wn_l = wnl.lemmatize(plural, pos="n")
            m_l = wn.morphy(plural, pos="n")
            # Ignore if morphy is unable to lemmatize
            if m_l is not None:
                if wn_l != m_l:
                    # If wordnet is right and morphy is not:
                    if wn_l == singular:
                        wordnet_wins.add(plural)
                    # If morphy is right and wordnet is not:
                    if m_l == singular:
                        morphy_wins.add(plural)
                else:
                    identical += 1
                total += 1

# In case there are alternate spellings, add them to "tie"
# and remove them from the individual ones
tie = wordnet_wins.intersection(morphy_wins)
wordnet_wins -= tie
morphy_wins -= tie
breakpoint()

(Python 3.7+, Python 3.5+ onwards if you remove the
This program will go through a file

Results

Out of more than 142000 plurals, morphy was only able to lemmatize 31613, as it requires the lemma to be a known word according to WordNet, and many of the plurals in the list are multiple words (
The remaining times there was a difference. The interesting part here is finding out which one was accurate more often.

For each of these plurals,
['abs', 'acoustics', 'acres', 'aesthetics', 'affairs', 'aides', 'aids', 'allies', 'aloes', 'alps', 'ambages', 'amenities', 'anagrams', 'anas', 'ancients', 'anklets', 'ans', 'antipodes', 'antitrades', 'aras', 'archives', 'ares', 'arms', 'as', 'ascomycetes', 'assets', 'assizes', 'baas', 'balusters', 'banks', 'baptists', 'barrels', 'bars', 'basics', 'basidiomycetes', 'bbs', 'beads', 'beatniks', 'beats', 'bellows', 'bends', 'billings', 'bitters', 'bleachers', 'blinks', 'bloomers', 'blues', 'boards', 'bounds', 'bowels', 'bowls', 'boxcars', 'boxers', 'braces', 'brakes', 'breakers', 'brethren', 'bridges', 'briefs', 'brits', 'brooks', 'bunches', 'buns', 'burnouses', 'burns', 'buttocks', 'callas', 'canaries', 'candlepins', 'canticles', 'cascades', 'ceres', 'chains', 'chambers', 'channels', 'charades', 'checkers', 'cheviots', 'chilblains', 'chips', 'chives', 'circumstances', 'clams', 'clappers', 'classics', 'cleaners', 'cleats', 'cleavers', 'clews', 'clocks', 'clutches', 'cobblers', 'cocos', 'coevals', 'collards', 'comforts', 'comics', 'commons', 'communications', 'compliments', 'conditions', 'congratulations', 'conserves', 'contemporaries', 'contents', 'contras', 'conveniences', 'cords', 'corduroys', 'corrections', 'cos', 'costs', 'cows', 'crabs', 'craps', 'credentials', 'creeps', 'creepy-crawlies', 'crossroads', 'cs', 'customs', 'cyclopes', 'darts', 'das', 'davys', 'days', 'debs', 'deeds', 'descendants', 'deserts', 'devices', 'dialectics', 'diggings', 'digs', 'dippers', 'dominoes', 'dominos', 'dos', 'doubles',
'dozens', 'draughts', 'drawers', 'dregs', 'duckpins', 'duds', 'dumas', 'dumplings', 'dumps', 'dunkers', 'dynamics', 'eas', 'eggs', 'elements', 'elves', 'ethics', 'eyeglasses', 'eyes', 'falls', 'fas', 'fatigues', 'feelings', 'fesses', 'fields', 'fifties', 'finances', 'findings', 'fives', 'fixings', 'flaps', 'flats', 'flies', 'folks', 'follies', 'followers', 'forties', 'fries', 'fumes', 'fundamentals', 'funds', 'funnies', 'gasteromycetes', 'gates', 'gens', 'giblets', 'glassworks', 'goldfields', 'gospels', 'graphics', 'greaves', 'greens', 'gripes', 'grits', 'groats', 'grotesqueries', 'groves', 'guts', 'gyps', 'hackles', 'hands', 'hanks', 'harmonics', 'hays', 'heaps', 'heavens', 'heaves', 'highlands', 'hindquarters', 'hippies', 'hipsters', 'hives', 'hooks',
'hoops', 'hops', 'horseshoes', 'hours', 'huaraches', 'humans', 'hurdles', 'hymeneals', 'hysterics', 'indris', 'ins', 'ironsides', 'isometrics', 'jacks', 'jackstraws', 'jaspers', 'jimmies', 'jitters', 'johns', 'judges', 'junkers', 'khakis', 'knobkerries', 'knuckles', 'ks', 'lamentations', 'lancers', 'lashings', 'lats', 'laurels', 'laws', 'leaders', 'lees', 'leftovers', 'legs', 'leotards', 'letters', 'liabilities', 'limbers', 'loads', 'lodgings', 'logos', 'loins', 'loos', 'losses', 'lots', 'lowlands', 'lyons', 'majors', 'manes', 'manners', 'marbles', 'marches', 'marines', 'marks', 'mars', 'marshals', 'masses', 'masters', 'mates', 'maths', 'maulers', 'mays', 'means', 'mechanics', 'medlars', 'megrims', 'methodists', 'metrics', 'mills', 'minors', 'minutes', 'mnemonics', 'mods', 'monas', 'mopes', 'morals', 'mores', 'myxomycetes', 'najas', 'names', 'nerves', 'ninepins', 'nones', 'nothings', 'numbers', 'occasions', 'oddments', 'operations', 'optics', 'organs', 'outskirts', 'oxen', 'pants', 'parks', 'parsons', 'parts', 'peanuts', 'pickings', 'piles', 'pinches', 'plaudits', 'pliers', 'plyers', 'polemics', 'polls', 'pooves', 'porcupines',
'prelims', 'premises', 'primates', 'privates', 'proceedings', 'profits', 'propaedeutics', 'prophets', 'props', 'proverbs', 'provisions', 'psalms', 'pyrites', 'quadratics', 'queens', 'quoits', 'raffles', 'rafts', 'rails', 'rastas', 'rates', 'receipts', 'relations', 'reserves', 'rings', 'roads', 'rockers', 'rooms', 'roots', 'rounders', 'ruddles', 'rudiments', 'sales', 'sauternes',
'saxes', 'scads', 'scopes', 'scores', 'scots', 'scraps', 'scruples', 'seats', 'sellers', 'sens', 'services', 'sessions', 'settlings', 'shakers', 'shambles', 'shears', 'shekels', 'shirtsleeves', 'shoes', 'shorts', 'shucks', 'silks', 'sills', 'silversides', 'singles', 'sis', 'skinheads', 'skittles', 'skivvies', 'slews', 'slops', 'snips', 'snuffers', 'sops', 'sos', 'soviets', 'spareribs', 'specs', 'spectacles', 'spillikins', 'spirits', 'splinters', 'spots', 'sprinkles', 'sprites', 'stacks', 'staggers', 'stairs', 'stakes', 'stalls', 'stamina', 'stations', 'stays', 'steps', 'stigmata', 'stockholdings', 'stops', 'straits', 'stripes', 'sweetbreads', 'tabernacles', 'tactics', 'tails', 'talks', 'taps', 'taxis', 'tears', 'teens', 'tenpins', 'terms', 'testudines',
'things', 'thirties', 'threads', 'throes', 'tigers', 'tons', 'tours', 'transactions', 'trappings', 'trembles', 'trimmings', 'troglodytes', 'troops', 'tropics', 'trumpets', 'trunks', 'tums', 'twenties', 'twins', 'values', 'vapors', 'velours', 'vespers', 'viands', 'vibes', 'victuals', 'viewers', 'waders', 'wads', 'wages', 'wales', 'watts', 'ways', 'weeds', 'wells', 'whiskers', 'windows', 'wings', 'winnings', 'wits', 'woods', 'words', 'workings', 'yaws', 'years', 'yips']

In total,

For each of these plurals,
['beanies', 'colors', 'corps', 'corpses', 'coss', 'cruses', 'dies', 'doses', 'gendarmeries', 'grazes', 'grounds', 'kurus', 'morses', 'motives', 'muses', 'paraleipses', 'ploughmen', 'raves', 'reeves', 'reverses', 'roomies', 'ruses', 'senses', 'serves', 'sharpies', 'species', 'stogies', 'taeniae', 'touracos', 'uses', 'vases', 'weirdies']

In total,

And for each of these plurals both methods return a different but valid alternative spelling.
['adzes', 'aeries', 'alexanders', 'annexes', 'aunties', 'battle-axes', 'bennies', 'blintzes', 'bogies', 'bolshies', 'booties', 'broadaxes', 'caddies', 'cartouches', 'cookies', 'coolies', 'cowries', 'crosses', 'darkies', 'dearies', 'dickies', 'doggies', 'dogies', 'eyries', 'faeries', 'floozies', 'goonies', 'grannies', 'hankies', 'hoagies', 'honkies', 'innings', 'junkies', 'kelpies', 'lenses', 'links', 'loonies', 'marquises', 'meanies', 'mews', 'mollies', 'oraches', 'organdies', 'panties', 'pas', 'pavises', 'pickaxes', 'pinkies', 'pixies', 'poleaxes', 'punkies', 'quickies', 'reveries', 'scrubs', 'shingles', 'sises', 'smoothies', 'softies', 'statistics', 'stymies', 'thrips', 'townies']

In total, there are 62 occurrences of both methods returning a different but valid alternative spelling.

In conclusion, though choosing the shortest lemma (like
So, this is not something that should be changed.
|
Also where |
Also |
|
Like most linguistic tools, lemmatizers are not perfect: their accuracy is an open research problem, and more research is needed to achieve perfect lemmatization. This suggests drawing a distinction between open research problems and software issues. (Edited) So, as long as we think that NLTK correctly implements the best known algorithms, there might not be a real software issue. But thanks to @nezda's comment below, we now know that NLTK's implementation cannot be correct. However, it is also well known that Princeton morphy is overly permissive (it "over-generates"). This is usually considered a convenient feature for analysis, and not a bug. For example, the Princeton wn program recognizes cates as cat, and siped as sip. This is a known feature of the original morphy algorithm, and fixing it would mean implementing a different algorithm. This suggests fixing only those errors in NLTK's morphy implementation that do not stem from the original Princeton algorithm. |
A lookup table would be 100% accurate (for cases that are included), and would not require new research. It would require some sweat, but we have the start of such a table right here in this GitHub issue.
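To make the lookup-table idea concrete, here is a minimal sketch. It assumes rows shaped like the noun.csv used earlier in this thread (singular first, then one or more plurals). The sample rows, `build_table`, and `lemmatize` are hypothetical names made up for illustration, not part of NLTK.

```python
import csv
import io

# Hypothetical sample rows in the "singular,plural[,plural...]" layout.
SAMPLE_CSV = "corpse,corpses\nox,oxen\nelf,elves\n"


def build_table(rows):
    """Map each plural form to its singular; the first mapping seen wins."""
    table = {}
    for row in rows:
        if not row:
            continue
        singular, plurals = row[0], row[1:]
        for plural in plurals:
            table.setdefault(plural, singular)
    return table


def lemmatize(word, table, fallback=None):
    """Look the word up first; otherwise defer to an optional fallback."""
    if word in table:
        return table[word]
    return fallback(word) if fallback is not None else word


table = build_table(csv.reader(io.StringIO(SAMPLE_CSV)))
print(lemmatize("corpses", table))
print(lemmatize("oxen", table))
```

Words missing from the table fall through to `fallback`, which could be any rule-based lemmatizer, so the table only overrides the cases it actually covers.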
…On December 29, 2023 3:30:04 AM EST, Eric Kafe ***@***.***> wrote:
Like most linguistic tools, lemmatizers are not perfect, their accuracy is an open research problem, and more research is needed in order to achieve perfect lemmatization.
This suggests to draw a distinction between open research problems vs. software issues. For ex. here, where NLTK correctly implements the best known algorithms, there does not seem to be a software issue.
|
Seems like the original WordNet's Morphy implementation doesn't have many of these bugs and maybe these are in fact bugs in this implementation? For example:
|
Yes @novalis, a lexicon of inflected forms would be nice to have. There exists a good one for French, The Lefff, a Freely Available and Large-coverage Morphological and Syntactic Lexicon for French. However, such resources can only handle words within the limits of the vocabulary that they know. |
Thanks @caiw and @nezda: indeed, the original Morphy from Princeton does not have these bugs, and neither does my Prolog implementation, wn_morphy.pl. |
WordNetLemmatizer and morphy are only simple wrappers that pick a single lemma from the list of lemmas found by _morphy. The strategies chosen (shortest vs. first found lemma) may not always be adequate, and it is especially problematic that WordNetLemmatizer just returns any garbage input as its own lemma. However, I don't think these wrappers need fixing. Instead, alternative stemmers are available, and more can be developed. But we can, and should, fix the main _morphy function in wordnet.py, which returns a list of lemma candidates. It is mostly a faithful implementation of Princeton Morphy, except for one huge difference: whereas the original only does one pass over the possible morphological substitutions, the NLTK implementation can recursively keep removing suffixes. This practice dates back to the original pywordnet by @osteele, but it is not a part of the original algorithm from Princeton Morphy. For example, without this spurious recursion step, no lemma would be found for "iss" using morphy, because the "s" ending would only be stripped once. |
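The difference between one pass and repeated suffix stripping can be sketched without NLTK. This is a toy model, not the code in wordnet.py: the noun rules are written from memory of the Morphy detachment rules, and `TOY_NOUNS` is a hypothetical stand-in for WordNet's noun index. All function names are made up for illustration.

```python
# Noun detachment rules as (suffix, replacement) pairs; an assumption
# modelled on Morphy-style rules, not copied from NLTK.
NOUN_RULES = [("s", ""), ("ses", "s"), ("ves", "f"), ("xes", "x"), ("zes", "z"),
              ("ches", "ch"), ("shes", "sh"), ("men", "man"), ("ies", "y")]

# Hypothetical toy vocabulary standing in for the WordNet noun index.
TOY_NOUNS = {"i", "corpse", "corps"}


def apply_rules(form):
    """Apply every matching rule exactly once to a single form."""
    return [form[:-len(old)] + new for old, new in NOUN_RULES if form.endswith(old)]


def single_pass(form, vocab=TOY_NOUNS):
    """One pass only: check the form itself and its direct rewrites."""
    return [c for c in [form] + apply_rules(form) if c in vocab]


def recursive(form, vocab=TOY_NOUNS):
    """Keep re-applying the rules until something in the vocabulary turns up."""
    forms = apply_rules(form)
    found = [c for c in [form] + forms if c in vocab]
    while not found and forms:
        forms = [g for f in forms for g in apply_rules(f)]
        found = [c for c in forms if c in vocab]
    return found


def pick_shortest(candidates, word):
    """Shortest-lemma strategy; echoes the input when nothing was found."""
    return min(candidates, key=len) if candidates else word


def pick_first(candidates):
    """First-found strategy; returns None when nothing was found."""
    return candidates[0] if candidates else None


print(single_pass("iss"), recursive("iss"))
candidates = single_pass("corpses")
print(pick_shortest(candidates, "corpses"), pick_first(candidates))
```

In this toy setup "iss" yields nothing in one pass but reaches "i" once the "s" is stripped twice. For "corpses" the two selection strategies disagree ("corps" vs. "corpse"), which is the same kind of disagreement reported in the lists above.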
now that's what i call a deep dive - nice one @ekaf |
Some words are lemmatized improperly, due to picking the smallest possible lemma: