Fix wordnet's all_synsets() function #3078

Merged 16 commits on Dec 7, 2022

@ekaf (Contributor) commented Dec 1, 2022

Fix #3077, and a few related problems in wordnet.py.

  1. With the English Wordnet, properly distinguish satellites from adjectives:
```python
from nltk.corpus import wordnet as wn

print(wn.get_version())
# 3.0

print(len(list(wn.all_synsets(pos="a"))))
# 18156

print(len(list(wn.all_synsets(pos="s"))))
# 10693
```

  2. With other languages, part of the problem lies in the synset_from_pos_and_offset() function, which returns an empty synset object when an offset is incorrect. This inflates synset counts with OMW for two reasons: OMW includes some incorrect offsets, and with later wordnets, synsets that are lost in mapping should not be counted. So, when a synset does not exist, this PR returns None instead of an empty synset object:
```python
ss = wn.synset_from_pos_and_offset('n', 1234)
print(ss)
# None

wn.add_omw()
print(len(list(wn.all_synsets(lang="hrv"))))
# 23115

print(len(list(wn.all_synsets(pos="n", lang="hrv"))))
# 16177

print(len(list(wn.all_synsets(pos="v", lang="hrv"))))
# 4736
```
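
Since a failed lookup now yields None, code that resolves many offsets has to skip those results instead of counting them. The following is only a minimal sketch of that caller-side pattern, not the code from this PR: `existing_synsets`, `fake_lookup`, and the toy table are hypothetical stand-ins, with `fake_lookup` playing the role of `wn.synset_from_pos_and_offset` so the example runs without NLTK or its data.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple


def existing_synsets(
    keys: Iterable[Tuple[str, int]],
    lookup: Callable[[str, int], Optional[object]],
) -> Iterator[object]:
    """Yield only the lookups that succeed; skip keys that resolve to None."""
    for pos, offset in keys:
        synset = lookup(pos, offset)
        if synset is not None:
            yield synset


# Toy table standing in for a wordnet (assumption, not real WordNet data).
_TOY_DB = {("n", 100): "synset-A", ("v", 200): "synset-B"}


def fake_lookup(pos: str, offset: int) -> Optional[str]:
    """Hypothetical stand-in: returns None for unknown (pos, offset) pairs."""
    return _TOY_DB.get((pos, offset))


# ("n", 1234) is not in the toy table, so it is skipped rather than counted.
found = list(existing_synsets([("n", 100), ("n", 1234), ("v", 200)], fake_lookup))
print(found)
```

With an empty placeholder object instead of None, the second key would have been counted as a synset, which is the kind of inflation described above.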

Another problem is that satellites were handled inconsistently in the mappings. The mapping procedure has now been rewritten so that the underlying algorithm is easier to follow.

```python
from nltk.corpus import wordnet31 as wn31

print(wn31.get_version())
# 3.1

wn31.add_omw()
print(len(list(wn31.all_synsets(lang="hrv"))))
# 23098

print(len(list(wn31.all_synsets(pos="a", lang="hrv"))))
# 1363

print(len(list(wn31.all_synsets(pos="s", lang="hrv"))))
# 444
```
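
One way to cross-check per-POS counts like these in a single pass is to tally synsets by their part of speech. This is a rough sketch, not part of the PR: `pos_counts` and `FakeSynset` are hypothetical names, and `FakeSynset` only mimics the `pos()` method of NLTK's `Synset` so the example runs without the WordNet data.

```python
from collections import Counter
from typing import Iterable


class FakeSynset:
    """Hypothetical stand-in exposing only pos(), like NLTK's Synset.pos()."""

    def __init__(self, pos: str) -> None:
        self._pos = pos

    def pos(self) -> str:
        return self._pos


def pos_counts(synsets: Iterable) -> Counter:
    """Count synsets per part-of-speech tag."""
    return Counter(ss.pos() for ss in synsets)


# Toy input: two nouns, one verb, one adjective, one satellite.
toy = [FakeSynset(p) for p in ["n", "n", "v", "a", "s"]]
counts = pos_counts(toy)
print(dict(counts))
```

With NLTK and the WordNet data installed, something like `pos_counts(wn31.all_synsets(lang="hrv"))` could be compared against the per-POS calls. How satellites are split between "a" and "s" depends on what `pos()` reports for each synset, so treat the tally as a sanity check rather than a specification.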

@tomaarsen (Member) left a comment

I've added some simple tests. I also accidentally brought 2f34b52 into this PR, so I rebased and removed some of the unnecessary commits. Apologies for this. Make sure to update locally to avoid issues if you make changes!

I had some comments and requests for changes. I hope you'll be able to have a look at them.

Four review threads on nltk/corpus/reader/wordnet.py (outdated, resolved)

@tomaarsen (Member) left a comment

This is looking good! Thanks for always helping maintain Wordnet in NLTK!

@tomaarsen tomaarsen merged commit 3ca43e2 into nltk:develop Dec 7, 2022
@ekaf ekaf deleted the all_synsets branch December 8, 2022 10:00
Successfully merging this pull request may close these issues.

Incorrect part-of-speech filtering in Wordnet's all_synsets()