Fix wordnet's all_synsets() function #3078

ekaf · 2022-12-01T15:15:46Z

Fix #3077, and a few related problems in wordnet.py.

With the English Wordnet, properly distinguish satellites from adjectives:

from nltk.corpus import wordnet as wn
print(wn.get_version())

3.0

print(len(list(wn.all_synsets(pos="a"))))
18156

print(len(list(wn.all_synsets(pos="s"))))
10693
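The two counts above suggest that pos="a" covers every synset in the adjective data file, while pos="s" selects only the satellite subset. Under that assumption, the number of head adjectives is the difference between the two. The following is a minimal toy sketch of that filtering idea, not NLTK's implementation: `ToySynset`, `ADJ_FILE` and `toy_all_synsets` are hypothetical names invented for illustration.

```python
from typing import Iterator, NamedTuple, Optional


class ToySynset(NamedTuple):
    offset: int
    pos: str  # "a" = head adjective, "s" = satellite adjective


# Hypothetical stand-in for the records of a single adjective data file.
ADJ_FILE = [
    ToySynset(1, "a"),
    ToySynset(2, "s"),
    ToySynset(3, "s"),
    ToySynset(4, "a"),
    ToySynset(5, "s"),
]


def toy_all_synsets(pos: Optional[str] = None) -> Iterator[ToySynset]:
    """Yield toy synsets: "s" keeps only satellites, anything else yields the whole file."""
    for synset in ADJ_FILE:
        if pos == "s" and synset.pos != "s":
            continue  # skip head adjectives when only satellites are requested
        yield synset


total = len(list(toy_all_synsets(pos="a")))
satellites = len(list(toy_all_synsets(pos="s")))
heads = total - satellites
print(total, satellites, heads)
```

Applied to the figures reported above, the same subtraction would give 18156 - 10693 = 7463 head adjectives, provided the assumption about pos="a" holds.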

With other languages, part of the problem is caused by the synset_from_pos_and_offset() function, which returns an empty synset object when an offset is incorrect. This inflates the synset counts with OMW for two reasons: OMW does include some incorrect offsets, and with later WordNet versions, synsets that are lost in the mapping should not be counted. So, when a synset does not exist, this PR returns None instead of an empty synset object:

ss = wn.synset_from_pos_and_offset('n',1234)
print(ss)

None

wn.add_omw()
print(len(list(wn.all_synsets(lang="hrv"))))

23115
print(len(list(wn.all_synsets(pos="n", lang="hrv"))))
16177
print(len(list(wn.all_synsets(pos="v", lang="hrv"))))
4736
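Because a missing synset is now reported as None, code that counts or iterates over mapped offsets can skip the invalid ones explicitly. The sketch below illustrates that pattern with a toy lookup table. It does not call NLTK: `TOY_INDEX`, `toy_synset_from_pos_and_offset` and `count_valid` are hypothetical names used only for this example.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy index: (pos, offset) -> synset name.
TOY_INDEX: Dict[Tuple[str, int], str] = {
    ("n", 100): "dog.n.01",
    ("n", 200): "cat.n.01",
    ("v", 300): "run.v.01",
}


def toy_synset_from_pos_and_offset(pos: str, offset: int) -> Optional[str]:
    """Return the synset name, or None when no synset exists at that offset."""
    return TOY_INDEX.get((pos, offset))


def count_valid(keys: List[Tuple[str, int]]) -> int:
    """Count only the keys that resolve to an existing synset."""
    return sum(
        1
        for pos, offset in keys
        if toy_synset_from_pos_and_offset(pos, offset) is not None
    )


# A toy mapping that contains one incorrect offset (1234).
mapped_keys = [("n", 100), ("n", 1234), ("v", 300)]
missing = toy_synset_from_pos_and_offset("n", 1234)
valid_count = count_valid(mapped_keys)
print(missing)
print(valid_count)
```

With an empty placeholder object in place of None, all three keys would have been counted. The explicit `is not None` check is what keeps the incorrect offset out of the total.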

Another problem is that the handling of satellites in the mappings was inconsistent. The mapping procedure has now been rewritten in a style that makes it easier to follow the underlying algorithm.

from nltk.corpus import wordnet31 as wn31
print(wn31.get_version())

3.1

wn31.add_omw()
print(len(list(wn31.all_synsets(lang="hrv"))))

23098
print(len(list(wn31.all_synsets(pos="a", lang="hrv"))))
1363
print(len(list(wn31.all_synsets(pos="s", lang="hrv"))))
444

tomaarsen

I've added some simple tests. I also accidentally brought 2f34b52 into this PR, so I rebased and removed some of the unnecessary commits. Apologies for this. Make sure to update locally to avoid issues if you make changes!

I had some comments and requests for changes. I hope you'll be able to have a look at them.

tomaarsen

This is looking good! Thanks for always helping maintain Wordnet in NLTK!

ekaf added 11 commits October 5, 2022 11:06

Add support for ISO-639-3 language codes

46fb31b

Add langname() function with doctest

f2b9ce4

Add file header

cbfc2ec

Revert "Add langname() function with doctest"

5b15da6

This reverts commit f2b9ce4.

Revert "Add file header"

a279fed

This reverts commit cbfc2ec.

Revert "Add support for ISO-639-3 language codes"

acb8c23

This reverts commit 46fb31b.

Merge remote-tracking branch 'upstream/develop' into develop

85c3fc6

Merge remote-tracking branch 'upstream/develop' into develop

daee9a4

Merge remote-tracking branch 'upstream/develop' into develop

7c75271

Merge remote-tracking branch 'upstream/develop' into develop

fe227ce

Fix all_synsets() function

a3d6ae4

github-actions bot added CI corpus wordnet labels Dec 6, 2022

ekaf and others added 3 commits December 6, 2022 13:24

Fix all_synsets() function

7c86c31

Add simple regression tests for nltk#3077

60b19b7

Merge branch 'develop' of https://github.com/nltk/nltk into ekaf-all_…

518a28a

…synsets

tomaarsen force-pushed the all_synsets branch from f067ee0 to 518a28a Compare December 6, 2022 12:28

github-actions bot removed the CI label Dec 6, 2022

tomaarsen requested changes Dec 6, 2022

nltk/corpus/reader/wordnet.py (four review comments, outdated and resolved)

ekaf added 2 commits December 7, 2022 09:22

Merge commit 'refs/pull/3078/head' of https://github.com/nltk/nltk in…

e2938bd

…to all_synsets

Add suggestions by @tomaarsen

dda5d05

tomaarsen approved these changes Dec 7, 2022

tomaarsen merged commit 3ca43e2 into nltk:develop Dec 7, 2022

ekaf deleted the all_synsets branch December 8, 2022 10:00
