
Support OMW 1.4 #2899

Merged
merged 13 commits into nltk:develop on Dec 8, 2021

Conversation

ekaf
Contributor

@ekaf ekaf commented Nov 29, 2021

This PR adapts the multilingual functions in wordnet.py to use the new OMW-data 1.4 (nltk/nltk_data#171), the recent release of the Open Multilingual Wordnet.

The new nltk_data/corpora/omw package has a slightly different directory layout, where each folder name indicates the provenance of the wordnets it contains.

For English and Italian, OMW now includes wordnets from two different provenances, so the lang parameter needs to encode the provenance whenever more than one wordnet exists for the same language.

In addition to lemmas, some wordnets in OMW 1.4 now also include definitions (def) and examples (exe).

This PR supports both the new and the old omw formats.
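
To illustrate the provenance-qualified codes, here is a minimal sketch (assuming the new omw data and this branch are installed; the word and language codes are just examples):

from nltk.corpus import wordnet as wn

# 'ita' and 'ita_iwn' are two Italian provenances in OMW 1.4
ss = wn.synsets('cane', lang='ita')[0]
print(ss.lemmas(lang='ita'))          # lemmas from the first Italian wordnet
print(ss.lemmas(lang='ita_iwn'))      # lemmas from the second Italian provenance
print(ss.definition(lang='ita_iwn'))  # 'def' field, when the wordnet provides it
print(ss.examples(lang='ita_iwn'))    # 'exe' field, when the wordnet provides it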

@ekaf
Contributor Author

ekaf commented Nov 29, 2021

Example use, adapted from #2423 (comment)

>>> import nltk
>>> from nltk.corpus import wordnet31 as wn
>>> print(f"Wordnet v. {wn.get_version()}\n")
Wordnet v. 3.1

>>> print(wn.langs())
dict_keys(['eng', 'als', 'arb', 'bul', 'cmn', 'qcn', 'dan', 'ell', 'eng_eng', 'fas', 'fin', 'fra', 'heb', 'hrv', 'isl', 'ita', 'ita_iwn', 'jpn', 'cat', 'eus', 'glg', 'spa', 'ind', 'zsm', 'nld', 'nno', 'nob', 'pol', 'por', 'ron', 'lit', 'slk', 'slv', 'swe', 'tha'])

>>> for ss in wn.synsets('犬', lang='jpn'):
...     print(ss)
...     for lg in ["jpn", "ita_iwn"]:
...         print(f"{lg} lemmas:{ss.lemmas(lang=lg)}")
...         print(f"{lg} definition:{ss.definition(lang=lg)}")
...     print()

Synset('dog.n.01')
jpn lemmas:[Lemma('dog.n.01.ドッグ'), Lemma('dog.n.01.イヌ'), Lemma('dog.n.01.洋犬'), Lemma('dog.n.01.犬'), Lemma('dog.n.01.飼い犬'), Lemma('dog.n.01.飼犬')]
jpn definition:['有史以前から人間に家畜化されて来た(おそらく普通のオオカミを先祖とする)イヌ属の動物', '多数の品種がある']
ita_iwn lemmas:[Lemma('dog.n.01.cane')]
ita_iwn definition:['animale domestico molto comune, diffuso in tutto il mondo, usato per la caccia, la difesa, nella pastorizia, e come animale da compagnia']

Synset('spy.n.01')
jpn lemmas:[Lemma('spy.n.01.スパイ'), Lemma('spy.n.01.いぬ'), Lemma('spy.n.01.回し者'), Lemma('spy.n.01.回者'), Lemma('spy.n.01.密偵'), Lemma('spy.n.01.工作員'), Lemma('spy.n.01.廻し者'), Lemma('spy.n.01.廻者'), Lemma('spy.n.01.探り'), Lemma('spy.n.01.探'), Lemma('spy.n.01.犬'), Lemma('spy.n.01.秘密捜査員'), Lemma('spy.n.01.まわし者'), Lemma('spy.n.01.諜報員'), Lemma('spy.n.01.諜者'), Lemma('spy.n.01.間者'), Lemma('spy.n.01.間諜'), Lemma('spy.n.01.隠密')]
jpn definition:['敵の情報を得るために国家に雇われた、または競合他社の企業秘密を得るために会社に雇われた秘密諜報部員']
ita_iwn lemmas:[Lemma('spy.n.01.agente_segreto'), Lemma('spy.n.01.emissario'), Lemma('spy.n.01.spia')]
ita_iwn definition:['chi esercita lo spionaggio']

@ekaf
Contributor Author

ekaf commented Dec 4, 2021

>>> import nltk
>>> from nltk.corpus import wordnet as wn
>>> print(f"Wordnet v. {wn.get_version()}\n")
Wordnet v. 3.0

>>> for lg in sorted(wn.langs()):
...     print(f"{lg}: {len(list(wn.words(lang=lg)))} words in {len(list(wn.all_synsets(lang=lg)))} synsets")

als: 5988 words in 4675 synsets
arb: 17785 words in 9916 synsets
bul: 6720 words in 4959 synsets
cat: 46531 words in 45826 synsets
cmn: 61533 words in 42300 synsets
dan: 4468 words in 4476 synsets
ell: 18225 words in 18049 synsets
eng: 147306 words in 117659 synsets
eus: 26240 words in 29413 synsets
fin: 129839 words in 116763 synsets
fra: 55351 words in 59091 synsets
glg: 23124 words in 19311 synsets
heb: 5325 words in 5448 synsets
hrv: 29008 words in 23115 synsets
ind: 36954 words in 38085 synsets
isl: 11504 words in 4951 synsets
ita: 41855 words in 35001 synsets
ita_iwn: 19221 words in 15563 synsets
jpn: 91964 words in 57184 synsets
lit: 11395 words in 9462 synsets
nld: 43077 words in 30177 synsets
nno: 3387 words in 3671 synsets
nob: 4186 words in 4455 synsets
pol: 45387 words in 33826 synsets
por: 54071 words in 43895 synsets
ron: 49987 words in 56026 synsets
slk: 29150 words in 18507 synsets
slv: 40230 words in 42583 synsets
spa: 36681 words in 38512 synsets
swe: 5824 words in 6796 synsets
tha: 82504 words in 73350 synsets
zsm: 33932 words in 36911 synsets

@ekaf
Contributor Author

ekaf commented Dec 4, 2021

After the latest commit, pytest succeeds on Windows but fails on macOS and Ubuntu.
The error doesn't seem related to this PR:

E LookupError:
E **********************************************************************
E Resource inaugural not found.
E Please use the NLTK Downloader to obtain the resource:

@tomaarsen
Member

@ekaf
This has been caused by an issue now solved through nltk/nltk_data#174. However, the CI has cached the (broken) nltk_data. Rerunning the CI with a new CACHE_VERSION secret should force the CI to fetch nltk_data from scratch, but it seems that it does not. I'll try to push an empty commit to force the CI to restart. If all is well, that should work.

@tomaarsen
Member

Annoyingly, this does not seem to be working: the broken cache of nltk_data is still being used, and I can't really tell why. The CACHE_VERSION secret was changed, and I added it to the cache key specifically to allow me to reset the cache, as suggested in this SO link: https://stackoverflow.com/questions/63521430/clear-cache-in-github-actions.

This is getting a bit frustrating.

@ekaf
Contributor Author

ekaf commented Dec 6, 2021

@tomaarsen, there seem to be workarounds at r-lib/actions#86

@ekaf
Contributor Author

ekaf commented Dec 7, 2021

@tomaarsen, here is a commit that looks like it worked: geocompx/geocompr@9189efb

@ekaf
Contributor Author

ekaf commented Dec 7, 2021

@tomaarsen: prefixing the key with "new-" in .github/workflows/ci.yaml actually cleared the cache. This also needs to be done in a second place in the file, so that the new cache is used instead of the old one. Maybe the reason that only changing the secret didn't work is that the variable is interpreted as empty (I'm just guessing...).

@ekaf
Contributor Author

ekaf commented Dec 8, 2021

@tomaarsen, the changes to .github/workflows/ci.yaml don't belong in this PR, since they solve a completely different problem. So maybe that part should be split out into another PR about clearing cached dependencies. However, since I don't control the ${{ secrets.CACHE_VERSION }} variable, I feel that you would be better equipped to handle this.

On the other hand, the update to wordnet.py is urgently needed to fix the new issue #2905 (comment), which arises because the new OMW package was merged into nltk_data without also merging the present PR.

Please let me know if there is anything I can do about this.

Member

@tomaarsen tomaarsen left a comment

I've scheduled some time this morning to look at this PR. I've reverted to using the "normal" key, and it seems the cache has been refreshed by now. I've also created a helper method for Synset.definition() and Synset.examples(), as the code for these was nearly identical.
Beyond that, I had to update some doctests which were failing due to the nltk_data changes.
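
For reference, a rough, self-contained sketch of what such a shared helper could look like (names and structure are illustrative only, not necessarily the code added in this PR):

# Hypothetical sketch: definition() and examples() share one lookup helper
# keyed by the OMW field name ('def' or 'exe').
class MiniSynset:
    def __init__(self, eng_definition, eng_examples, lang_data):
        self._definition = eng_definition
        self._examples = eng_examples
        self._lang_data = lang_data  # lang -> {"def": [...], "exe": [...]}

    def _doc(self, doc_type, default, lang="eng"):
        # English falls back to the value already stored on the synset;
        # other languages look up the requested field in the OMW data.
        if lang == "eng":
            return default
        return self._lang_data.get(lang, {}).get(doc_type)

    def definition(self, lang="eng"):
        return self._doc("def", self._definition, lang=lang)

    def examples(self, lang="eng"):
        return self._doc("exe", self._examples, lang=lang)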

If these tests are failing for you locally, then either:

  • Your omw nltk_data is outdated, or
  • You've updated your omw nltk_data, but the old files were not removed. Deleting the omw folder within nltk_data and re-downloading it (see the sketch below) will solve this. Alternatively, you can delete the entire nltk_data directory and re-download it all.
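
For example, a minimal way to refresh the data from Python (assuming the corpus id is 'omw'; deleting the folder manually first ensures no stale files are left behind):

import nltk

# Force a fresh download of the omw corpus, overwriting the cached copy
nltk.download('omw', force=True)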

This PR is ready for merging as far as I can tell.

The problematic thing is that nltk_data cannot be pinned to an older version. People can't say "Oh, my NLTK is locked to 3.2.5, so I'll use the nltk_data that works with that version". Because of these changes, no released NLTK version works as expected, with the exception of this PR.
It is a priority that we merge this PR and publish a new version.

In part due to this PR and its consequences, I believe it's time to release 3.7.0 rather than 3.6.6. After all, the nltk_data changes essentially deprecate all currently released NLTK versions, I'm afraid.

@ekaf
Contributor Author

ekaf commented Dec 8, 2021

@tomaarsen, I'm sorry for all the trouble you're having with this PR.
I have now tested your changes to wordnet.doctest with "tox -e py39", and everything was fine except for one failure:


096     >>> len(inaugural.words())
Expected:
    152901
Got:   
    149797

@tomaarsen
Member

@ekaf This is likely a consequence of having an outdated inaugural. python -m nltk.downloader --force inaugural ought to help. If that does not help, then it might be because inaugural was temporarily broken, meaning that you might have unintended files in your local version of nltk_data.

@ekaf
Contributor Author

ekaf commented Dec 8, 2021

@tomaarsen yes, you are right: with the new inaugural package, all tests now succeed.
Congratulations :)

@tomaarsen
Member

Glad to hear!

I'll merge this, so people with issues like #2905 at least have a solution that isn't just using this PR. Thanks for these changes, and thanks for bearing with me while we've been having these cache issues.

@tomaarsen tomaarsen merged commit 8ed8b70 into nltk:develop Dec 8, 2021
@ekaf
Contributor Author

ekaf commented Dec 8, 2021

Definitions and examples also work with Albanian ('als'):

>>> import nltk
>>> from nltk.corpus import wordnet2021 as wn
>>> print(f"Wordnet v. {wn.get_version()}\n")
Wordnet v. 2021

>>> ss = wn.synset('school.n.02')
>>> lg = 'als'
>>> print(f"{lg} lemmas:{ss.lemmas(lang=lg)}")
als lemmas:[Lemma('school.n.02.mësonjëtore'), Lemma('school.n.02.shkollë')]
>>> print(f"{lg} definition:{ss.definition(lang=lg)}")
als definition:['institucion arsimor ku mëson dhe edukohet në mënyrë të organizuar brezi i ri; një institucion i tillë i specializuar; ndërtesa e këtij institucioni']
>>> print(f"{lg} examples:{ss.examples(lang=lg)}")
als examples:['Shkolla është ndërtuar më 1932', 'Ai shkon në shkoll çdo ditë']

@ekaf ekaf deleted the omw14 branch December 8, 2021 14:21
@ekaf ekaf mentioned this pull request Jan 17, 2024