Use Multilingual Wordnet Data from OMW with newer Wordnet versions #2889

Merged: 4 commits into nltk:develop on Nov 29, 2021

@ekaf ekaf commented Nov 24, 2021

This PR complements #2860 by providing a compatibility mapping between any newer Wordnet version (starting with Wordnet 3.1) and the multilingual Wordnet data from nltk_data/corpora/omw, which uses synset offsets from Wordnet version 3.0.

Each Wordnet version uses different synset offsets as identifiers, but sense keys are permanent across versions. This PR therefore reads the index.sense file that is provided with each Wordnet package and automatically constructs a mapping from Wordnet 3.0 offsets to the offsets of any newer Wordnet package, so that the newer package can be used to access multilingual Wordnet synsets.
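To illustrate the idea, here is a minimal sketch of a sense-key based offset mapping. It is not the code in this PR: the function names `read_sense_index` and `map_offsets` are hypothetical, the sample lines are made up for illustration, and it assumes the documented index.sense layout (`sense_key synset_offset sense_number tag_cnt`). When the sense keys of one old synset end up in several new synsets, the sketch simply picks the most frequent target, which is also an assumption.

```python
from collections import Counter, defaultdict

# ss_type digit inside a sense key -> part of speech.
# Folding satellite adjectives (5) into "a" is an assumption of this sketch.
SS_TYPE_TO_POS = {"1": "n", "2": "v", "3": "a", "4": "r", "5": "a"}


def read_sense_index(lines):
    """Parse index.sense-style lines into {sense_key: (pos, offset)}."""
    index = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        sense_key, offset = fields[0], fields[1]
        # A sense key looks like lemma%ss_type:lex_filenum:lex_id:head_word:head_id
        ss_type = sense_key.rpartition("%")[2].split(":", 1)[0]
        index[sense_key] = (SS_TYPE_TO_POS[ss_type], offset)
    return index


def map_offsets(old_lines, new_lines):
    """Map (pos, offset) ids of the old version to ids of the new version."""
    old = read_sense_index(old_lines)
    new = read_sense_index(new_lines)
    votes = defaultdict(Counter)
    for sense_key, old_id in old.items():
        if sense_key in new:  # the sense key survived into the new version
            votes[old_id][new[sense_key]] += 1
    # Keep the most frequent target for each old synset
    return {old_id: counts.most_common(1)[0][0] for old_id, counts in votes.items()}


# Illustrative lines only; the offsets are sample values, not checked against real files.
old_lines = [
    "dog%1:05:00:: 02084071 1 42",
    "domestic_dog%1:05:00:: 02084071 1 0",
    "gone%1:05:00:: 00000111 1 0",
]
new_lines = [
    "dog%1:05:00:: 02086723 1 42",
    "domestic_dog%1:05:00:: 02086723 1 0",
]
print(map_offsets(old_lines, new_lines))
```

In this toy input, the synset whose only sense key (`gone`) is absent from the new index gets no entry in the mapping, which corresponds to the "offsets lost" case.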

The advantage of this approach is that there is no need to wait for hypothetical mappings to be developed by third parties.
Instead, any Wordnet package that adheres to the official Princeton format is supported out of the box. For example, with wordnet31, the attached testlangs.py.txt produces the following statistics:

20/117564 offsets not in Wordnet 3.0 (0.02%)
126/117564 offsets lost in Wordnet 3.1 (0.11%)
328/117659 canonical synset names changed in Wordnet 3.1 (0.28%)
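
For readers who want to reproduce this kind of report, the following is a rough sketch of how the first two counts could be derived. It is not taken from testlangs.py: `coverage_stats` and `report` are hypothetical helpers, and it assumes that the OMW offsets, the Wordnet 3.0 offsets, and the 3.0-to-newer mapping are already available as plain Python collections.

```python
def coverage_stats(omw_ids, wn30_ids, mapping):
    """Return (not_in_wn30, lost, total) counts for a set of OMW synset ids."""
    omw_ids = set(omw_ids)
    # OMW ids that do not exist in Wordnet 3.0 at all (errors in the OMW data)
    not_in_wn30 = omw_ids - set(wn30_ids)
    # Valid Wordnet 3.0 ids that have no counterpart in the newer version
    lost = {i for i in omw_ids - not_in_wn30 if i not in mapping}
    return len(not_in_wn30), len(lost), len(omw_ids)


def report(count, total, label):
    """Format one statistics line in the style used above."""
    return f"{count}/{total} {label} ({100 * count / total:.2f}%)"


# Toy data: "d" is not in Wordnet 3.0, "c" has no mapping to the newer version
bad, lost, total = coverage_stats(
    omw_ids={"a", "b", "c", "d"},
    wn30_ids={"a", "b", "c"},
    mapping={"a": "x", "b": "y"},
)
print(report(bad, total, "offsets not in Wordnet 3.0"))
print(report(lost, total, "offsets lost in the newer version"))
```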

This PR has also been tested with the newer Wordnet versions discussed in #2885.

ekaf commented Nov 24, 2021

The statistics with the new wordnet2021 package from nltk/nltk_data#170 show that over 99.8% of the OMW synset offsets can be mapped to Wordnet 2021: 0.02% are OMW errors, which are not in Wordnet 3.0 (omwn/omw-data#24), and 0.17% have been removed.

20/117552 offsets not in Wordnet 3.0 (0.02%)
205/117552 offsets lost in Wordnet 2021 (0.17%)
7154/117659 canonical synset names changed in Wordnet 2021 (6.08%)
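
As a quick sanity check of the "over 99.8%" figure, the share of mappable offsets can be recomputed from the two counts above. This snippet only redoes the arithmetic and uses no data beyond the numbers already quoted.

```python
total = 117552       # OMW synset offsets checked
not_in_wn30 = 20     # OMW errors: offsets that are not in Wordnet 3.0
lost = 205           # offsets with no counterpart in Wordnet 2021

unmapped = not_in_wn30 + lost
mapped_pct = 100 * (total - unmapped) / total

print(f"{100 * not_in_wn30 / total:.2f}% not in Wordnet 3.0")
print(f"{100 * lost / total:.2f}% lost in Wordnet 2021")
print(f"{mapped_pct:.2f}% of offsets can be mapped")
```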

However, the Open English WordNet 2021 has a known issue with the ordering of senses (globalwordnet/english-wordnet#773), which produces a higher number of changes in NLTK's canonical synset names. These names are not expected to be stable across versions, but they would be more stable if Wordnet 2021 adhered to the sense ordering principles used in earlier Wordnet versions.

ekaf commented Nov 24, 2021

Example use:

import nltk
from nltk.corpus import wordnet31 as wn
print(f"Wordnet v. {wn.get_version()}\n")

Wordnet v. 3.1

lang = 'spa'
# Walk the hypernym closure of dog.n.01 and print the Spanish lemmas of each synset
for ss in wn.synset('dog.n.01').closure(lambda s: s.hypernyms()):
    translations = [lem.name() for lem in ss.lemmas(lang)]
    print(f"{ss}:{lang}:{translations}")

Synset('canine.n.02'):spa:['cánido']
Synset('domestic_animal.n.01'):spa:[]
Synset('carnivore.n.01'):spa:[]
Synset('animal.n.01'):spa:['animal', 'bestia', 'criatura', 'fauna']
Synset('placental.n.01'):spa:['euterios', 'placentarios']
Synset('organism.n.01'):spa:['organismo', 'ser', 'ser_vivo']
Synset('mammal.n.01'):spa:['mammalia']
Synset('living_thing.n.01'):spa:['organismo', 'ser', 'ser_vivo']
Synset('vertebrate.n.01'):spa:['vertebrado']
Synset('whole.n.02'):spa:['conjunto', 'unidad', 'unidad_completa']
Synset('chordate.n.01'):spa:['chordata']
Synset('object.n.01'):spa:['cosa', 'objeto', 'objeto_físico', 'objeto_inanimado']
Synset('physical_entity.n.01'):spa:['entidad_física']
Synset('entity.n.01'):spa:['entidad']

@stevenbird stevenbird merged commit f50b6b1 into nltk:develop Nov 29, 2021
stevenbird (Member) commented

@ekaf great contribution!
