Use Multilingual Wordnet Data from OMW with newer Wordnet versions #2889

ekaf · 2021-11-24T12:10:52Z

This PR complements #2860 by providing a compatibility mapping between any newer Wordnet version (starting with Wordnet 3.1) and the multilingual Wordnet data from nltk_data/corpora/omw, which uses synset offsets from Wordnet version 3.0.

Each Wordnet version uses different synset offsets as identifiers, whereas sense keys are permanent across versions. This PR therefore reads the index.sense file that is provided with each Wordnet package and automatically constructs a mapping from Wordnet 3.0 offsets to the offsets of any newer Wordnet package, so that the newer package can be used to access multilingual Wordnet synsets.
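The idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code merged in this PR: the function names `parse_index_sense` and `map_synsets` are hypothetical, the sample lines use made-up offsets, and choosing the most frequent target when the sense keys of one old synset point to several new synsets is an assumed tie-breaking policy. The sketch relies only on the index.sense layout, where each line starts with a sense key followed by a synset offset, and on the `ss_type` digit inside the sense key.

```python
from collections import Counter, defaultdict

# The ss_type digit inside a sense key encodes the part of speech.
SS_TYPE_TO_POS = {"1": "n", "2": "v", "3": "a", "4": "r", "5": "s"}


def parse_index_sense(lines):
    """Map each sense key to a (pos, offset) pair (hypothetical helper)."""
    key_to_synset = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        sense_key, offset = fields[0], fields[1]
        # A sense key looks like lemma%ss_type:lex_filenum:lex_id:head:head_id
        ss_type = sense_key.split("%", 1)[1].split(":", 1)[0]
        key_to_synset[sense_key] = (SS_TYPE_TO_POS[ss_type], offset)
    return key_to_synset


def map_synsets(old_lines, new_lines):
    """Map old (pos, offset) pairs to new ones through shared sense keys."""
    old = parse_index_sense(old_lines)
    new = parse_index_sense(new_lines)
    votes = defaultdict(Counter)
    for sense_key, old_id in old.items():
        if sense_key in new:
            votes[old_id][new[sense_key]] += 1
    # Assumed policy: keep the new synset that most shared sense keys point to.
    return {old_id: max(c, key=c.get) for old_id, c in votes.items()}


# Made-up sample data, not copied from any real index.sense file.
OLD_LINES = [
    "dog%1:05:00:: 00000100 1 42",
    "domestic_dog%1:05:00:: 00000100 1 0",
    "gone%1:05:00:: 00000200 1 0",
]
NEW_LINES = [
    "dog%1:05:00:: 00000150 1 42",
    "domestic_dog%1:05:00:: 00000150 1 0",
    "newword%1:05:00:: 00000300 1 0",
]

mapping = map_synsets(OLD_LINES, NEW_LINES)
print(mapping)
```

In this toy data the old synset 00000100 maps to 00000150, while 00000200 has no sense key left in the newer file and therefore gets no entry, which corresponds to an offset "lost" in the newer version.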

The advantage of this approach is that there is no need to wait for hypothetical mappings to be developed by third parties.
Instead, any Wordnet package that adheres to the official Princeton format is supported out of the box. For example, with wordnet31, the attached testlangs.py.txt produces the following statistics:

20/117564 offsets not in Wordnet 3.0 (0.02%)
126/117564 offsets lost in Wordnet 3.1 (0.11%)
328/117659 canonical synset names changed in Wordnet 3.1 (0.28%)

This PR has also been tested with the newer Wordnet versions discussed in #2885.

ekaf · 2021-11-24T13:44:57Z

The statistics with the new wordnet2021 package from nltk/nltk_data#170 show that over 99.8% of the OMW synset offsets can be mapped to Wordnet 2021: 0.02% are OMW errors, which are not in Wordnet 3.0 (omwn/omw-data#24), and 0.17% have been removed.

20/117552 offsets not in Wordnet 3.0 (0.02%)
205/117552 offsets lost in Wordnet 2021 (0.17%)
7154/117659 canonical synset names changed in Wordnet 2021 (6.08%)
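As a quick sanity check, the percentages above can be recomputed from the raw counts. This is a small arithmetic sketch, not the attached testlangs.py.txt script; the helper name `share` is hypothetical.

```python
def share(count, total):
    """Return count/total as a percentage rounded to two decimals."""
    return round(100 * count / total, 2)


total_offsets = 117552   # OMW offsets checked against Wordnet 2021
not_in_wn30 = 20         # OMW offsets that do not exist in Wordnet 3.0
lost_in_2021 = 205       # Wordnet 3.0 offsets with no counterpart in Wordnet 2021
mappable = total_offsets - not_in_wn30 - lost_in_2021

print(share(not_in_wn30, total_offsets))   # 0.02
print(share(lost_in_2021, total_offsets))  # 0.17
print(share(mappable, total_offsets))      # 99.81
print(share(7154, 117659))                 # 6.08
```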

However, the Open English WordNet 2021 has a known issue with the ordering of senses (globalwordnet/english-wordnet#773), which produces a higher number of changes in NLTK's canonical synset names. These names are not expected to be stable across versions, but the stability could be better if the new Wordnet 2021 adhered to the sense ordering principles used in earlier Wordnet versions.
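To see why sense ordering affects the names, here is a simplified model, assuming that a canonical name is built from a lemma, the part of speech, and the position of the synset in that lemma's ordered sense list. The function `canonical_name` and the integer synset identifiers are hypothetical, and the identifiers are kept identical in both versions to isolate the effect of reordering.

```python
def canonical_name(lemma, pos, synset_id, sense_order):
    """Build a name like 'dog.n.01' from the synset's rank among the lemma's senses."""
    sense_number = sense_order.index(synset_id) + 1
    return f"{lemma}.{pos}.{sense_number:02d}"


# Same two synsets, listed in a different order by two Wordnet versions.
old_order = [100, 200]
new_order = [200, 100]

old_name = canonical_name("dog", "n", 100, old_order)
new_name = canonical_name("dog", "n", 100, new_order)
print(old_name, new_name)  # dog.n.01 dog.n.02
```

The synset itself is unchanged, yet its name differs, so a reordering of senses alone is enough to be counted as a changed canonical name.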

ekaf · 2021-11-24T15:44:57Z

Example use:

import nltk
from nltk.corpus import wordnet31 as wn
print(f"Wordnet v. {wn.get_version()}\n")

Wordnet v. 3.1

lang = 'spa'
for ss in wn.synset('dog.n.01').closure(lambda ss: ss.hypernyms()):
    translations = [lem.name() for lem in ss.lemmas(lang)]
    print(f"{ss}:{lang}:{translations}")

Synset('canine.n.02'):spa:['cánido']
Synset('domestic_animal.n.01'):spa:[]
Synset('carnivore.n.01'):spa:[]
Synset('animal.n.01'):spa:['animal', 'bestia', 'criatura', 'fauna']
Synset('placental.n.01'):spa:['euterios', 'placentarios']
Synset('organism.n.01'):spa:['organismo', 'ser', 'ser_vivo']
Synset('mammal.n.01'):spa:['mammalia']
Synset('living_thing.n.01'):spa:['organismo', 'ser', 'ser_vivo']
Synset('vertebrate.n.01'):spa:['vertebrado']
Synset('whole.n.02'):spa:['conjunto', 'unidad', 'unidad_completa']
Synset('chordate.n.01'):spa:['chordata']
Synset('object.n.01'):spa:['cosa', 'objeto', 'objeto_físico', 'objeto_inanimado']
Synset('physical_entity.n.01'):spa:['entidad_física']
Synset('entity.n.01'):spa:['entidad']

nltk/corpus/reader/wordnet.py

stevenbird · 2021-11-29T08:07:51Z

@ekaf great contribution!

ekaf added 2 commits November 24, 2021 13:11

Map Wordnet 3.0 to newer Wordnets for OMW compatibility

b73678a

Use Multilingual Wordnets with Wordnet 3.1

ad204fb

ekaf mentioned this pull request Nov 24, 2021

Wordnet2021 nltk/nltk_data#170

Merged

Add support for Wordnet 2021

8f7abe7

tomaarsen added the wordnet label Nov 24, 2021

stevenbird reviewed Nov 26, 2021

nltk/corpus/reader/wordnet.py (outdated, resolved)

Use max instead of sorted

93e8934

stevenbird merged commit f50b6b1 into nltk:develop Nov 29, 2021

ekaf deleted the map_wordnets branch December 4, 2021 12:12

ekaf mentioned this pull request Apr 29, 2022

Fix synset_from_sense_key() (#2442) #2988

Merged

ekaf mentioned this pull request May 8, 2023

Merged synsets are lost in translation goodmami/wn#179

Open
