Merged synsets are lost in translation #179

Open
ekaf opened this issue Nov 14, 2022 · 8 comments
Labels
bug Something isn't working

Comments

@ekaf

ekaf commented Nov 14, 2022

Describe the bug

Wn loses some merged synsets in translation, even though the original CILI mappings correctly link the merged source synsets to the same target synset.

To Reproduce

For example, consider these two synsets in the ili-map-pwn31.tab mapping, which map to the same PWN 3.1 target:

i37881 00472688-n
i37882 00472688-n
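
Merges of this kind can be found mechanically by grouping the ILIs in a mapping file by their target synset. The following is only a minimal sketch, assuming whitespace-separated lines with the ILI in the first column and the synset offset in the second, as in the two lines above. `find_merged` is a hypothetical helper and not part of Wn, and the third sample line is invented purely for contrast:

```python
from collections import defaultdict


def find_merged(lines):
    """Group ILIs by target synset and keep targets reached by several ILIs."""
    by_target = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        ili, synset = parts
        by_target[synset].append(ili)
    return {ss: ilis for ss, ilis in by_target.items() if len(ilis) > 1}


sample = [
    "i37881\t00472688-n",
    "i37882\t00472688-n",
    "i00000\t00000000-n",  # invented line, for contrast only
]
merged = find_merged(sample)
print(merged)
```

To run it on the real file, pass an open file handle instead of `sample`.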

With Wn, the first synset (i37881) has no translation in OEWN, although it should have one if i37881 were mapped to i37882:

import wn
wnfi = wn.Wordnet("omw-fi")
ss1 = wnfi.synsets(ili="i37881")[0]
print(f"{ss1.ili.id}, {ss1.senses()}, {ss1.translate('oewn')}")

i37881, [Sense('omw-fi-baseball-00471613-n'), Sense('omw-fi-baseball--peli-00471613-n')], []
So the translation above is just the empty list ([]).

By contrast, the other merged synset translates correctly:

ss2 = wnfi.synsets(ili="i37882")[0]
print(f"{ss2.ili.id}, {ss2.senses()}, {ss2.translate('oewn')}")

i37882, [Sense('omw-fi-baseball-00474568-n')], [Synset('oewn-00472688-n')]

The same problem occurs with any other merged synsets.

Expected behavior

The first synset (i37881) should have a translation in OEWN, as it would if the CILI mapping were used as intended.

Environment

python --version
python -m wn --version
python -m wn lexicons

Python 3.9.2
Wn 0.9.2
oewn 2021 [en] Open English WordNet
omw-en 1.4 [en] OMW English Wordnet based on WordNet 3.0
omw-cmn 1.4 [cmn-Hans] Chinese Open Wordnet
omw-es 1.4 [es] Multilingual Central Repository (Spanish)
omw-lt 1.4 [lt] Lithuanian WordNet
omw-pt 1.4 [pt] OpenWN-PT
omw-id 1.4 [id] Wordnet Bahasa (Indonesian)
omw-he 1.4 [he] Hebrew Wordnet
omw-eu 1.4 [eu] Multilingual Central Repository (Basque)
omw-sq 1.4 [sq] Albanet
omw-zsm 1.4 [zsm] Wordnet Bahasa (Malaysian)
omw-arb 1.4 [arb] Arabic WordNet (AWN v2)
omw-ca 1.4 [ca] Multilingual Central Repository (Catalan)
omw-fi 1.4 [fi] FinnWordNet
omw-sv 1.4 [sv] WordNet-SALDO
omw-gl 1.4 [gl] Multilingual Central Repository (Galician)
omw-el 1.4 [el] Greek Wordnet
omw-pl 1.4 [pl] plWordNet
omw-iwn 1.4 [it] ItalWordNet
omw-ro 1.4 [ro] Romanian Wordnet
omw-nl 1.4 [nl] Open Dutch WordNet
omw-ja 1.4 [ja] Japanese Wordnet
omw-fr 1.4 [fr] WOLF (Wordnet Libre du Français)
omw-sk 1.4 [sk] Slovak WordNet
omw-is 1.4 [is] IceWordNet
omw-it 1.4 [it] MultiWordNet (Italian)
omw-hr 1.4 [hr] Croatian Wordnet
omw-th 1.4 [th] Thai Wordnet
omw-bg 1.4 [bg] BulTreeBank Wordnet (BTB-WN)
omw-nb 1.4 [nb] Norwegian Wordnet (Bokmål)
omw-da 1.4 [da] DanNet
omw-nn 1.4 [nn] Norwegian Wordnet (Nynorsk)
omw-sl 1.4 [sl] sloWNet

Additional Context

At this moment, using the PWN sense keys for translation seems to be the only way to bypass the problem in Wn. However, this is not easy, since a rather big detour is necessary to obtain the sense keys in the 'oewn' lexicon.

@ekaf ekaf added the bug Something isn't working label Nov 14, 2022
@goodmami
Owner

Thanks, I think I see the problem, but let me make sure I got it right: there is a gap in ILI-based translation coverage when the target synset (and thus its ILI) has been merged into another. In this case, PWN 3.0 (and OMW lexicons expanded from it) have two synsets, but in PWN 3.1 and OEWN they are merged into a single synset.

Due to the way Wn applies the ILI mappings

There seems to be a mistaken assumption here. Wn does not use the ILI mappings that you are referring to. The only resource from https://github.com/globalwordnet/cili/ that it uses (and only if you've downloaded it) is the released CILI inventory which includes the ILI identifiers and definitions. Inter-lexicon relationships via shared ILIs are identified only by the ili attribute on <Synset> elements in WN-LMF lexicons. This attribute's value is limited to a single ILI, so there is a technical limitation that we cannot map multiple ILIs to a synset. This also follows the theoretical constraint that ILIs should be mapped to no more than one synset, and vice versa, within a lexicon.

Therefore, I disagree that there is something here incorrect in Wn, but I do recognize how things could be improved. A satisfactory solution to this issue is thus not so much a bug fix as a new feature: to store (or identify) and subsequently use changes to synset-ILI mappings across versions. This sounds appealing but I also feel like it will be hard to do correctly in a transparent fashion (e.g., when calling Synset.translate()) rather than as a discrete mapping step across lexicons. For instance, what if you translate in the other direction where the single ILI is "split" into two? Or if the translation is between two other lexicons with different changes in mappings?

At this moment, using the PWN sense keys for translation seems to be the only way to bypass the problem in Wn.

You mean to look for senses with the same sense keys across lexicons? That might work to build the merge-mapping yourself, but it wouldn't be a solution in general because senses link synsets to words and therefore non-English lexicons should have different sense keys (but more likely they do not have them at all).

Here's how you could build the mapping:

>>> import wn
>>> en30 = wn.Wordnet('omw-en')
>>> en31 = wn.Wordnet('omw-en31')
>>> en31_sensekey_ili_map = {
...     s.metadata()['identifier']: s.synset().ili
...     for s in en31.senses()
... }
>>> en30_31_ilis = {ss.ili.id: set() for ss in en30.synsets()}
>>> for s in en30.senses():
...     ili = en31_sensekey_ili_map.get(s.metadata()['identifier'])
...     if ili:
...         en30_31_ilis[s.synset().ili.id].add(ili.id)
... 
>>> en30_31_ilis['i37881']
{'i37882'}
>>> en30_31_ilis['i37882']
{'i37882'}

This mapping is unidirectional, PWN 3.0 to PWN 3.1, but maybe it is useful nonetheless.

@ekaf
Author

ekaf commented Nov 15, 2022

Thanks @goodmami, I have corrected the formulation, since I don't want to imply that something is wrong with Wn. On the other hand, there is a problem in Wn, due to the way that the CILI mappings are applied, but I realize that this happens in OMW-data, when building the LMF databases.
I want to look more into this, and am missing a way to look up the CILI mappings from within Wn. The CILI project is installed, but I have not yet found out how to load and query it.

@goodmami
Owner

@ekaf I think the CILI as a resource is better thought of as the inventory of identifiers and their definitions than as a collection of mappings. The mappings to synsets should be maintained by the respective wordnet projects, although in practice we keep some mapping files in the CILI repository. Those mapping files are used when creating the WN-LMF exports of the PWN.

Let me try to describe ILI support in Wn. WN-LMF lexicons can link synsets to individual ILIs like this (example from OEWN 2021):

    <Synset id="oewn-15307914-n" ili="i117563" members="oewn-speed-n oewn-velocity-n" partOfSpeech="n" dc:subject="noun.time">
                                 ~~~~~~~~~~~~~

These ILIs are stored in Wn's database linked to the synsets. When a second lexicon is loaded containing synsets with the same ILIs, such as this (from OMW 1.4's Spanish wordnet):

    <Synset id="omw-es-15282696-n" ili="i117563" partOfSpeech="n" members="omw-es-velocidad-15282696-n" />
                                   ~~~~~~~~~~~~~

... then Wn is able to use the shared ILI to link the synsets across lexicons for translation or expanded relation traversal. Another thing we see is synsets with the special ILI value "in", which indicates that that version of the lexicon is proposing the synset as a candidate for a new ILI. For example:

    <Synset id="oewn-90002921-n" ili="in" members="oewn-snow_day-n" partOfSpeech="n" dc:subject="noun.time" dc:source="Colloquial WordNet">
                                 ~~~~~~~~

These proposed ILIs are not used for translation or expanded relation traversals. In Wn, the ILIs are represented by a class with an id, a status, and a definition. For example (here, the cili project has not been loaded in Wn):

>>> import wn
>>> oewn = wn.Wordnet('oewn')
>>> oewn.synsets('velocity')[0].ili.id  # an explicit ID
'i117563'
>>> oewn.synsets('velocity')[0].ili.status
'presupposed'
>>> oewn.synsets('velocity')[0].ili.definition()
>>> oewn.synsets('snow day')[0].ili.id  # ili="in" is special and the ID is None in Wn
>>> oewn.synsets('snow day')[0].ili.status
'proposed'
>>> oewn.synsets('snow day')[0].ili.definition()
'a day on which school or other events are cancelled due to snow'

Note:

  • The status presupposed means that the synset has an explicit ILI but there is no authoritative source to say whether the ILI is valid or not. The status proposed means that the lexicon used the special ILI value "in".
  • Explicit ILIs do not have ILI definitions in the lexicon, but proposed ILIs do. Note that ILI definitions are separate from synset definitions.

When the cili resource has been loaded, the presupposed statuses can change and their definitions become available:

>>> wn.download('cili')
...
>>> oewn.synsets('velocity')[0].ili.status
'active'
>>> oewn.synsets('velocity')[0].ili.definition()
'distance travelled per unit time'

The cili resource that is added here contains only a list of ILIs and their definitions (and maybe statuses in a future version: globalwordnet/cili#8), and does not contain any mappings to PWN 3.0 or 3.1 synsets.
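
The shared-ILI linking described above can be illustrated outside of Wn with a few lines of standard-library Python. This is only a sketch of the idea and not how Wn stores ILIs internally (Wn keeps them in its database). The XML is trimmed down to the attributes that matter here, and `index_by_ili` is a hypothetical helper:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Trimmed-down lexicon fragments; only id, ili and partOfSpeech are kept.
OEWN = """<Lexicon id="oewn">
  <Synset id="oewn-15307914-n" ili="i117563" partOfSpeech="n" />
  <Synset id="oewn-90002921-n" ili="in" partOfSpeech="n" />
</Lexicon>"""

OMW_ES = """<Lexicon id="omw-es">
  <Synset id="omw-es-15282696-n" ili="i117563" partOfSpeech="n" />
</Lexicon>"""


def index_by_ili(*documents):
    """Map each explicit ILI to the synset IDs that carry it, across lexicons."""
    index = defaultdict(list)
    for doc in documents:
        for synset in ET.fromstring(doc).iter("Synset"):
            ili = synset.get("ili")
            # ili="in" only proposes a new ILI, so it cannot link synsets
            if ili and ili != "in":
                index[ili].append(synset.get("id"))
    return dict(index)


links = index_by_ili(OEWN, OMW_ES)
print(links)
```

Because each synset carries at most one ili value, two source synsets that were merged in a later version cannot both point at the surviving synset through this attribute alone.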

Does that help?

@ekaf
Author

ekaf commented Nov 16, 2022

Thanks @goodmami, yes your explanations help a lot indeed.
Concerning my specific problem, i.e. obtaining translations for synsets that would have one according to the CILI, but had none when querying Wn, the code you provided for mapping from en-30 ilis to en-31 ilis indeed solves the problem for en-31. With oewn, a detour is necessary, since it has sensekeys encoded as sense.id, but it works equally well:


import wn

#---------------------------------------------------------------------
# adapted from english-wordnet/scripts/wordnet_yaml.py, by @jmccrae:

def unmap_sense_key(sk):
    e = sk.split("__")
    l = e[0][5:]
    r = "__".join(e[1:])
    return (l.replace("-ap-", "'").replace("-sl-", "/").replace("-ex-", "!").replace("-cm-",",").replace("-cl-",":") +
        "%" + r.replace(".", ":").replace("-sp-","_"))

#---------------------------------------------------------------------

def sense2key(sense, wnid="omw-en"):
    if wnid == 'oewn':
        return unmap_sense_key(sense.id)
    else:
        return sense.metadata()['identifier']

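def map30(target):
    # Map each PWN 3.0 ILI to the ILI of the target synset that shares a sense key.
    wnet = wn.Wordnet(target)
    wnid = wnet.lexicons()[0].id
    # sense key -> ILI, for every sense in the target lexicon
    sk_ili = {sense2key(se, wnid): se.synset().ili for se in wnet.senses()}
    ilimap30 = {}
    for se in wn.Wordnet("omw-en").senses():
        ili = sk_ili.get(se.metadata()['identifier'])
        # skip senses missing from the target and synsets that only propose a new ILI
        if ili and ili.status != "proposed":
            ilimap30[se.synset().ili.id] = ili.id
    return ilimap30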

#---------------------------------------------------------------------

#target = "omw-en31"
target = "oewn"

ilimap = map30(target)

i1 = "i37881"
i2 = "i37882"

print(ilimap[i1])

i37882

print(ilimap[i2])

i37882

wnfi = wn.Wordnet("omw-fi")
wn2 = wn.Wordnet(target)

ss1 = wnfi.synsets(ili = i1)[0]
print(f"{ss1.ili.id}, {ss1.senses()},\n\
 {ss1.translate(target)}, {wn2.synsets(ili=ilimap[i1])[0]}")

Now, the mapping can provide a translation for this Finnish synset, which has none using Wn's translate() function.

i37881, [Sense('omw-fi-baseball-00471613-n'), Sense('omw-fi-baseball--peli-00471613-n')],
[], Synset('oewn-00472688-n')

So in Wn at present, we have to go through sense-key mappings in order to avoid this problem. I suppose there could be a more direct way to use the CILI mappings, without necessarily losing synsets in the translation, since CILI contains information about the merged synsets. But even then, it remains to be seen whether ILI mappings can match the performance of sense-key mappings.
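
The workaround can be wrapped so that the regular ILI-based translation is tried first and the sense-key mapping is consulted only when it returns nothing. This is a rough sketch under the assumptions of the code above: `ilimap` is a dict as returned by `map30()`, `target_wordnet` is the `wn.Wordnet` object for the target lexicon, and `translate_with_fallback` is a hypothetical name, not a Wn function:

```python
def translate_with_fallback(synset, target_id, target_wordnet, ilimap):
    """Translate via the shared ILI; if that finds nothing, retry via ilimap."""
    direct = synset.translate(target_id)
    if direct:
        return direct
    ili = synset.ili
    if ili is None or ili.id is None:
        return []  # no explicit ILI to look up
    mapped = ilimap.get(ili.id)
    if mapped is None:
        return []  # the mapping has no entry for this ILI either
    return target_wordnet.synsets(ili=mapped)
```

With the objects from the example above, `translate_with_fallback(ss1, target, wn2, ilimap)` would be the call to try.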

@ekaf
Author

ekaf commented Nov 17, 2022

As @goodmami wrote:

what if you translate in the other direction where the single ILI is "split" into two?

Yes, the inverse problem is that currently, when translating in the opposite direction, Wn only returns one of the merged synsets:

i2 = "i37882"
print(wn.Wordnet("oewn").synsets(ili = i2)[0].translate("omw-fi"))

[Synset('omw-fi-00474568-n')]

In that case, the complete translation would be the union of the senses belonging to all the synsets obtained by reversing the ilimap from above:

def rev_dict(dic):
    # invert a key -> value dict into value -> {keys}
    rdic = {}
    for key, val in dic.items():
        rdic.setdefault(val, set()).add(key)
    return rdic

sources = rev_dict(ilimap)[i2]

print(f"{sources} --> {i2}")

{'i37881', 'i37882'} --> i37882

print([wn.Wordnet("omw-fi").synsets(ili = i)[0].senses() for i in sources])

[[Sense('omw-fi-baseball-00471613-n'), Sense('omw-fi-baseball--peli-00471613-n')], [Sense('omw-fi-baseball-00474568-n')]]
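
The nested list above can be flattened into a single list of senses. A small sketch, assuming `sources_by_target` is the result of `rev_dict(ilimap)` and `wordnet` is the `wn.Wordnet` object being translated into; `senses_for_merged` is a hypothetical helper name:

```python
def senses_for_merged(ili_id, sources_by_target, wordnet):
    """Collect the senses of every synset whose ILI was merged into ili_id."""
    senses = []
    # fall back to the ILI itself when the mapping records no sources for it
    sources = sources_by_target.get(ili_id, {ili_id})
    for source_ili in sorted(sources):  # sorted only for a stable order
        for synset in wordnet.synsets(ili=source_ili):
            senses.extend(synset.senses())
    return senses
```

For the example above, the call would be `senses_for_merged(i2, rev_dict(ilimap), wn.Wordnet("omw-fi"))`.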

@goodmami
Owner

goodmami commented May 7, 2023

@ekaf Can you remind me what is the expected fix here? Currently I'm leaning toward saying this is a data challenge (best solved with documentation) and not a bug or missing feature in the code, but maybe you have something in mind that would be appropriate for this library.

@ekaf
Author

ekaf commented May 8, 2023

There are relatively few (around 30) merged synsets between each English Wordnet version, so losing 30 synsets in translation may not seem a huge problem. However, it is not solved with documentation alone, and a solution in the library appears more helpful.

Since version 3.6.6 (see nltk/#2889), NLTK's wordnet.py library produces a sense-key based mapping "on the fly", at load time, preventing this problem from ever occurring. A similar approach can work in Wn, using code like in the comment above.

An alternative could be if the ILI project also produces lists of merged synsets, with one (or more) synset(s) deprecated and linked to a target synset. This approach is less versatile, because each future English Wordnet needs a separate list of deprecations: you would have to wait for such lists to be produced, then rely on their adequacy, and still need additional code to interpret the deprecations in Wn.

@goodmami
Owner

@ekaf thank you for explaining. I'm not entirely sold on this solution because it encodes lexicon-specific information (the sense keys and where they are stored), which are really only relevant for the English wordnets, and I strive as much as possible for Wn to not favor any particular wordnet or language (with the exception of the included Morphy lemmatizer).

That said, so many wordnets are based on the English structure that it might make sense for practicality to beat purity here. The ILI solution would be more "pure", but, as you describe, that approach has other issues.

@fcbond, I'd like to get your perspective. Should Wn codify English-specific workarounds for merged synsets across wordnet versions? Or maybe the problem is rare enough that some documentation of the problem with a recipe for getting around it would suffice?
