
Extract wordnet into a separate package #2423

Open
stevenbird opened this issue Oct 3, 2019 · 14 comments

@stevenbird
Member

@alvations has created https://github.com/nltk/wordnet.
We need to deprecate the existing NLTK wordnet corpus reader.

@alvations
Contributor

@fcbond any advice on this?

@goodmami
Contributor

goodmami commented Sep 4, 2020

I've been consulting with @alvations and @fcbond to work on an entirely new wordnet module: https://github.com/goodmami/wn. It's neither feature-complete nor documented yet (contributions are welcome), but the basics are working fairly well.

It better adheres to the Word–Sense–Synset structure used by wordnets instead of the Lemma–Synset simplified structure used by the NLTK module. As such, the API isn't exactly the same as the NLTK module or the spun-out wordnet module.
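To make the structural difference concrete, here is a minimal sketch in plain Python. These dataclasses are purely illustrative, not the real classes of either library: NLTK's two-level model ties a Lemma directly to a Synset, while Wn's three-level model inserts a Sense between the Word and the Synset, giving sense-level data (such as sense keys and sense relations) a natural home.

```python
from dataclasses import dataclass, field

# Illustrative sketch only -- not the real classes of either library.

@dataclass
class Synset:
    id: str

# NLTK-style: a Lemma ties a word form directly to a Synset.
@dataclass
class Lemma:
    name: str
    synset: Synset

# Wn-style: a Word has Senses, and each Sense links to one Synset.
@dataclass
class Sense:
    id: str
    synset: Synset

@dataclass
class Word:
    lemma: str
    senses: list = field(default_factory=list)

dog_synset = Synset("dog.n.01")
nltk_view = Lemma("dog", dog_synset)                    # two-level structure
wn_view = Word("dog", [Sense("dog-n-1", dog_synset)])   # three-level structure
print(wn_view.senses[0].synset.id)  # -> dog.n.01
```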

What is the plan regarding the NLTK? Will there no longer be an NLTK wordnet module? Or is the hope to replace it with a new one? If the latter, are there requirements regarding the API?

@goodmami
Contributor

goodmami commented Nov 2, 2021

Hi, returning to this with an update: the Wn library has been documented and reasonably feature-complete since the 0.8.0 release about four months ago, shortly after my last comment. Recently I've been working on increasing compatibility with the NLTK's wordnet, on two fronts:

  1. A (mostly trivial) Python shim module that implements the NLTK's API on top of Wn's
  2. Updating the OMW's data to fix some regressions in its export of the WordNet 3.0 data and so we can recreate the NLTK's synset identifiers (e.g., dog.n.01)
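As a rough sketch of what "recreating the NLTK's synset identifiers" means: an NLTK-style ID such as dog.n.01 is composed of a head lemma, a part-of-speech letter, and a two-digit index over the synsets sharing that lemma and POS. The helper below is hypothetical, just illustrating the naming scheme:

```python
# Hypothetical helper illustrating the shape of NLTK-style synset IDs
# (head lemma, POS letter, two-digit sense index), e.g. "dog.n.01".

def nltk_synset_ids(lemma, pos, synset_count):
    """Generate NLTK-style IDs for synsets sharing a head lemma and POS."""
    return [f"{lemma}.{pos}.{i:02d}" for i in range(1, synset_count + 1)]

print(nltk_synset_ids("dog", "n", 3))  # -> ['dog.n.01', 'dog.n.02', 'dog.n.03']
```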

The idea is that these two things would allow it to pass the NLTK's doctests. I never got a response to my questions above, so I'm not sure if the goal was to integrate the new system into the NLTK or to just change the documentation to refer to it. There was some more discussion at nltk/nltk_data#160.

Meanwhile it seems that @ekaf has been also doing good work (#2860, nltk/nltk_data#165) to incorporate WordNet 3.1 and to bring the WNDB export of the latest Open English Wordnet (OEWN) to the NLTK. These are good and welcome improvements, but render my earlier efforts somewhat unnecessary. In addition, the WordNet 3.1 and OEWN 2021 synset offsets are not compatible with the OMW's tab files, so users must continue to use WordNet 3.0 for them to work. @fcbond would know if it's difficult to re-export the OMW for these newer wordnets, although in the WN-LMF-formatted lexicons we have a version-independent solution: the interlingual index.
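The version-independence the interlingual index provides can be illustrated with a toy mapping. Synset offsets change between WordNet versions, but a stable ILI identifier lets two versions be joined; all offsets and ILI IDs below are invented for the example:

```python
# Toy illustration of joining two WordNet versions via the ILI.
# All offsets and ILI identifiers below are invented.

wn30_to_ili = {"02084071-n": "i12345"}   # WordNet 3.0 offset -> ILI
wn31_to_ili = {"02086723-n": "i12345"}   # WordNet 3.1 offset -> ILI

ili_to_wn31 = {ili: off for off, ili in wn31_to_ili.items()}

def map_offset(wn30_offset):
    """Map a WordNet 3.0 offset to its 3.1 counterpart via the ILI."""
    return ili_to_wn31.get(wn30_to_ili.get(wn30_offset))

print(map_offset("02084071-n"))  # -> 02086723-n
```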

So I'd like to know if I should stop working on NLTK-compatibility for the Wn library. Certainly @ekaf's solution is a less disruptive way to bring NLTK users up to speed, at least for English. If so, I think this issue can be closed.

@fcbond
Contributor

fcbond commented Nov 3, 2021 via email

@ekaf
Contributor

ekaf commented Nov 3, 2021

Thanks @goodmami for your kind words about my recent updates to the standard wordnet module. It is indeed in better shape now, although the lack of compatibility between OMW and newer Wordnet versions is a shortcoming.

Concerning future plans, a pluralism of approaches is always an advantage and, as @fcbond explains, the new Wn model is very helpful for supporting the LMF format, which NLTK's standard wordnet module doesn't. The combination of ILI and OEWN is certainly promising, and I also look forward to seeing the OMW catching up with the recent Wordnet developments.

On the other hand, OEWN releases still tend to have many more bugs than Princeton WordNet, so people might still want a choice for some time. The standard NLTK wordnet module has much functionality and a large user base, and it remains useful until something better comes along.

@goodmami
Contributor

goodmami commented Nov 3, 2021

We could try to (a) replace the existing module, or (b) give people a choice of both? In terms of long term ease of maintenance, I favor (a).

We could follow what the Python documentation does in similar situations. For instance, urllib.request simply recommends that people use the 3rd-party Requests package, and the re module recommends the 3rd-party regex package, but the standard library modules are still there. Some reasons for not incorporating these into Python's standard library: Python's slower release cycle, the maintenance burden, and little gain since they are readily available on PyPI. On the other hand, adding them to the standard library would make them visible and give them status as the standard way to do things.

So for Wn and NLTK, maybe the docs could point to Wn for the duration of NLTK's 3.* major version, then for 4.0 we can decide if it's a good idea to make the switch? Waiting for 4.0 wouldn't be necessary if we could guarantee a compatible API, however (or does NLTK not use semantic versioning?).

@ekaf
Contributor

ekaf commented Nov 4, 2021

Concerning the maintainability claim, an important factor is the size of the codebase: the existing wordnet module is 2,301 lines, while the new Wn module is more than triple that size at 7,242 lines, while still not feature-complete. Also, who will maintain it? The standard module has had many contributors, while future Wn contributions appear less certain.

@fcbond
Contributor

fcbond commented Nov 4, 2021 via email

@ekaf
Contributor

ekaf commented Nov 4, 2021

Thanks @fcbond. I don't know enough about Wn's features, but I noticed that it doesn't seem to have a "sense key" notion, although sensekeys are the only reliable identifier across Wordnet versions. So I would miss functions like synset_from_sense_key() and lemma_from_key(), which are available in the standard module.
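For readers unfamiliar with sense keys: a key such as dog%1:05:00:: encodes, per the WordNet sense-index documentation, lemma%ss_type:lex_filenum:lex_id:head_word:head_id, where ss_type 1-5 maps to noun, verb, adjective, adverb, and adjective satellite. A minimal parser sketch of that format:

```python
# Minimal sketch of parsing a WordNet sense key of the form
# lemma%ss_type:lex_filenum:lex_id:head_word:head_id, e.g. "dog%1:05:00::".

SS_TYPES = {"1": "n", "2": "v", "3": "a", "4": "r", "5": "s"}

def parse_sense_key(key):
    lemma, rest = key.split("%", 1)
    ss_type, lex_filenum, lex_id, head, head_id = rest.split(":")
    return {
        "lemma": lemma,
        "pos": SS_TYPES[ss_type],
        "lex_filenum": lex_filenum,
        "lex_id": lex_id,
        "head": head,        # only filled for adjective satellites
        "head_id": head_id,
    }

print(parse_sense_key("dog%1:05:00::")["pos"])  # -> n
```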

I totally agree that support for LMF is crucial, especially since it is an ISO standard, and many wordnets in other languages are adopting it, although not all have yet. But if, as http://compling.hss.ntu.edu.sg/omw/ states, "Wordnet-LMF format files are made by combining the tab files with the Princeton wordnet.", this seems to be just old wine in new bottles, without actually adding anything.

So, before it becomes relevant to replace the NLTK module, it would be nice to see a release of OMW that actually brings all the promised new features into play. Until all this is released and tested, discussions seem highly speculative.

@tomaarsen
Member

I would be in favor of moving towards deprecation of the NLTK wordnet module, if Wn is sufficiently feature-complete to replace core NLTK Wordnet functionality. With some time, and the assistance of the Wn documentation, we might be able to provide detailed deprecation warnings at the function/method level. This way, we can help users move to Wn as quickly and smoothly as possible, and it also allows us to deprecate the NLTK wordnet module more quickly.

The NLTK wordnet module would still be available upon deprecation for some time, but maintenance on it could be stopped. Similarly to the Stanford Tools, e.g. StanfordTagger (see #2812).

With the deprecated decorator we can provide these per-function deprecation warnings. Concretely, this might look like:
(In nltk/corpus/reader/wordnet.py)

    ...

    #############################################################
    # Retrieve synsets and lemmas.
    #############################################################

    @deprecated('''\
    The NLTK Wordnet module is deprecated.
    It is recommended to use Wn to replace this method.
    For example:
        >>> import wn
        >>> pwn = wn.Wordnet("pwn", "3.0")
        >>> ss = pwn.synsets("chat", pos="v")
    See https://wn.readthedocs.io/en/latest/guides/nltk-migration.html
    for more information.''')
    def synsets(self, lemma, pos=None, lang="eng", check_exceptions=True):
        """Load all synsets with a given lemma and part of speech tag.
        If no pos is specified, all synsets for all parts of speech
        will be loaded.
        If lang is specified, all the synsets associated with the lemma name
        of that language will be returned.
        """
        lemma = lemma.lower()

        ...

Then, when a user executes a file with:

from nltk.corpus import wordnet as wn

...

wn.synsets("chat", pos="v")

The output becomes:

nltk_2423.py:2: DeprecationWarning: Function synsets() has been deprecated.
    The NLTK Wordnet module is deprecated.
    It is recommended to use Wn to replace this method.
    For example:
        >>> import wn
        >>> pwn = wn.Wordnet("pwn", "3.0")
        >>> ss = pwn.synsets("chat", pos="v")
    See https://wn.readthedocs.io/en/latest/guides/nltk-migration.html
    for more information.
  wn.synsets("chat", pos="v")

Note that this does not require a 1:1 mapping of functions between NLTK and Wn. If Wn has a different workflow to create the same output, then that is also fine, and can be specified both in the deprecation comment and the documentation. Long story short - I'm trying to avoid simply warning users with "The NLTK Wordnet module is deprecated. Use Wn instead.", as I don't think that will result in proper adoption of Wn by NLTK users.

If we indeed want to go this route, then I would like to perform some small changes to the deprecated decorator:

  • Expose a flag on the decorator which allows the indentation and newlines of the deprecation message to be preserved. (Currently it's always normalized; I removed that line for the above output.)
  • Allow f-string formatting on variable names to be applied on the deprecation message. In short, a method that has self, lemma, pos=None, lang="eng", check_exceptions=True as the signature can then use a deprecation message like:
    @deprecated('''\
    The NLTK Wordnet module is deprecated.
    It is recommended to use Wn to replace this method.
    For example:
        >>> import wn
        >>> pwn = wn.Wordnet("pwn", "3.0")
        >>> ss = pwn.synsets({lemma}, pos={pos if pos else "v"}, lang={lang if lang else "eng"})
    See https://wn.readthedocs.io/en/latest/guides/nltk-migration.html
    for more information.''')
  • A global and simple way to hide these warnings.

These deprecation changes would likely be easy to implement, so that is not a concern.
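As a rough sketch of the proposed behaviour (this is not NLTK's actual `deprecated` implementation, just a minimal stand-in): bind the decorated function's arguments via `inspect.signature` and format them into the message, keeping the message text verbatim.

```python
import functools
import inspect
import warnings

# Sketch of the proposed decorator behaviour, not NLTK's actual
# `deprecated`: format the message with the call's bound arguments.

def deprecated(message):
    def decorator(func):
        sig = inspect.signature(func)
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            bound = sig.bind(*args, **kwargs)
            bound.apply_defaults()
            warnings.warn(message.format(**bound.arguments),
                          DeprecationWarning, stacklevel=2)
            return func(*args, **kwargs)
        return wrapper
    return decorator

@deprecated("synsets() is deprecated; try wn.synsets({lemma!r}, pos={pos!r})")
def synsets(lemma, pos=None):
    return []

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    synsets("chat", pos="v")
print(str(caught[0].message))
# -> synsets() is deprecated; try wn.synsets('chat', pos='v')
```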

  • Tom Aarsen

@ekaf
Contributor

ekaf commented Nov 4, 2021

globalwordnet/cili#9 (comment) mentions "a backlog of several years" in assigning the envisioned ILI identifiers. This leaves ample time to prepare the deprecation warnings.

@goodmami
Contributor

goodmami commented Nov 4, 2021

it doesn't seem to have a "sense key" notion, although sensekeys are the only reliable identifier across Wordnet versions

This is because sense keys are not a core part of the WN-LMF format. In EWN-2019 and EWN-2020, these were conventionally encoded in the dc:identifier attribute on <Sense> elements, and in OEWN-2021 they are simply the id of those elements, transformed slightly for xsd:id compatibility. The WN-LMF format allows for ili IDs for synsets, which are stable across versions and even across languages, but there is no equivalent mechanism for senses.

In the upcoming OMW 1.4 release we will provide an export of WordNet 3.0 and 3.1 that is more faithful to the source data than the currently distributed one. It will have sense keys, in their original form, as dc:identifier attributes, and it will also encode NLTK-style synset IDs (e.g., entity.n.01) using dc:identifier on <Synset> elements. This will allow the NLTK-compatibility shim to use the synset_from_sense_key() and lemma_from_key() functions, as well as the synset() function, but I still need to test this.
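Reading those dc:identifier attributes from a WN-LMF file is straightforward with the standard library; the fragment and ILI value below are invented for illustration, but the Dublin Core namespace and the attribute placement follow the convention described above:

```python
import xml.etree.ElementTree as ET

# Illustration of reading dc:identifier attributes from WN-LMF
# <Synset> elements. The fragment and ILI value are invented.

DC = "http://purl.org/dc/elements/1.1/"
fragment = """
<LexicalResource xmlns:dc="http://purl.org/dc/elements/1.1/">
  <Lexicon id="omw-en">
    <Synset id="omw-en-00001740-n" ili="i12345" dc:identifier="entity.n.01"/>
  </Lexicon>
</LexicalResource>
"""

root = ET.fromstring(fragment)
ids = {s.get("id"): s.get(f"{{{DC}}}identifier")
       for s in root.iter("Synset")}
print(ids)  # -> {'omw-en-00001740-n': 'entity.n.01'}
```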

"a backlog of several years" in attributing the envisioned ILI identifiers

That backlog is for new ILIs introduced since the fork from the Princeton WordNet, and currently all the synsets in the OMW wordnets are present in WordNet 3.0, so I don't think that is relevant here. However I'd agree that there's no immediate rush to switch, except for those NLTK users looking for additional features. That is, for those with greater needs, there will be a migration path laid out, while others can continue to use the old module for a while.

@tomaarsen to that point I think the deprecation wording may be a bit strong. Maybe "will be deprecated in version X.Y.Z" instead of "is deprecated", and keep it that way for a version or two so users aren't caught off-guard. Your thoughts on changes to the deprecation message formatting sound good to me.

@tomaarsen
Member

to that point I think the deprecation wording may be a bit strong. Maybe "will be deprecated in version X.Y.Z" instead of "is deprecated", and keep it that way for a version or two so users aren't caught off-guard. Your thoughts on changes to the deprecation message formatting sound good to me.

The phrasing of the deprecation messages isn't an issue. My example was very much just an example. My main point was to advocate in favour of clear deprecation messages with code snippets. The details of the phrasing can be discussed if we decide to move forward 👍

@goodmami
Contributor

An update: I've pushed the current state of my NLTK-compatibility shim to Wn in the nltk branch. Here is the relevant file: https://github.com/goodmami/wn/blob/nltk/wn/nltk_api.py

It uses Wn under the hood, but the functions and even class representations mimic the NLTK's. You'll need OMW 1.4:

$ python -m wn download omw:1.4

And here is what it looks like:

>>> from wn import nltk_api as wn
>>> wn.synsets('dog')
[Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01'), Synset('chase.v.01')]
>>> wn.synsets('犬', lang='jpn')
[Synset('dog.n.01'), Synset('spy.n.01')]
>>> wn.lemmas('犬', lang='jpn')
[Lemma('dog.n.01.犬'), Lemma('spy.n.01.犬')]
>>> wn.synsets('犬', lang='jpn')[0].hypernyms()  # works but returns Wn-native synsets
[Synset('omw-en-02083346-n'), Synset('omw-en-01317541-n')]
>>> dog = wn.synsets('dog')[0]
>>> cat = wn.synsets('cat')[0]
>>> wn.path_similarity(dog, cat)
0.2
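For reference, the 0.2 above is what WordNet's path similarity definition, 1 / (shortest-path distance + 1), gives when two synsets lie four edges apart. Assuming the usual WordNet 3.0 hypernym chain (dog → canine → carnivore ← feline ← cat), a toy reimplementation over that slice of the graph reproduces it:

```python
from collections import deque

# Toy path similarity: 1 / (shortest undirected hypernym-path distance + 1),
# over a small hand-built slice of the noun hierarchy.

edges = {
    "dog.n.01": ["canine.n.02"],
    "canine.n.02": ["carnivore.n.01"],
    "cat.n.01": ["feline.n.01"],
    "feline.n.01": ["carnivore.n.01"],
    "carnivore.n.01": [],
}

def path_similarity(a, b):
    # Build the undirected graph, then BFS for the shortest distance.
    graph = {n: set(ns) for n, ns in edges.items()}
    for n, ns in edges.items():
        for m in ns:
            graph.setdefault(m, set()).add(n)
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return 1 / (dist + 1)
        for nxt in graph[node] - seen:
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return None

print(path_similarity("dog.n.01", "cat.n.01"))  # -> 0.2
```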

There are still a number of things unimplemented or partially implemented. If anyone is interested in finishing up the module, I'd be very happy to receive PRs.
