
Extract wordnet into a separate package #2423

Open
stevenbird opened this issue Oct 3, 2019 · 14 comments

@stevenbird
Member

@alvations has created https://github.com/nltk/wordnet.
We need to deprecate the existing NLTK wordnet corpus reader.

@alvations
Contributor

@fcbond any advice on this?

@goodmami
Contributor

goodmami commented Sep 4, 2020

I've been consulting with @alvations and @fcbond to work on an entirely new wordnet module: https://github.com/goodmami/wn. It's neither feature-complete nor documented yet (contributions are welcome), but the basics are working fairly well.

It better adheres to the Word–Sense–Synset structure used by wordnets instead of the Lemma–Synset simplified structure used by the NLTK module. As such, the API isn't exactly the same as the NLTK module or the spun-out wordnet module.
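To make the structural difference concrete, here is a minimal sketch in plain Python. These dataclasses are purely illustrative, not the real classes of either library: NLTK's two-level model ties a Lemma directly to a Synset, while Wn's three-level model inserts a Sense between the Word and the Synset, giving sense-level data (such as sense keys and sense relations) a natural home.

```python
from dataclasses import dataclass, field

# Illustrative sketch only -- not the real classes of either library.

@dataclass
class Synset:
    id: str

# NLTK-style: a Lemma ties a word form directly to a Synset.
@dataclass
class Lemma:
    name: str
    synset: Synset

# Wn-style: a Word has Senses, and each Sense links to one Synset.
@dataclass
class Sense:
    id: str
    synset: Synset

@dataclass
class Word:
    lemma: str
    senses: list = field(default_factory=list)

dog_synset = Synset("dog.n.01")
nltk_view = Lemma("dog", dog_synset)                    # two-level structure
wn_view = Word("dog", [Sense("dog-n-1", dog_synset)])   # three-level structure
print(wn_view.senses[0].synset.id)  # -> dog.n.01
```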

What is the plan regarding the NLTK? Will there no longer be an NLTK wordnet module? Or is the hope to replace it with a new one? If the latter, are there requirements regarding the API?

@goodmami
Contributor

goodmami commented Nov 2, 2021

Hi, returning to this with an update: the Wn library has been documented and reasonably feature-complete since the 0.8.0 release about four months ago, shortly after my last comment. Recently I've been working on increasing compatibility with the NLTK's wordnet, on two fronts:

  1. A (mostly trivial) Python shim module that implements the NLTK's API on top of Wn's
  2. Updating the OMW's data to fix some regressions in its export of the WordNet 3.0 data and so we can recreate the NLTK's synset identifiers (e.g., dog.n.01)
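As a rough sketch of what "recreating the NLTK's synset identifiers" means: an NLTK-style ID such as dog.n.01 is composed of a head lemma, a part-of-speech letter, and a two-digit index over the synsets sharing that lemma and POS. The helper below is hypothetical, just illustrating the naming scheme:

```python
# Hypothetical helper illustrating the shape of NLTK-style synset IDs
# (head lemma, POS letter, two-digit sense index), e.g. "dog.n.01".

def nltk_synset_ids(lemma, pos, synset_count):
    """Generate NLTK-style IDs for synsets sharing a head lemma and POS."""
    return [f"{lemma}.{pos}.{i:02d}" for i in range(1, synset_count + 1)]

print(nltk_synset_ids("dog", "n", 3))  # -> ['dog.n.01', 'dog.n.02', 'dog.n.03']
```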

The idea is that these two things would allow it to pass the NLTK's doctests. I never got a response to my questions above, so I'm not sure if the goal was to integrate the new system into the NLTK or to just change the documentation to refer to it. There was some more discussion at nltk/nltk_data#160.

Meanwhile it seems that @ekaf has been also doing good work (#2860, nltk/nltk_data#165) to incorporate WordNet 3.1 and to bring the WNDB export of the latest Open English Wordnet (OEWN) to the NLTK. These are good and welcome improvements, but render my earlier efforts somewhat unnecessary. In addition, the WordNet 3.1 and OEWN 2021 synset offsets are not compatible with the OMW's tab files, so users must continue to use WordNet 3.0 for them to work. @fcbond would know if it's difficult to re-export the OMW for these newer wordnets, although in the WN-LMF-formatted lexicons we have a version-independent solution: the interlingual index.
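The version-independence the interlingual index provides can be illustrated with a toy mapping. Synset offsets change between WordNet versions, but a stable ILI identifier lets two versions be joined; all offsets and ILI IDs below are invented for the example:

```python
# Toy illustration of joining two WordNet versions via the ILI.
# All offsets and ILI identifiers below are invented.

wn30_to_ili = {"02084071-n": "i12345"}   # WordNet 3.0 offset -> ILI
wn31_to_ili = {"02086723-n": "i12345"}   # WordNet 3.1 offset -> ILI

ili_to_wn31 = {ili: off for off, ili in wn31_to_ili.items()}

def map_offset(wn30_offset):
    """Map a WordNet 3.0 offset to its 3.1 counterpart via the ILI."""
    return ili_to_wn31.get(wn30_to_ili.get(wn30_offset))

print(map_offset("02084071-n"))  # -> 02086723-n
```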

So I'd like to know if I should stop working on NLTK-compatibility for the Wn library. Certainly @ekaf's solution is a less disruptive way to bring NLTK users up to speed, at least for English. If so, I think this issue can be closed.

@fcbond
Contributor

fcbond commented Nov 3, 2021 via email

@ekaf
Contributor

ekaf commented Nov 3, 2021

Thanks @goodmami for your kind words about my recent updates to the standard wordnet module. It is indeed in better shape now, although the lack of compatibility between OMW and newer Wordnet versions is a shortcoming.

Concerning future plans, a pluralism of approaches is always an advantage and, as @fcbond explains, the new Wn model is very helpful for supporting the LMF format, which NLTK's standard wordnet module doesn't. The combination of ILI and OEWN is certainly promising, and I also look forward to seeing the OMW catching up with the recent Wordnet developments.

On the other hand, OEWN releases still tend to have many more bugs than Princeton WordNet, so people might still want a choice for some time. The standard NLTK wordnet module has much functionality and a large user base, and it remains useful until something better comes along.

@goodmami
Contributor

goodmami commented Nov 3, 2021

We could try to (a) replace the existing module, or (b) give people a choice of both? In terms of long term ease of maintenance, I favor (a).

We could follow what the Python documentation does in similar situations. For instance, urllib.request simply recommends that people use the 3rd-party Requests package, and the re module recommends the 3rd-party regex package, but the standard library modules are still there. Some reasons for not incorporating these into Python's standard library: Python's slower release cycle, the maintenance burden, and little gain since they are readily available on PyPI. On the other hand, adding them to the standard library would make them visible and give them status as the standard way to do things.

So for Wn and NLTK, maybe the docs could point to Wn for the duration of NLTK's 3.* major version, then for 4.0 we can decide if it's a good idea to make the switch? Waiting for 4.0 wouldn't be necessary if we could guarantee a compatible API, however (or does NLTK not use semantic versioning?).

@ekaf
Contributor

ekaf commented Nov 4, 2021

Concerning the maintainability claim, an important factor is the size of the codebase: the existing wordnet module is 2,301 lines, while the new Wn module is more than triple that size at 7,242 lines, while still not feature-complete. Also, who will maintain it? The standard module has had many contributors, while future Wn contributions appear less certain.

@fcbond
Contributor

fcbond commented Nov 4, 2021 via email

@ekaf
Contributor

ekaf commented Nov 4, 2021

Thanks @fcbond. I don't know enough about Wn's features, but I noticed that it doesn't seem to have a "sense key" notion, although sensekeys are the only reliable identifier across Wordnet versions. So I would miss functions like synset_from_sense_key() and lemma_from_key(), which are available in the standard module.
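For readers unfamiliar with sense keys: a key such as dog%1:05:00:: encodes, per the WordNet sense-index documentation, lemma%ss_type:lex_filenum:lex_id:head_word:head_id, where ss_type 1-5 maps to noun, verb, adjective, adverb, and adjective satellite. A minimal parser sketch of that format:

```python
# Minimal sketch of parsing a WordNet sense key of the form
# lemma%ss_type:lex_filenum:lex_id:head_word:head_id, e.g. "dog%1:05:00::".

SS_TYPES = {"1": "n", "2": "v", "3": "a", "4": "r", "5": "s"}

def parse_sense_key(key):
    lemma, rest = key.split("%", 1)
    ss_type, lex_filenum, lex_id, head, head_id = rest.split(":")
    return {
        "lemma": lemma,
        "pos": SS_TYPES[ss_type],
        "lex_filenum": lex_filenum,
        "lex_id": lex_id,
        "head": head,        # only filled for adjective satellites
        "head_id": head_id,
    }

print(parse_sense_key("dog%1:05:00::")["pos"])  # -> n
```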

I totally agree that support for LMF is crucial, especially since it is an ISO standard, and many wordnets in other languages are adopting it, although not all have yet. But if, as http://compling.hss.ntu.edu.sg/omw/ states, "Wordnet-LMF format files are made by combining the tab files with the Princeton wordnet.", this seems to be just old wine in new bottles, without actually adding anything.

So, before it becomes relevant to replace the NLTK module, it would be nice to see a release of OMW that actually brings all the promised new features into play. Until all this is released and tested, discussions seem highly speculative.

@tomaarsen
Member

I would be in favor of moving towards deprecation of the NLTK wordnet module, if Wn is sufficiently feature-complete to replace core NLTK Wordnet functionality. With some time, and the assistance of the Wn documentation, we might be able to provide detailed deprecation warnings at the function/method level. This way, we can help users move to Wn as quickly and smoothly as possible, and it also allows us to deprecate the NLTK wordnet module more quickly.

The NLTK wordnet module would still be available upon deprecation for some time, but maintenance on it could be stopped. Similarly to the Stanford Tools, e.g. StanfordTagger (see #2812).

With the deprecated decorator we can provide these per-function deprecation warnings. Concretely, this might look like:
(In nltk/corpus/reader/wordnet.py)

    ...

    #############################################################
    # Retrieve synsets and lemmas.
    #############################################################

    @deprecated('''\
    The NLTK Wordnet module is deprecated.
    It is recommended to use Wn to replace this method.
    For example:
        >>> import wn
        >>> pwn = wn.Wordnet("pwn", "3.0")
        >>> ss = pwn.synsets("chat", pos="v")
    See https://wn.readthedocs.io/en/latest/guides/nltk-migration.html
    for more information.''')
    def synsets(self, lemma, pos=None, lang="eng", check_exceptions=True):
        """Load all synsets with a given lemma and part of speech tag.
        If no pos is specified, all synsets for all parts of speech
        will be loaded.
        If lang is specified, all the synsets associated with the lemma name
        of that language will be returned.
        """
        lemma = lemma.lower()

        ...

Then, when a user executes a file with:

from nltk.corpus import wordnet as wn

...

wn.synsets("chat", pos="v")

The output becomes:

nltk_2423.py:2: DeprecationWarning: Function synsets() has been deprecated.
    The NLTK Wordnet module is deprecated.
    It is recommended to use Wn to replace this method.
    For example:
        >>> import wn
        >>> pwn = wn.Wordnet("pwn", "3.0")
        >>> ss = pwn.synsets("chat", pos="v")
    See https://wn.readthedocs.io/en/latest/guides/nltk-migration.html
    for more information.
  wn.synsets("chat", pos="v")

Note that this does not require a 1:1 mapping of functions between NLTK and Wn. If Wn has a different workflow to create the same output, then that is also fine, and can be specified both in the deprecation comment and the documentation. Long story short - I'm trying to avoid simply warning users with "The NLTK Wordnet module is deprecated. Use Wn instead.", as I don't think that will result in proper adoption of Wn by NLTK users.

If we indeed want to go this route, then I would like to perform some small changes to the deprecated decorator:

  • Expose a flag on the decorator which allows the indentation and newlines of the deprecation message to be preserved. (Currently it's always normalized; I removed that line for the above output.)
  • Allow f-string formatting on variable names to be applied on the deprecation message. In short, a method that has self, lemma, pos=None, lang="eng", check_exceptions=True as the signature can then use a deprecation message like:
    @deprecated('''\
    The NLTK Wordnet module is deprecated.
    It is recommended to use Wn to replace this method.
    For example:
        >>> import wn
        >>> pwn = wn.Wordnet("pwn", "3.0")
        >>> ss = pwn.synsets({lemma}, pos={pos if pos else "v"}, lang={lang if lang else "eng"})
    See https://wn.readthedocs.io/en/latest/guides/nltk-migration.html
    for more information.''')
  • A global and simple way to hide these warnings.

These deprecation changes would likely be easy to implement, so that is not a concern.
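As a rough sketch of the proposed behaviour (this is not NLTK's actual `deprecated` implementation, just a minimal stand-in): bind the decorated function's arguments via `inspect.signature` and format them into the message, keeping the message text verbatim.

```python
import functools
import inspect
import warnings

# Sketch of the proposed decorator behaviour, not NLTK's actual
# `deprecated`: format the message with the call's bound arguments.

def deprecated(message):
    def decorator(func):
        sig = inspect.signature(func)
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            bound = sig.bind(*args, **kwargs)
            bound.apply_defaults()
            warnings.warn(message.format(**bound.arguments),
                          DeprecationWarning, stacklevel=2)
            return func(*args, **kwargs)
        return wrapper
    return decorator

@deprecated("synsets() is deprecated; try wn.synsets({lemma!r}, pos={pos!r})")
def synsets(lemma, pos=None):
    return []

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    synsets("chat", pos="v")
print(str(caught[0].message))
# -> synsets() is deprecated; try wn.synsets('chat', pos='v')
```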

  • Tom Aarsen

@ekaf
Contributor

ekaf commented Nov 4, 2021

globalwordnet/cili#9 (comment) mentions "a backlog of several years" in assigning the envisioned ILI identifiers. This leaves ample time to prepare the deprecation warnings.

@goodmami
Contributor

goodmami commented Nov 4, 2021

it doesn't seem to have a "sense key" notion, although sensekeys are the only reliable identifier across Wordnet versions

This is because sense keys are not a core part of the WN-LMF format. In EWN-2019 and EWN-2020, these were conventionally encoded in the dc:identifier attribute on <Sense> elements, and in OEWN-2021 they are simply the id of those elements, transformed slightly for xsd:id compatibility. The WN-LMF format allows for ili IDs for synsets, which are stable across versions and even across languages, but there is no equivalent mechanism for senses.

In the upcoming OMW 1.4 release we will provide an export of WordNet 3.0 and 3.1 that is more faithful to the source data than the currently distributed one. It will have sense keys, in their original form, as dc:identifier attributes, and it will also encode NLTK-style synset IDs (e.g., entity.n.01) using dc:identifier on <Synset> elements. This will allow the NLTK-compatibility shim to use the synset_from_sense_key() and lemma_from_key() functions, as well as the synset() function, but I still need to test this.
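Reading those dc:identifier attributes from a WN-LMF file is straightforward with the standard library; the fragment and ILI value below are invented for illustration, but the Dublin Core namespace and the attribute placement follow the convention described above:

```python
import xml.etree.ElementTree as ET

# Illustration of reading dc:identifier attributes from WN-LMF
# <Synset> elements. The fragment and ILI value are invented.

DC = "http://purl.org/dc/elements/1.1/"
fragment = """
<LexicalResource xmlns:dc="http://purl.org/dc/elements/1.1/">
  <Lexicon id="omw-en">
    <Synset id="omw-en-00001740-n" ili="i12345" dc:identifier="entity.n.01"/>
  </Lexicon>
</LexicalResource>
"""

root = ET.fromstring(fragment)
ids = {s.get("id"): s.get(f"{{{DC}}}identifier")
       for s in root.iter("Synset")}
print(ids)  # -> {'omw-en-00001740-n': 'entity.n.01'}
```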

"a backlog of several years" in attributing the envisioned ILI identifiers

That backlog is for new ILIs introduced since the fork from the Princeton WordNet, and currently all the synsets in the OMW wordnets are present in WordNet 3.0, so I don't think that is relevant here. However I'd agree that there's no immediate rush to switch, except for those NLTK users looking for additional features. That is, for those with greater needs, there will be a migration path laid out, while others can continue to use the old module for a while.

@tomaarsen to that point I think the deprecation wording may be a bit strong. Maybe "will be deprecated in version X.Y.Z" instead of "is deprecated", and keep it that way for a version or two so users aren't caught off-guard. Your thoughts on changes to the deprecation message formatting sound good to me.

@tomaarsen
Member

to that point I think the deprecation wording may be a bit strong. Maybe "will be deprecated in version X.Y.Z" instead of "is deprecated", and keep it that way for a version or two so users aren't caught off-guard. Your thoughts on changes to the deprecation message formatting sound good to me.

The phrasing of the deprecation messages isn't an issue. My example was very much just an example. My main point was to advocate in favour of clear deprecation messages with code snippets. The details of the phrasing can be discussed if we decide to move forward 👍

@goodmami
Contributor

An update: I've pushed the current state of my NLTK-compatibility shim to Wn in the nltk branch. Here is the relevant file: https://github.com/goodmami/wn/blob/nltk/wn/nltk_api.py

It uses Wn under the hood, but the functions and even class representations mimic the NLTK's. You'll need OMW 1.4:

$ python -m wn download omw:1.4

And here is what it looks like:

>>> from wn import nltk_api as wn
>>> wn.synsets('dog')
[Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01'), Synset('chase.v.01')]
>>> wn.synsets('犬', lang='jpn')
[Synset('dog.n.01'), Synset('spy.n.01')]
>>> wn.lemmas('犬', lang='jpn')
[Lemma('dog.n.01.犬'), Lemma('spy.n.01.犬')]
>>> wn.synsets('犬', lang='jpn')[0].hypernyms()  # works but returns Wn-native synsets
[Synset('omw-en-02083346-n'), Synset('omw-en-01317541-n')]
>>> dog = wn.synsets('dog')[0]
>>> cat = wn.synsets('cat')[0]
>>> wn.path_similarity(dog, cat)
0.2
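For reference, the 0.2 above is what WordNet's path similarity definition, 1 / (shortest-path distance + 1), gives when two synsets lie four edges apart. Assuming the usual WordNet 3.0 hypernym chain (dog → canine → carnivore ← feline ← cat), a toy reimplementation over that slice of the graph reproduces it:

```python
from collections import deque

# Toy path similarity: 1 / (shortest undirected hypernym-path distance + 1),
# over a small hand-built slice of the noun hierarchy.

edges = {
    "dog.n.01": ["canine.n.02"],
    "canine.n.02": ["carnivore.n.01"],
    "cat.n.01": ["feline.n.01"],
    "feline.n.01": ["carnivore.n.01"],
    "carnivore.n.01": [],
}

def path_similarity(a, b):
    # Build the undirected graph, then BFS for the shortest distance.
    graph = {n: set(ns) for n, ns in edges.items()}
    for n, ns in edges.items():
        for m in ns:
            graph.setdefault(m, set()).add(n)
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return 1 / (dist + 1)
        for nxt in graph[node] - seen:
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return None

print(path_similarity("dog.n.01", "cat.n.01"))  # -> 0.2
```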

There are still a number of things unimplemented or partially implemented. If anyone is interested in finishing up the module, I'd be very happy to receive PRs.
