cmn wordnet folder name and data .tab name doesn't match #2905

Closed
badGarnet opened this issue Dec 7, 2021 · 14 comments

@badGarnet

nltk==3.6.5
Tested on python==3.8.12
The Chinese (cmn) wordnet data is now stored in corpora/omw/cow/wn-data-cmn.tab, but the code assumes that the folder name and the language suffix of the data file are the same:

with self._omw_reader.open("{0:}/wn-data-{0:}.tab".format(lang)) as fp:
    self.custom_lemmas(fp, lang)

Loading the cmn data now raises an error, e.g.:

❯ NLTK_DATA=/tmp ipython
Python 3.8.12 (default, Oct  5 2021, 17:04:41)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.30.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from nltk.corpus import wordnet as wn

In [2]: wn._load_lang_data("cmn")
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-2-718659b9166c> in <module>
----> 1 wn._load_lang_data("cmn")

~/.pyenv/versions/3.8.12/envs/pcore-dec7/lib/python3.8/site-packages/nltk/corpus/reader/wordnet.py in _load_lang_data(self, lang)
   1195             raise WordNetError("Language is not supported.")
   1196
-> 1197         with self._omw_reader.open("{0:}/wn-data-{0:}.tab".format(lang)) as fp:
   1198             self.custom_lemmas(fp, lang)
   1199

~/.pyenv/versions/3.8.12/envs/pcore-dec7/lib/python3.8/site-packages/nltk/corpus/reader/api.py in open(self, file)
    228         """
    229         encoding = self.encoding(file)
--> 230         stream = self._root.join(file).open(encoding)
    231         return stream
    232

~/.pyenv/versions/3.8.12/envs/pcore-dec7/lib/python3.8/site-packages/nltk/data.py in join(self, fileid)
    332     def join(self, fileid):
    333         _path = os.path.join(self._path, fileid)
--> 334         return FileSystemPathPointer(_path)
    335
    336     def __repr__(self):

~/.pyenv/versions/3.8.12/envs/pcore-dec7/lib/python3.8/site-packages/nltk/compat.py in _decorator(*args, **kwargs)
     39     def _decorator(*args, **kwargs):
     40         args = (args[0], add_py3_data(args[1])) + args[2:]
---> 41         return init_func(*args, **kwargs)
     42
     43     return wraps(init_func)(_decorator)

~/.pyenv/versions/3.8.12/envs/pcore-dec7/lib/python3.8/site-packages/nltk/data.py in __init__(self, _path)
    310         _path = os.path.abspath(_path)
    311         if not os.path.exists(_path):
--> 312             raise OSError("No such file or directory: %r" % _path)
    313         self._path = _path
    314

OSError: No such file or directory: '/tmp/corpora/omw/cmn/wn-data-cmn.tab'
@stevenbird
Member

Can any of you advise please, @ekaf, @goodmami, @fcbond

@ekaf
Contributor

ekaf commented Dec 7, 2021

Yes, the directory structure of the new OMW version 1.4 (PR nltk/nltk_data#171) has changed: the directory name now denotes provenance and no longer always matches the language name. This only affects a few languages, in particular 'cmn', which is now in the 'cow' folder.
To interpret the new structure, you need the updated functions in wordnet.py (PR #2899), which has now been merged into NLTK's 'develop' branch.
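
As a rough illustration of the layout change (this is not the code from #2899), the sketch below looks up each language by the file name rather than by the folder name. The helper `build_lang_index` is hypothetical, and it assumes only the `<folder>/wn-data-<lang>.tab` naming shown in this issue.

```python
from pathlib import Path


def build_lang_index(omw_root):
    """Map each language code to the wn-data-<lang>.tab file found under omw_root.

    Hypothetical helper: the folder name is treated as provenance only,
    so 'cmn' is found whether it lives in 'cmn/' or in 'cow/'.
    """
    index = {}
    for tab in sorted(Path(omw_root).glob("*/wn-data-*.tab")):
        lang = tab.stem[len("wn-data-"):]  # 'wn-data-cmn' -> 'cmn'
        index.setdefault(lang, tab)  # keep the first file seen per language
    return index
```

With such an index, a lookup for 'cmn' would resolve to the file in the 'cow' folder instead of failing on a non-existent 'cmn' folder.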

@tomaarsen
Member

@badGarnet Feel free to use the develop branch of nltk until the solution is published in the upcoming version.
I believe that would be:

pip install git+https://github.com/nltk/nltk.git 

@goodmami
Contributor

goodmami commented Dec 8, 2021

But this will remain broken for anyone using an older version of the NLTK. Is it possible to have versioned data that the code can select for? E.g., omw.zip in nltk_data would be reverted to the previous version to support users of the current and previous versions of the NLTK, while something like omw-1.4.zip could be the new data that could be selected by the next code release of the NLTK?
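
A minimal sketch of what such a selection could look like, assuming the package names `omw-1.4` and `omw` from the suggestion above. The function `pick_omw_package` is hypothetical and is not how the NLTK resolves data packages.

```python
from pathlib import Path


def pick_omw_package(corpora_dir, candidates=("omw-1.4", "omw")):
    """Return the first candidate package present under corpora_dir, or None.

    A package counts as present if either its unpacked folder or its
    .zip archive exists. The candidate order encodes the preference.
    """
    corpora = Path(corpora_dir)
    for name in candidates:
        if (corpora / name).is_dir() or (corpora / f"{name}.zip").is_file():
            return name
    return None
```

A newer code release would prefer `omw-1.4` and fall back to `omw`, while older releases would keep looking only for `omw` and never see the restructured data.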

@tomaarsen
Member

I would be in favor of any method that allows current users to continue to use OMW without being forced to update NLTK. I raised some concerns about this in #2899 (review) too.
The suggestion by @goodmami seems like a proper solution. We could then, in nltk, look solely for omw-1.4.zip, while older versions still use the (working) omw.zip.
@ekaf What are your thoughts on this?

@ekaf
Contributor

ekaf commented Dec 8, 2021

Yes, I was just thinking exactly the same. We could have different OMW packages, just as we now have a choice of different WordNet versions.

@tomaarsen
Member

So, from now onwards, this is what needs to be done, if I didn't forget anything:

  • Rename the current omw.zip to omw-1.4.zip in nltk_data.
  • Revert omw.zip in nltk_data.
  • Modify nltk to use omw-1.4.zip by default.

Would there be a use case for accessing the old OMW data, or would we want NLTK 3.7.0 onwards to use omw-1.4.zip exclusively, while omw.zip exists solely for compatibility reasons?

@ekaf
Contributor

ekaf commented Dec 8, 2021

@tomaarsen, your plan sounds good and feasible. Omw-1.4 supersedes the old omw package by adding a few new languages, plus definitions and examples, while the lemmas remain unchanged. So there is no case for preferring the old package, except for compatibility with old NLTK versions.
Fortunately, the updated OMW functions in wordnet.py support both the old and the new omw packages seamlessly. But the older wordnet.py does not handle omw-1.4 well, because a few languages have moved to a different directory than expected.

@ekaf
Contributor

ekaf commented Dec 8, 2021

There actually is another way to fix this problem: since the 'cmn' language now is in the 'cow' directory, it would be sufficient to add a 'cmn' symlink, pointing to the 'cow' directory, and do the same for a small number of languages that have this problem. However, I'm not sure that symlinks exist in Windows (I haven't used it for a while :).

@tomaarsen
Member

They do exist, I believe. That said, I'm not sure if you can create a symlink within the .zip in e.g. Linux, and have it work when unzipping in Windows.
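
To check whether an archive contains symlink entries at all, something like the following sketch could be used. It relies on the Unix file mode that zip tools on Unix-like systems store in the upper 16 bits of `ZipInfo.external_attr`; the helper name `symlink_entries` is made up for this example. Whether such an entry is recreated as a real link on extraction depends on the extraction tool and platform, which this sketch does not address.

```python
import stat
import zipfile


def symlink_entries(zip_path):
    """Return names of archive members whose stored Unix mode marks them as symlinks."""
    with zipfile.ZipFile(zip_path) as zf:
        return [
            info.filename
            for info in zf.infolist()
            # The upper 16 bits of external_attr carry the Unix st_mode
            # when the archive was written on a Unix-like system.
            if stat.S_ISLNK(info.external_attr >> 16)
        ]
```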

@goodmami
Contributor

goodmami commented Dec 8, 2021

One concern I have is with replicability of experiments. E.g., if someone creates a research project with a dependency on the NLTK, someone downloading that project code and fetching the relevant data should get the same results.

I agree, however, that there's no real use-case for providing both versions of OMW in one version of NLTK. That hypothetical research project would be pinned to an older version of the NLTK in order to fetch the older data. While the data for both remains up on a server and it seems like the NLTK should be able to fetch either one, we would: (a) have to ensure the code actually works with both versions, and (b) have to have some mechanism for selecting one version or another. For instance, in Wn I chose to allow versions on lexicon identifiers (e.g., wn.synsets('gato', lexicon='omw-es:1.4')). But that seems like a major change for the NLTK.

I also don't think symlinking is a great solution, not only for the technical pitfalls, but for the same replicability reasons.

@tomaarsen I think your proposal makes sense. I'm worried about those users that fetched the data in the meantime. We may need to post some instructions for deleting and re-downloading the OMW data, or, alternatively, for upgrading to the latest code version, depending on their needs.
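
For the "delete and re-download" route, a sketch along these lines might do. It assumes the data lives under `<data_root>/corpora/omw`, as in the path from the report, possibly with an `omw.zip` archive next to it; `remove_package` is a hypothetical helper, not an NLTK function.

```python
import shutil
from pathlib import Path


def remove_package(data_root, name="omw"):
    """Delete corpora/<name> and corpora/<name>.zip under data_root.

    Returns the list of paths that were actually removed.
    """
    corpora = Path(data_root) / "corpora"
    folder, archive = corpora / name, corpora / f"{name}.zip"
    removed = []
    if folder.is_dir():
        shutil.rmtree(folder)
        removed.append(folder)
    if archive.is_file():
        archive.unlink()
        removed.append(archive)
    return removed
```

After removing the stale copy, calling `nltk.download("omw")` should fetch whatever version of the package the server currently provides.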

@ekaf
Contributor

ekaf commented Dec 8, 2021

Symlinks would not work as a general solution for providing omw-1.4 compatibility to older NLTK versions, because there are also other problems. With some languages, the new data would fail to load anyway, since the parser expects only triples, while the new definitions and examples have four fields. However, this additional field is an ordinal number, which we don't actually use, so it would be possible to remove it in order to have only triples.
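
To make the triple versus four-field difference concrete, here is a sketch of a tolerant line parser. It is not the parser in wordnet.py; the function name `parse_tab_line` and the assumed field order (synset key, label, optional ordinal, value) are illustrative assumptions.

```python
def parse_tab_line(line):
    """Split one tab-separated line into (offset_pos, label, value), or None to skip it.

    Accepts three fields (lemma rows) or four fields (definition/example
    rows, where the third field is an ordinal that is ignored here).
    """
    if line.startswith("#"):
        return None  # header or comment line
    fields = line.rstrip("\n").split("\t")
    if len(fields) not in (3, 4):
        return None  # malformed row
    offset_pos, label = fields[0], fields[1]
    value = fields[-1]  # last field, whether or not an ordinal precedes it
    return offset_pos, label, value
```

Taking the first two fields and the last one sidesteps the ordinal without having to rewrite the data files.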

@tomaarsen
Member

> I'm worried about those users that fetched the data in the meantime. We may need to post some instructions for deleting and re-downloading the OMW data, or, alternatively, for upgrading to the latest code version, depending on their needs.

We have similar issues with the inaugural corpus, which was broken for several days. The zip file of the corpus did not include a folder with files, but rather it was just a zip file with files directly. Everyone who downloaded the corpus until I fixed it a few days afterwards now has a bunch of arbitrary .txt files in their nltk_data/corpora folder.

Let's leave the idea of symlinks as a potential solution, then.

> E.g., if someone creates a research project with a dependency on the NLTK, someone downloading that project code and fetching the relevant data should get the same results.

If they use the same version of NLTK, then they will. And if they opt to use NLTK >= 3.7.0 (assuming we go through with the changes from #2905 (comment) and release this under 3.7.0), then they will get different results. It's also not particularly shocking that different releases of a module such as NLTK would give slightly different results for certain functions.

I believe that would be the best course of action, as it still leaves users with the opportunity to use the old data: by using some NLTK version lower than 3.7.0.

@tomaarsen
Member

I believe this has now been solved both on develop and for older NLTK versions. From the upcoming release onwards (including the current develop branch), OMW 1.4 will be used. On all prior versions, the previous version of OMW will continue to be accessible.
