`cmn` wordnet folder name and data `.tab` name doesn't match #2905

badGarnet · 2021-12-07T16:33:44Z

nltk==3.6.5
Tested on python==3.8.12
The Chinese-language wordnet data is now stored in corpora/omw/cow/wn-data-cmn.tab.
But the code assumes that the folder name and the wordnet data file suffix are the same:

nltk/nltk/corpus/reader/wordnet.py

Lines 1253 to 1254 in f50b6b1

    with self._omw_reader.open("{0:}/wn-data-{0:}.tab".format(lang)) as fp:
        self.custom_lemmas(fp, lang)

Loading the cmn data now raises an error, e.g.:

❯ NLTK_DATA=/tmp ipython
Python 3.8.12 (default, Oct  5 2021, 17:04:41)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.30.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from nltk.corpus import wordnet as wn

In [2]: wn._load_lang_data("cmn")
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-2-718659b9166c> in <module>
----> 1 wn._load_lang_data("cmn")

~/.pyenv/versions/3.8.12/envs/pcore-dec7/lib/python3.8/site-packages/nltk/corpus/reader/wordnet.py in _load_lang_data(self, lang)
   1195             raise WordNetError("Language is not supported.")
   1196
-> 1197         with self._omw_reader.open("{0:}/wn-data-{0:}.tab".format(lang)) as fp:
   1198             self.custom_lemmas(fp, lang)
   1199

~/.pyenv/versions/3.8.12/envs/pcore-dec7/lib/python3.8/site-packages/nltk/corpus/reader/api.py in open(self, file)
    228         """
    229         encoding = self.encoding(file)
--> 230         stream = self._root.join(file).open(encoding)
    231         return stream
    232

~/.pyenv/versions/3.8.12/envs/pcore-dec7/lib/python3.8/site-packages/nltk/data.py in join(self, fileid)
    332     def join(self, fileid):
    333         _path = os.path.join(self._path, fileid)
--> 334         return FileSystemPathPointer(_path)
    335
    336     def __repr__(self):

~/.pyenv/versions/3.8.12/envs/pcore-dec7/lib/python3.8/site-packages/nltk/compat.py in _decorator(*args, **kwargs)
     39     def _decorator(*args, **kwargs):
     40         args = (args[0], add_py3_data(args[1])) + args[2:]
---> 41         return init_func(*args, **kwargs)
     42
     43     return wraps(init_func)(_decorator)

~/.pyenv/versions/3.8.12/envs/pcore-dec7/lib/python3.8/site-packages/nltk/data.py in __init__(self, _path)
    310         _path = os.path.abspath(_path)
    311         if not os.path.exists(_path):
--> 312             raise OSError("No such file or directory: %r" % _path)
    313         self._path = _path
    314

OSError: No such file or directory: '/tmp/corpora/omw/cmn/wn-data-cmn.tab'
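
To make the mismatch concrete, the snippet below evaluates the same format string that `_load_lang_data` uses and compares the result with the location reported above. It is only an illustration; the `actual` value is copied from this report, relative to corpora/omw.

```python
# Same format string as in nltk/corpus/reader/wordnet.py (see the excerpt above).
lang = "cmn"
expected = "{0:}/wn-data-{0:}.tab".format(lang)

# Location reported above for the new OMW data, relative to corpora/omw.
actual = "cow/wn-data-cmn.tab"

print(expected)            # cmn/wn-data-cmn.tab
print(expected == actual)  # False
```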


stevenbird · 2021-12-07T21:12:46Z

Can any of you advise, please, @ekaf, @goodmami, @fcbond?

ekaf · 2021-12-07T23:29:38Z

Yes, the directory structure of the new OMW version 1.4 (PR nltk/nltk_data#171) has changed: the directory name now denotes provenance and no longer always matches the language name. This only affects a few languages, in particular 'cmn', which is now in the 'cow' folder.
To interpret the new structure, you need the updated functions in wordnet.py from PR #2899, which has now been merged into NLTK's 'develop' branch.
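
As a rough sketch of what "directory denotes provenance" implies for lookup code, the language can be read from the file name instead of the folder name. This is not the code from #2899; `find_omw_files` and the miniature directory layout are hypothetical and only assume the `wn-data-<lang>.tab` naming seen in this issue.

```python
import tempfile
from pathlib import Path


def find_omw_files(omw_root):
    """Map each language code to its .tab file, whatever folder it is in."""
    prefix = "wn-data-"
    found = {}
    for path in sorted(Path(omw_root).glob("*/wn-data-*.tab")):
        lang = path.stem[len(prefix):]  # "wn-data-cmn" -> "cmn"
        found[lang] = path
    return found


# Hypothetical miniature of the new layout, built in a temporary directory.
with tempfile.TemporaryDirectory() as root:
    for folder, lang in [("cow", "cmn"), ("fra", "fra")]:
        target = Path(root) / folder / f"wn-data-{lang}.tab"
        target.parent.mkdir(parents=True)
        target.touch()

    files = find_omw_files(root)
    print(files["cmn"].parent.name)  # cow
```

With a mapping like this, a loader no longer needs the folder and the language code to agree.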

tomaarsen · 2021-12-08T13:27:32Z

@badGarnet Feel free to use the develop branch of nltk until the solution is published in the upcoming version.
I believe that would be:

pip install git+https://github.com/nltk/nltk.git

goodmami · 2021-12-08T15:03:14Z

But this will remain broken for anyone using an older version of the NLTK. Is it possible to have versioned data that the code can select for? E.g., omw.zip in nltk_data would be reverted to the previous version to support users of the current and previous versions of the NLTK, while something like omw-1.4.zip could be the new data that could be selected by the next code release of the NLTK?

tomaarsen · 2021-12-08T15:06:24Z

I would be in favor of any method that allows current users to continue to use OMW without being forced to update NLTK. I raised some concerns about this in #2899 (review) too.
The suggestion by @goodmami seems like a proper solution. We could then, in nltk, look solely for omw-1.4.zip, while older versions still use the (working) omw.zip.
@ekaf What are your thoughts on this?

ekaf · 2021-12-08T15:10:04Z

Yes, I was actually just thinking exactly the same. We could have different OMW packages, just like we now have a choice of different WordNet versions.

tomaarsen · 2021-12-08T15:13:12Z

So, from now onwards, this is what needs to be done, if I didn't forget anything:

Rename the current omw.zip to omw-1.4.zip in nltk_data.
Revert omw.zip in nltk_data.
Modify nltk to use omw-1.4.zip by default.

Would there be a use case of accessing the old OMW data, or would we want NLTK 3.7.0 onwards to exclusively use omw-1.4.zip, while omw.zip exclusively exists for compatibility reasons?
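
The pairing this plan would produce can be pictured as a simple version check. This is only a sketch: `omw_package_for` is a hypothetical helper, it assumes the change ships in 3.7.0 as suggested above, and it only handles plain numeric version strings. Under the plan itself, each NLTK release would just look for one package name.

```python
def omw_package_for(nltk_version):
    """Return the OMW data package an NLTK version would use under this plan."""
    # Only the major and minor components matter for the comparison.
    major, minor = (int(part) for part in nltk_version.split(".")[:2])
    return "omw-1.4" if (major, minor) >= (3, 7) else "omw"


print(omw_package_for("3.6.5"))  # omw
print(omw_package_for("3.7.0"))  # omw-1.4
```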

ekaf · 2021-12-08T15:33:44Z

@tomaarsen, your plan sounds good and feasible. Omw-1.4 supersedes the old omw package by adding a few new languages, plus definitions and examples, while the lemmas remain unchanged. So there is no case for preferring the old package, except for compatibility with old NLTK versions.
Fortunately, the updated OMW functions in wordnet.py support both the old and the new omw packages seamlessly. But the older wordnet.py does not handle omw-1.4 well, because a few languages have moved to a different directory than expected.

ekaf · 2021-12-08T15:41:03Z

There actually is another way to fix this problem: since the 'cmn' language is now in the 'cow' directory, it would be sufficient to add a 'cmn' symlink pointing to the 'cow' directory, and to do the same for the small number of other languages that have this problem. However, I'm not sure that symlinks exist in Windows (I haven't used it for a while :).
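
A minimal sketch of the symlink idea, run against a throwaway directory rather than a real nltk_data folder. The layout and placeholder file are hypothetical, and on Windows `os.symlink` may raise an error unless the process has the privilege to create links.

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as root:
    # Hypothetical stand-in for corpora/omw with the data in the 'cow' folder.
    omw = Path(root) / "omw"
    (omw / "cow").mkdir(parents=True)
    (omw / "cow" / "wn-data-cmn.tab").write_text("# placeholder\n", encoding="utf-8")

    # Relative link omw/cmn -> cow, so the old-style path resolves again.
    os.symlink("cow", omw / "cmn", target_is_directory=True)

    old_style_path = omw / "cmn" / "wn-data-cmn.tab"
    link_works = old_style_path.exists()
    print(link_works)  # True
```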

tomaarsen · 2021-12-08T15:46:20Z

They do exist, I believe. That said, I'm not sure if you can create a symlink within the .zip in e.g. Linux, and have it work when unzipping in Windows.

goodmami · 2021-12-08T17:05:10Z

One concern I have is with replicability of experiments. E.g., if someone creates a research project with a dependency on the NLTK, someone downloading that project code and fetching the relevant data should get the same results.

I agree, however, that there's no real use-case for providing both versions of OMW in one version of NLTK. That hypothetical research project would be pinned to an older version of the NLTK in order to fetch the older data. While the data for both remains up on a server and it seems like the NLTK should be able to fetch either one, we would: (a) have to ensure the code actually works with both versions, and (b) have to have some mechanism for selecting one version or another. For instance, in Wn I chose to allow versions on lexicon identifiers (e.g., wn.synsets('gato', lexicon='omw-es:1.4')). But that seems like a major change for the NLTK.

I also don't think symlinking is a great solution, not only for the technical pitfalls, but for the same replicability reasons.

@tomaarsen I think your proposal makes sense. I'm worried about those users that fetched the data in the meantime. We may need to post some instructions for deleting and re-downloading the OMW data, or, alternatively, for upgrading to the latest code version, depending on their needs.

ekaf · 2021-12-08T17:09:07Z

Symlinks would not work as a general solution to provide omw-1.4 compatibility to older NLTK versions, because there are also other problems. With some languages, the new data would fail to load anyway, since the parser expects only triples, while the new definitions and examples have four fields. However, this additional field is an ordinal number, which we don't actually use, so it would be possible to remove it in order to have only triples.
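
For illustration, a parser can tolerate both shapes by taking the first two fields and the last one, which skips an ordinal sitting in between. This is a sketch, not the actual wordnet.py code: `parse_omw_line` and the sample lines are hypothetical, and it assumes the ordinal is the third of the four tab-separated fields.

```python
def parse_omw_line(line):
    """Return (offset_pos, label, value), or None for comments and short lines."""
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        return None
    # The first two fields identify the synset and the kind of entry; the
    # value is taken from the end, so an ordinal in between is ignored.
    return fields[0], fields[1], fields[-1]


# Hypothetical sample lines in the two shapes described above.
lemma_line = "00001740-n\tcmn:lemma\t实体\n"
def_line = "00001740-n\tcmn:def\t0\tsample definition text\n"

print(parse_omw_line(lemma_line))  # ('00001740-n', 'cmn:lemma', '实体')
print(parse_omw_line(def_line))    # ('00001740-n', 'cmn:def', 'sample definition text')
```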

tomaarsen · 2021-12-08T17:44:01Z

> I'm worried about those users that fetched the data in the meantime. We may need to post some instructions for deleting and re-downloading the OMW data, or, alternatively, for upgrading to the latest code version, depending on their needs.

We have similar issues with the inaugural corpus, which was broken for several days. The zip file of the corpus did not include a folder with files, but rather it was just a zip file with files directly. Everyone who downloaded the corpus until I fixed it a few days afterwards now has a bunch of arbitrary .txt files in their nltk_data/corpora folder.

Let's drop the idea of symlinks as a potential solution, then.

> E.g., if someone creates a research project with a dependency on the NLTK, someone downloading that project code and fetching the relevant data should get the same results.

If they use the same version of NLTK, then they will. And if they opt to use NLTK >= 3.7.0 (assuming we go through with the changes from #2905 (comment) and release this under 3.7.0), then they will get different results. It's also not particularly shocking that different releases of a module such as NLTK would give slightly different results for certain functions.

I believe that would be the best course of action, as it still leaves users with the opportunity to use the old data: by using some NLTK version lower than 3.7.0.

tomaarsen · 2021-12-14T13:42:17Z

I believe this has now been solved both on develop and for older NLTK versions. From the upcoming release onwards (including the current develop branch), OMW 1.4 will be used. On all prior versions, the previous version of OMW will continue to be accessible.

ekaf mentioned this issue Dec 8, 2021

Support OMW 1.4 #2899

Merged

tomaarsen added critical wordnet labels Dec 8, 2021

This was referenced Dec 8, 2021

OMW compatibility with old NLTK versions nltk/nltk_data#175

Merged

Renamed omw to omw-1.4 #2907

Merged

tomaarsen mentioned this issue Dec 9, 2021

Add omw-1.4.xml to allow OMW 1.4 to be downloaded nltk/nltk_data#176

Merged

tomaarsen closed this as completed Dec 14, 2021
