cmn wordnet folder name and data .tab name doesn't match #2905
Comments
Yes, the directory structure of the new OMW version 1.4 (PR nltk/nltk_data#171) has changed: the directory name now denotes provenance and no longer always matches the language name. This only affects a few languages, in particular 'cmn', which is now in the 'cow' folder.
@badGarnet Feel free to use the
But this will remain broken for anyone using an older version of the NLTK. Is it possible to have versioned data that the code can select for? E.g.,
I would be in favor of any method that allows current users to continue to use OMW without being forced to update NLTK. I raised some concerns about this in #2899 (review) too.
Yes, I was actually just thinking exactly the same. We could have different OMW packages, just like we now have a choice of different Wordnet versions.
So, from now onwards, this is what needs to be done, if I didn't forget anything:
Would there be a use case of accessing the old OMW data, or would we want NLTK 3.7.0 onwards to exclusively use
@tomaarsen, your plan sounds good and feasible. Omw-1.4 supersedes the old omw package by adding a few new languages, plus definitions and examples, while the lemmas remain unchanged. So there is no case for preferring the old package, except for compatibility with old NLTK versions.
There actually is another way to fix this problem: since the 'cmn' language is now in the 'cow' directory, it would be sufficient to add a 'cmn' symlink pointing to the 'cow' directory, and do the same for the small number of other languages that have this problem. However, I'm not sure that symlinks exist in Windows (I haven't used it for a while :).
They do exist, I believe. That said, I'm not sure if you can create a symlink within the .zip in e.g. Linux, and have it work when unzipping in Windows.
One concern I have is with replicability of experiments. E.g., if someone creates a research project with a dependency on the NLTK, someone downloading that project code and fetching the relevant data should get the same results. I agree, however, that there's no real use-case for providing both versions of OMW in one version of NLTK. That hypothetical research project would be pinned to an older version of the NLTK in order to fetch the older data.
While the data for both remains up on a server and it seems like the NLTK should be able to fetch either one, we would: (a) have to ensure the code actually works with both versions, and (b) have to have some mechanism for selecting one version or another. For instance, in Wn I chose to allow versions on lexicon identifiers (e.g.,
I also don't think symlinking is a great solution, not only for the technical pitfalls, but for the same replicability reasons.
@tomaarsen I think your proposal makes sense. I'm worried about those users that fetched the data in the meantime. We may need to post some instructions for deleting and re-downloading the OMW data, or, alternatively, for upgrading to the latest code version, depending on their needs.
Symlinks would not work as a general solution to provide omw-1.4 compatibility to older NLTK versions, because there are also other problems. With some languages, the new data would fail to load anyway, since the parser expects solely triples, while the new definitions and examples have four fields. However, this additional field is an ordinal number, which we don't actually use, so it would be possible to remove it, in order to have only triples,
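To illustrate the triple versus four-field problem described above, here is a minimal sketch of a tolerant line parser. It is not the NLTK reader itself: the function name `parse_omw_line` is hypothetical, the sample lines are made up, and it assumes the unused ordinal sits between the label and the value, so that the first two fields and the last field are the ones that matter.

```python
from typing import Optional, Tuple


def parse_omw_line(line: str) -> Optional[Tuple[str, str, str]]:
    """Return (offset_pos, label, value), or None for comments and short lines.

    Hypothetical helper: accepts both three-field lemma lines and
    four-field definition/example lines by ignoring any middle fields.
    """
    if line.startswith("#"):
        return None  # header or comment line
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 3:
        return None  # malformed or empty line
    offset_pos, label = fields[:2]
    value = fields[-1]  # last field, so an extra ordinal is skipped
    return offset_pos, label, value


# Made-up sample lines, for illustration only.
lemma_line = "00001740-n\tcmn:lemma\texample-lemma\n"
def_line = "00001740-n\tcmn:def\t0\texample definition text\n"
```

With this approach the same code path handles old and new files, at the cost of silently discarding the ordinal.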
We have similar issues with the
Let's leave the idea of symlinks as a potential solution, then.
If they use the same version of NLTK, then it will. And if they opt to use NLTK >= 3.7.0 (assuming we go through with the changes from #2905 (comment) and release this under 3.7.0), then they will experience different results. It's also not particularly shocking that different releases of a module such as NLTK would have slightly different results on certain functions. I believe that would be the best course of action, as it still leaves users with the opportunity to use the old data: by using some NLTK version lower than 3.7.0.
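The version split discussed here can be sketched as a small selection function. This is only an illustration of the idea, not code from NLTK: the function name `omw_package_for` is hypothetical, the package identifiers "omw" and "omw-1.4" are assumed from the discussion above, and the sketch only handles plain numeric version strings such as "3.6.5".

```python
def omw_package_for(nltk_version: str) -> str:
    """Pick an OMW data package name for a given NLTK version string.

    Hypothetical helper: assumes versions look like "3.6.5" or "3.7"
    (no pre-release suffixes) and that 3.7.0 is the cut-over release.
    """
    parts = [int(p) for p in nltk_version.split(".")[:3]]
    parts += [0] * (3 - len(parts))  # pad "3.7" to (3, 7, 0)
    return "omw-1.4" if tuple(parts) >= (3, 7, 0) else "omw"
```

Under this scheme, a project pinned to an older NLTK keeps resolving to the old data, which is the replicability property raised earlier in the thread.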
I believe this has now been solved both on
Tested on nltk==3.6.5 with python==3.8.12.
Chinese language wordnet data is now stored in corpora/omw/cow/wn-data-cmn.tab
But the code assumes that the folder name and the wordnet data suffix are the same: nltk/nltk/corpus/reader/wordnet.py, lines 1253 to 1254 in f50b6b1.
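The mismatch can be reproduced outside NLTK with a small sketch. This is not the code from wordnet.py: both function names are hypothetical, and the only layout assumed is the `<folder>/wn-data-<lang>.tab` pattern seen in the path above. The first function builds the path the way the description says the reader does, the second searches every folder for a matching file instead.

```python
from pathlib import Path
from typing import Optional


def naive_tab_path(omw_root: Path, lang: str) -> Path:
    """Assume the folder name equals the language code (the failing case)."""
    return omw_root / lang / f"wn-data-{lang}.tab"


def find_tab_path(omw_root: Path, lang: str) -> Optional[Path]:
    """Search all folders under omw_root for wn-data-<lang>.tab.

    Returns the first match in sorted order, or None if there is none.
    """
    matches = sorted(omw_root.glob(f"*/wn-data-{lang}.tab"))
    return matches[0] if matches else None
```

For a layout containing only cow/wn-data-cmn.tab, the naive path points at a non-existent cmn folder, while the search finds the file under cow.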
Now loading cmn data raises an error, e.g.,