Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support alternative Wordnet versions #2860

Merged
merged 7 commits into from Oct 26, 2021
Merged

Conversation

ekaf
Copy link
Contributor

@ekaf ekaf commented Oct 19, 2021

Solve issue #179 by allowing to load alternative Wordnet versions alongside the current Wordnet 3.0.

Currently, the multilingual OMW functions only work with Wordnet 3.0, so we need to raise a warning when loading other versions. For example, with the proposed new wordnet31 data package (nltk/nltk_data#165):

UserWarning: The multilingual functions are not available with this Wordnet version

Issue #2565 is solved in Wordnet 3.1:

import nltk
print("nltk v. %s" % nltk.version)

nltk v. 3.6.5

from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet31 as wn31
print("Wordnet %s: %s" % (wn.get_version(), wn.synset('restrain.v.01').hyponyms()))

Wordnet 3.0: [Synset('confine.v.03'), Synset('control.v.02'), Synset('hold.v.36'), Synset('inhibit.v.04')]

print("Wordnet %s: %s" % (wn31.get_version(), wn31.synset('restrain.v.01').hyponyms()))

Wordnet 3.1: [Synset('enchain.v.01'), Synset('fetter.v.01'), Synset('ground.v.02'), Synset('impound.v.02'), Synset('pen_up.v.01'), Synset('pinion.v.01'), Synset('pound.v.06'), Synset('tie_down.v.01')]

This PR also supports the forthcoming 2021 release of the Open English Wordnet (https://github.com/globalwordnet/english-wordnet).

@tomaarsen
Copy link
Member

Hello @ekaf
As you may have noticed, the tests are failing. This is because we've started using pre-commit, which ensures that all code in the codebase is formatted etc.
See this comment for more information on how to make your tests pass again: #2798 (comment).

Let me know if that works, and if you need any help.

@ekaf
Copy link
Contributor Author

ekaf commented Oct 19, 2021

Thanks @tomaarsen ! My problem was that I have upgraded to python-3.10, and needed to reinstall all the pip-requirements,
which still fails on python-crfsuite, so I had not got around to installing pre-commit. Now I have installed all the other requirements (except python-crfsuite), and run pre-commit install. It seems a little better, but many tests still fail.

@tomaarsen
Copy link
Member

@ekaf You're right. These test failures are not your fault.
They are a consequence of nltk/nltk_data@92f9667. This commit has updated the languages for which there are stopwords, which has messed up a doctest. I'll jump on this, hang tight.

@tomaarsen
Copy link
Member

tomaarsen commented Oct 19, 2021

I've skipped the test. It was too vulnerable to future changes to stopwords for my likings. Beyond that, I've merged develop into your branch for this PR. Hopefully the tests pass now. (Note, tests may also fail due to nltk_data caching, but we'll see)

nltk/corpus/__init__.py Outdated Show resolved Hide resolved
@tomaarsen
Copy link
Member

Thanks for the inclusion of the doctests! This is ready for merging as far as I'm concerned. I'll have to wait the tests out, and then I'll merge.

@tomaarsen
Copy link
Member

It seems get_version now fails for the regular wordnet on the CI, sadly.

@ekaf
Copy link
Contributor Author

ekaf commented Oct 25, 2021

It's strange... This test works with python but not with pytest. I have no clue why.

@ekaf
Copy link
Contributor Author

ekaf commented Oct 26, 2021

The failing test would succed when moved to the top of wordnet.doctest. The bug was in get_version, which did not rewind the data.adj file before searching for the version string.

@tomaarsen
Copy link
Member

Wonderful. I hope it wasn't an issue of some other function failing to revert the file position when it should have. I suspect this would have other unexpected consequences.
That said, I'm happy with this PR, and I'll be eagerly awaiting your future PRs for Wordnet 2021, both here and in nltk_data. Thank you for these changes!

@tomaarsen tomaarsen merged commit 00a5d80 into nltk:develop Oct 26, 2021
@ekaf ekaf deleted the alternative_wordnets branch October 26, 2021 18:01
@ekaf
Copy link
Contributor Author

ekaf commented Oct 26, 2021

Thanks @tomaarsen! I have the same suspicion as you: there is certainly something wrong, because it's not ok that other tests leave data.adj in a dirty state. There must be an issue, and I intend to look into it.

@tomaarsen
Copy link
Member

Happy to hear. Feel free to open an issue or PR when you've made some progress - it's out of the scope of this PR. Perhaps if I have some time myself I'll be able to search for an issue too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants