Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle wordnet synsets that were lost in mapping #2985

Merged
merged 5 commits into from
May 3, 2022

Conversation

ekaf
Copy link
Contributor

@ekaf ekaf commented Apr 18, 2022

This PR fixes #2984 so that, when a wordsense has been lost in a wordnet mapping, the remaining senses of the word can still be retrieved. For ex: instead of raising a fatal error, the following retrieves the French senses of 'perceptible' in WordNet 3.1:

from nltk.corpus import wordnet31 as wn
print(wn.synsets('perceptible', lang='fra'))

[Synset('detectable.s.02'), Synset('None'), Synset('perceptible.a.01')]

@ekaf
Copy link
Contributor Author

ekaf commented Apr 18, 2022

Test failure on Ubuntu only, seems unrelated with this PR:

NLTK was unable to find the prover9 file!
Use software specific configuration parameters or set the PROVER9 environment variable.

@ekaf
Copy link
Contributor Author

ekaf commented Apr 20, 2022

All tests passed after changing the key of the third-party cache in ci.yaml. This change should probably be undone, but I don't know if that would restore the bad cache.

@ekaf ekaf mentioned this pull request Apr 21, 2022
@stevenbird stevenbird self-assigned this Apr 29, 2022
@tomaarsen
Copy link
Member

Apologies for the mess with the CI tests... One of these days we'll manage to get a test suite working with some consistency. Truthfully, I'm not sure what went wrong with those failing tests. I haven't had the time to help you out with the tests or look at this work, sadly, so I'll leave this for Steven, or for later, but I wanted to mention that I appreciate the work you're putting into NLTK's WordNet.

@stevenbird stevenbird merged commit f6ffbc2 into nltk:develop May 3, 2022
@stevenbird
Copy link
Member

Thank you @ekaf. Sorry about our CI consistency issues!

@ekaf
Copy link
Contributor Author

ekaf commented May 3, 2022

Thanks @tomaarsen and @stevenbird! According to actions/cache#2 (comment), other projects see persistent corruptions of the CI cache, so it is not a specific NLTK issue.

But maybe the actions/cache docs include ideas that could be applied from the NLTK side. For ex. the keys used by NLTK in .github/workflows/ci.yaml include the variable ${{ secrets.CACHE_VERSION }}. The value of this variable is interpreted as void, because secrets are not passed publicly (which makes sense, since they would not be secrets otherwise), so NLTK keys might as well have nothing instead of this variable. Then it could be advantageous to use some other variable instead, for ex. the current Year/Month/Day; this would only be a slight improvement though, limiting eventual cache corruption to one day.

@ekaf ekaf deleted the lost_in_map branch May 3, 2022 07:10
@tomaarsen
Copy link
Member

tomaarsen commented May 3, 2022

I added the current ${{ secrets.CACHE_VERSION }} variable as a way to quickly modify the cache key, so that we can update the key on the fly, and restart any runs that were failed due some cache issue. This actually works quite well, but only on nltk/nltk: the secret is not propagated down to any forks, causing the approach to not work on any PRs.

This problem of not being able to easily update the cache on a PR is one that I've looked at before. I've tried an approach where if the last commit message contains no-cache, then the CI would not use the existing cache (and just make a new one instead), but I could not get that to work consistently either. The idea was that a user could perform git commit -m "Reset the cache, no-cache" --allow-empty to push an empty commit that reset the cache.

These are fixes of a consequence, while your solution could be the fix to the cause. Ensuring there is a new cache every e.g. week could prevent the cache from ever corrupting. That is just speculation, though.

@ekaf
Copy link
Contributor Author

ekaf commented May 3, 2022

@tomaarsen, according to the actions/cache/README.md:

A repository can have up to 10GB of caches. Once the 10GB limit is reached, older caches will be evicted based on when the cache was last accessed. Caches that are not accessed within the last week will also be evicted.

It could be nice, not to have to wait that long though, and the same page includes a recipe for automatically refreshing the cache each day. In NLTK's case, this would approximatively incur a cost of 2.5 Gbs*30 days.
This proposal is a workaround, rather than a solution, because it doesn't handle the root cause, namely the fact that PR runs can corrupt the default cache persistently. Ideally, a PR would just be able to read the cache, but not to write back into it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Senses lost in WordNet mappings
3 participants