Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Renamed omw to omw-1.4 #2907

Merged
merged 5 commits into from Dec 10, 2021
Merged

Renamed omw to omw-1.4 #2907

merged 5 commits into from Dec 10, 2021

Conversation

ekaf
Copy link
Contributor

@ekaf ekaf commented Dec 8, 2021

This PR selects omw-1.4 as the default package to use together with Wordnet. This implements the third point in @tomaarsen's plan from #2905 (comment), in order to ensure a smooth transition to the new OMW-1.4 data.

@ekaf
Copy link
Contributor Author

ekaf commented Dec 8, 2021

This PR assumes nltk/nltk_data#175. Otherwise, the tests fail because OMW-1.4 cannot be downloaded.

Copy link
Member

@tomaarsen tomaarsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

We should merge nltk/nltk_data#175 quickly, and then we can restart the tests for this PR, and get this merged too.

Thanks for jumping on this so quickly @ekaf

@ekaf
Copy link
Contributor Author

ekaf commented Dec 9, 2021

I compared the output of:

for lg in sorted(wn.langs()):
    print(f"{lg}: {len(list(wn.words(lang=lg)))} words in {len(list(wn.all_synsets(lang=lg)))} synsets")

using respectively the old nltk-3.5 (with old omw.zip) vs. nltk-3.6.5 and this PR (with new omw-1.4.zip):

NLTK v. 3.5, Wordnet v. 3.0

	NLTK v. 3.6.5, Wordnet v. 3.0
	
cmn: 61532 words in 42300 synsets
	cmn: 61533 words in 42300 synsets
fas: 17560 words in 17759 synsets
fra: 55350 words in 59091 synsets
	fra: 55351 words in 59091 synsets
	hrv: 29008 words in 23115 synsets
hrv: 29010 words in 23115 synsets
	isl: 11504 words in 4951 synsets
	ita_iwn: 19221 words in 15563 synsets
jpn: 89637 words in 57184 synsets
	jpn: 91964 words in 57184 synsets
	lit: 11395 words in 9462 synsets
por: 54069 words in 43895 synsets
	por: 54071 words in 43895 synsets
qcn: 3206 words in 4913 synsets
	ron: 49987 words in 56026 synsets
	slk: 29150 words in 18507 synsets
	slv: 40230 words in 42583 synsets
slv: 41032 words in 42717 synsets
tha: 80508 words in 73350 synsets
	tha: 82504 words in 73350 synsets

This shows that less than half the languages had a (usually small) increase in the number of lemmas, while the number of synsets stayed the same between the two OMW versions.

@tomaarsen
Copy link
Member

tomaarsen commented Dec 9, 2021

This is going to have cache issues again.
Luckily we can still just test this locally instead, and GitHub allows merging despite failing tests. That said, I'd rather fix the cache issues all-together, but that's an issue for another time.

I've tried only caching if the commit message doesn't contain \nocache, but that won't refresh the cache, it only ignores it for that one commit. I would rather have the opportunity to completely refresh it, but I haven't yet found a consistent way to do that.

@tomaarsen
Copy link
Member

I suspect the secrets trick only works on branches within nltk/nltk. Perhaps @ekaf you could go to https://github.com/ekaf/nltk > Settings > Secrets > Update "CACHE_VERSION" and fill in some arbitrary input. Once you've done that, you can @ me and I'll restart the test (or you can remove the blank line again in a commit, both work).

@ekaf
Copy link
Contributor Author

ekaf commented Dec 9, 2021

@tomaarsen: near the top of that page, it says "Secrets are not passed to workflows that are triggered by a pull request from a fork". So I suspect there will be a better chance if I renew the key in ci.yaml. We can always restore it afterwards.

@tomaarsen
Copy link
Member

tomaarsen commented Dec 9, 2021

@ekaf sounds good. I suppose that makes sense - if PR's were passed secrets, then someone could run a github actions workflow in a PR that just prints the secret in plaintext, which sounds like a giant vulnerability when secrets are used as authentication etc.

@tomaarsen
Copy link
Member

@ekaf omw-1.4 is not mentioned in index.xml, so it can't be downloaded. It's also not included in the all collection. I believe that's why there are issues.

@ekaf
Copy link
Contributor Author

ekaf commented Dec 9, 2021

@tomaarsen, yes, that sounds likely to be the reason. The same applies to wordnet2021, then. I am to going add .xml files for them, run "make pkg_index" and submit a PR in nltk_data.

@tomaarsen
Copy link
Member

@tomaarsen, yes, that sounds likely to be the reason. The same applies to wordnet2021, then. I am to going add .xml files for them, run "make pkg_index" and submit a PR in nltk_data.

@ekaf I'm already on it. I'll tag you in a PR in a few minutes, and you can review it if possible.

@tomaarsen
Copy link
Member

The Ubuntu cache was updated, but the Windows one was not, hence the fail here.
That said, I happen to be able to test Windows locally, and everything passes there (while develop should no longer pass, due to the recent omw changes in nltk_data).

With other words, I have no issues merging this as it stands. On develop I can influence the CACHE_VERSION secret to ensure a new cache.

Thank you for your work in getting us to OMW 1.4!

@tomaarsen tomaarsen merged commit d6d513d into nltk:develop Dec 10, 2021
melonhead901 added a commit to melonhead901/nltk_data-feedstock that referenced this pull request Apr 8, 2022
We really need a new version to pick up changes introduced in nltk/nltk#2907
@ekaf ekaf deleted the omw14_compat branch December 8, 2022 10:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants