Upgrade omw to 1.4 #171
It's probably not the best idea to just package the
The wordnets included in OMW 1.4 that aren't currently in the NLTK are:
It might be better to hand select the subdirectories of
Thanks @goodmami! I completely agree about removing unnecessary subdirectories from the omw package. However, these directories are part of the recent release 1.4 of omw-data (https://github.com/omwn/omw-data), so they were deemed good-to-go by the upstream data provider. Maybe your concerns could be more effectively addressed there.

Looking at the specific details, the four languages mentioned as not included in this package (isl, iwn, slk and rom) are indeed there, as shown in the NLTK output below. It appears that "rom" is called "ron" in OMW-data, and if this is a problem, it should be addressed upstream. Assuming the updated OMW functions from the wordnet.py PR nltk/nltk#2899:

```python
import nltk
from nltk.corpus import wordnet31 as wn
print(f"Wordnet v. {wn.get_version()}\n")
```

    Wordnet v. 3.1

```python
ss = wn.synset('dog.n.01')
for lg in ["isl", "ita_iwn", "ron", "slk"]:
    print(f"{lg} lemmas:{ss.lemmas(lang=lg)}")
```

    isl lemmas:[Lemma('dog.n.01.rakki'), Lemma('dog.n.01.seppi'), Lemma('dog.n.01.hundur'), Lemma('dog.n.01.hvutti')]
    ita_iwn lemmas:[Lemma('dog.n.01.cane')]
    ron lemmas:[Lemma('dog.n.01.câine')]
    slk lemmas:[Lemma('dog.n.01.pes'), Lemma('dog.n.01.pes_domáci_(Canis_familiaris)')]

I agree that the licensing terms of the Farsi Wordnet could be clearer:

```python
print(wn.license(lang='fas'))
```

    mortaza montazery ***@***.*** 9 Jan
    to Francis
    This version of Persian WordNet is the result of my research on
    automatic construction of WordNet. It's free and you can use it
    everywhere you want.

But here again, upstream has found these terms ok. Otherwise, you could consider having this language removed upstream. Concerning the 'cwn/', it contains the same open-source license as PWN, so where is the problem? The cldr/ and wikt/ subdirectories do not contain any wn-data-lang.tab file, so they are just ignored by the wordnet module. However, especially the wordnets extracted from Wikipedia seem very interesting, and ok to include as experimental data for the more curious users.

In conclusion, it is indeed possible to manually remove some subdirectories, but these do no harm, as they are correctly handled (or ignored) by the wordnet.py PR nltk/nltk#2899. I propose to eventually remove "eng", "en30", "en31", "pwn31" (and maybe "fas"), but to keep "cwn", "cldr" and "wikt".
Hi,
I think @goodmami and I are equally upstream in this case.
On Thu, Dec 2, 2021 at 3:18 PM Eric Kafe ***@***.***> wrote:
> Thanks @goodmami! I completely agree about removing unnecessary subdirectories from the omw package. However, these directories are part of the recent release 1.4 of omw-data (https://github.com/omwn/omw-data), so they were deemed good-to-go by the upstream data provider. Maybe your concerns could be more effectively addressed there.
The tab files in question are kept in the repository in case people want to access them for historical purposes, but they are not recommended for use, and are not released as LMF:
https://github.com/omwn/omw-data/releases/tag/v1.4
> Looking at the specific details, the four languages mentioned as not included in this package (isl, iwn, slk and rom) are indeed there, as shown in the NLTK output below. It appears that "rom" is called "ron" in OMW-data, and if this is a problem, it should be addressed upstream. Assuming the updated OMW functions from the wordnet.py PR nltk/nltk#2899:
> ```python
> import nltk
> from nltk.corpus import wordnet31 as wn
> print(f"Wordnet v. {wn.get_version()}\n")
> ```
>
>     Wordnet v. 3.1
>
> ```python
> print(wn.langs())
> ```
>
>     dict_keys(['eng', 'als', 'arb', 'bul', 'cmn', 'qcn', 'dan', 'ell', 'eng_eng', 'fas', 'fin', 'fra', 'heb', 'hrv', 'isl', 'ita', 'ita_iwn', 'jpn', 'cat', 'eus', 'glg', 'spa', 'ind', 'zsm', 'nld', 'nno', 'nob', 'pol', 'por', 'ron', 'lit', 'slk', 'slv', 'swe', 'tha'])
>
> ```python
> ss = wn.synset('dog.n.01')
> for lg in ["isl", "ita_iwn", "ron", "slk"]:
>     print(f"{lg} lemmas:{ss.lemmas(lang=lg)}")
> ```
>
>     isl lemmas:[Lemma('dog.n.01.rakki'), Lemma('dog.n.01.seppi'), Lemma('dog.n.01.hundur'), Lemma('dog.n.01.hvutti')]
>     ita_iwn lemmas:[Lemma('dog.n.01.cane')]
>     ron lemmas:[Lemma('dog.n.01.câine')]
>     slk lemmas:[Lemma('dog.n.01.pes'), Lemma('dog.n.01.pes_domáci_(Canis_familiaris)')]
> I agree that the licensing terms of the Farsi Wordnet could be clearer:
> ```python
> print(wn.license(lang='fas'))
> ```
>
>     mortaza montazery ***@***.*** 9 Jan
>     to Francis
>     This version of Persian WordNet is the result of my research on
>     automatic construction of WordNet. It's free and you can use it
>     everywhere you want.
> But here again, upstream has found these terms ok. Otherwise, you could consider having this language removed upstream.
>
> Concerning the 'cwn/', it contains the same open-source license as PWN, so where is the problem?
Native speakers of Farsi and Mandarin have pointed out that these two resources have some quality issues, in addition to the licensing issues for Farsi, which is why we have removed them from the recommended release. Please do not export them to NLTK.
> The cldr/ and wikt/ subdirectories do not contain any wn-data-lang.tab file, so they are just ignored by the wordnet module. However, especially the wordnets extracted from Wikipedia seem very interesting, and ok to include as experimental data for the more curious users.
I am fine with that.
> In conclusion, it is indeed possible to manually remove some subdirectories, but these do no harm, as they are correctly handled (or ignored) by the wordnet.py PR nltk/nltk#2899.
>
> I propose to eventually remove "eng", "en30", "en31", "pwn31" (and maybe "fas"), but to keep "cwn", "cldr" and "wikt".
pwn31 has licensing issues so should be removed, and eng, en30 and en31 are
pointless to include. 'fas' and 'cwn' should be removed because of the
quality issues.
Yours,
--
Francis Bond <https://fcbond.github.io/>
Division of Linguistics and Multilingual Studies
Nanyang Technological University
When Francis and I created the release on GitHub, GitHub automatically included the full repository source as
Sorry, my mistake regarding

```python
>>> from nltk.corpus import wordnet as wn
>>> wn.langs()
['eng', 'als', 'arb', 'bul', 'cat', 'cmn', 'dan', 'ell', 'eus', 'fas', 'fin', 'fra', 'glg', 'heb', 'hrv', 'ind', 'ita', 'jpn', 'nld', 'nno', 'nob', 'pol', 'por', 'qcn', 'slv', 'spa', 'swe', 'tha', 'zsm']
>>> wn.synsets('plantă', lang='ron')
Traceback (most recent call last):
...
nltk.corpus.reader.wordnet.WordNetError: Language is not supported.
```

I also downloaded the
Ok, but if they are just ignored, then what's the benefit? It will increase the disk/bandwidth requirements for
@goodmami, there are tab files in cldr and wikt, but wordnet.py ignores them because they are named with a different pattern than `wn-data-*.tab`. The wiktionary data seems interesting, @fcbond is "fine with that", and 26 MB is not an overwhelming amount nowadays. But I agree that removing these files would yield a leaner package. So, if you prefer, I could remove cldr and wikt from the proposed nltk_data package, in addition to the following:
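As a side note on the filename-pattern point above, the filtering behaviour being discussed can be sketched in a few lines. This is an illustrative stand-in, not the actual wordnet.py code, and the sample file names are invented:

```python
# Sketch of the filtering described above: the NLTK wordnet reader only
# loads tab files whose names match wn-data-*.tab, so tab files named
# with a different pattern (as in cldr/ and wikt/) are silently ignored.
# Hypothetical helper and invented sample names, for illustration only.
from fnmatch import fnmatch

def loaded_tab_files(filenames):
    """Return only the names such a reader would pick up."""
    return [f for f in filenames if fnmatch(f, "wn-data-*.tab")]

sample = [
    "wn-data-isl.tab",   # matches the pattern: would be loaded
    "wn-data-ron.tab",   # matches: would be loaded
    "wn-cldr-af.tab",    # different prefix: ignored
    "wn-wikt-als.tab",   # different prefix: ignored
    "citation.bib",      # not a tab file: ignored
]

print(loaded_tab_files(sample))  # ['wn-data-isl.tab', 'wn-data-ron.tab']
```

Renaming (or teaching the reader a second pattern) would be enough to make the cldr and wikt files usable, which is the trade-off discussed below.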
Ok thanks, I see the issue now.
The extra bits wouldn't bother me as much if they were usable by the NLTK. I think the filename issue is easily resolved with some changes to the NLTK code, but otherwise I would prefer to exclude the data if they are simply dead weight in the package. A second issue is that the Wiktionary and CLDR files are not in fact part of the OMW 1.4 release (they were not packaged as WN-LMF files). It might make sense to put them in a separate package, e.g.,
Thanks @goodmami and @fcbond! I have updated the package according to all your suggestions. Here are new statistics, using OEWN 2021 and the updated wordnet.py (nltk/nltk#2899):

    Wordnet v. 2021
    als: 5988 words
This allows interesting comparisons with the Wn module. Having two different modules that accomplish the same tasks could make verification and debugging easier.
Thanks, @ekaf, that looks better. A couple other things I noticed, in comparison to the current package distributed by the NLTK:
@goodmami it sounds like you are looking at the old omw package instead of the new one proposed in this PR.
I was comparing the zip file in your fork (https://raw.githubusercontent.com/ekaf/nltk_data/omw14/packages/corpora/omw.zip) to what I currently have in

```shell
$ ls omw14/arb/
arb2tab.py  awnCopyright.pdf  citation.bib  LICENSE  README  wn-data-arb.tab  wn-nodia-arb.tab
$ ls nltk_data/corpora/omw/arb/
citation.bib  LICENSE  README  wn-data-arb.tab
```

Please note that the

I'm ambivalent about splitting out the directories with multiple

```shell
$ ls omw14/mcr/
citation.bib  LICENSE  mcr2tab.py  wn-data-cat.tab  wn-data-eus.tab  wn-data-glg.tab  wn-data-spa.tab
$ ls nltk_data/corpora/omw/cat/
citation.bib  LICENSE  wn-data-cat.tab
```
@goodmami, I don't understand why you mention splitting up the mcr folder. This was done previously, but it is not split in the new omw14 package, since this is just identical with the 'wns' folder in the original 1.4 release of OMW-data (https://github.com/omwn/omw-data/archive/refs/tags/v1.4.zip), except that the agreed problematic subdirectories have been removed. As you write, "wns" is the source directory, so it is just what we need for an open-source project. The intent of this PR is not to duplicate the LMF release, but to give users of NLTK's current omw package access to the updated OMW data.

Concerning the two supplementary files in the 'arb' folder, they really are no big deal, and might interest some users. Otherwise, why did you include them in your OMW 1.4 package? Maybe you could consider releasing an upstream package that better conforms with your wishes.
G'day,

The scripts for making the wordnets are there for people who want to make more (or extract more), not the average nltk user. I don't think they should be packaged up as data. People who want to see them should be downloading the whole

Here is my (out of date) script for packaging the NLTK release, which just copies the files I consider relevant and does a little clean-up of the data. It does not expect two wordnets for the same language, so it loses an Italian wordnet, ... I stopped maintaining it as I think my time is better spent getting the new version of OMW working, but I agree it is good to keep the current version as up-to-date as possible, so thanks!

The URL for the project should change to https://omwn.org/, as compling will be retired soon. I plan to start making the new site useful from next week.
It seems there is still some confusion about what the 1.4 release is. As I mentioned above, the
> Maybe you could consider releasing an upstream package that better conforms with your wishes.

We have. It contains the WN-LMF files 😉 https://github.com/omwn/omw-data/releases/download/v1.4/omw-1.4.tar.xz

Again, thanks for your efforts to update the OMW data in the NLTK. Hopefully it's now clearer which files we consider user-facing and which are just meant for OMW maintainers. It sounds like Francis and I should update the omw-data project README as well.
Thanks for your packaging script @fcbond: it was very useful as a starting point. However, it doesn't handle multiple wordnets for the same language. For this, it is necessary to differentiate between language and project names, so the new directory structure is more future-proof. Also, some wordnets now include definition and example triples. So I rebuilt omw.zip using an adapted script. Thanks for your patience @goodmami!

Edited: corrected a bug in the packaging script that duplicated the first line in each wordnet. Here is the new script:
Great, glad my script helped @ekaf |
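For readers unfamiliar with the "definition and example triples" mentioned above: OMW tab files pair a synset key with a `lang:type` column and a value, where the type is typically `lemma` and, in some newer wordnets, also `def` or `exmp`. A minimal parsing sketch under that assumed layout; the sample lines are invented for illustration, not taken from a real release file:

```python
# Minimal sketch of reading OMW-style tab triples, assuming the common
# three-column layout "synset<TAB>lang:type<TAB>value". The sample
# lines below are invented, not real release data.
from collections import defaultdict

sample = (
    "02084071-n\tron:lemma\tcâine\n"
    "02084071-n\tron:def\tanimal domestic\n"
    "02084071-n\tron:exmp\tcâinele latră\n"
)

def parse_triples(text):
    """Group values by (synset, type), skipping comment lines."""
    triples = defaultdict(list)
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # header/license comment lines start with '#'
        synset, key, value = line.split("\t")
        lang, typ = key.split(":", 1)
        triples[(synset, typ)].append(value)
    return triples

t = parse_triples(sample)
print(t[("02084071-n", "lemma")])  # ['câine']
print(t[("02084071-n", "def")])    # ['animal domestic']
```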
@stevenbird, I also think it looks ok now |
Yes I think it's ready to go.
On Mon, 6 Dec 2021, 7:12 pm Eric Kafe, ***@***.***> wrote:

> @stevenbird, I also think it looks ok now
I inspected the latest

Thanks, all!
This new version of the omw package contains the multilingual wordnets from the wns folder in the new OMW 1.4 release, retrieved from https://github.com/omwn/omw-data/archive/refs/tags/v1.4.zip
To take advantage of the new features, this package needs upgraded OMW-related functions (nltk/nltk#2899) in NLTK's wordnet.py module.