
Upgrade omw to 1.4 #171

Merged: 5 commits into nltk:gh-pages on Dec 7, 2021

Conversation

@ekaf (Contributor) commented Nov 29, 2021

This new version of the omw package contains the multilingual wordnets from the wns folder in the new OMW 1.4 release, retrieved from https://github.com/omwn/omw-data/archive/refs/tags/v1.4.zip

To take advantage of the new features, this package needs upgraded OMW-related functions (nltk/nltk#2899) in NLTK's wordnet.py module.
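For a quick smoke test of the upgraded data, assuming that PR is applied, something like the following should work (this mirrors the fuller examples later in the thread; the Icelandic lookup is just illustrative):

from nltk.corpus import wordnet as wn
print(sorted(wn.langs()))                              # should now include the OMW 1.4 languages
print(wn.synset('dog.n.01').lemma_names(lang='isl'))   # e.g. Icelandic lemmas for dog.n.01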

@ekaf ekaf mentioned this pull request Nov 29, 2021
@goodmami commented Dec 2, 2021

It's probably not the best idea to just package the wns/ subdirectory from OMW. Specifically:

  • eng is just a tab-formatted export of WordNet 3.0 and should not be included
  • en30, pwn30, and en31 should not be included; in any case, only their README, LICENSE, and citation.bib files are packaged here (and pwn30 was supposed to be removed entirely from the omw-data repository)
  • The cldr/ and wikt/ subdirectories are included here, but are you proposing to add those dozens of new languages to the NLTK? These are not even included in OMW 1.4's release.
  • fas/ (Farsi) and cwn/ (Mandarin with Traditional characters; language code qcn) are not included in OMW 1.4's release due to licensing issues, I think. However, both were included in the previous version of OMW and are currently in the NLTK. I'm not sure what should be done here. @fcbond?

The wordnets included in OMW 1.4 that aren't currently in the NLTK are:

  • isl (Icelandic)
  • iwn (alternative Italian)
  • slk (Slovak; also lit, Lithuanian, in the same subdirectory)
  • rom (Romanian)

It might be better to hand select the subdirectories of wns/ that should be included in the NLTK.

@ekaf (Contributor, Author) commented Dec 2, 2021

Thanks @goodmami! I completely agree about removing unnecessary subdirectories from the omw package. However, these directories are part of the recent release 1.4 of omw-data (https://github.com/omwn/omw-data), so they were deemed good-to-go by the upstream data provider. Maybe your concerns could be more effectively addressed there.

Looking at the specific details, the four languages mentioned as not included in this package (isl, iwn, slk and rom) are indeed there, as shown in the NLTK output below. It appears that "rom" is called "ron" in OMW-data, and if this is a problem, it should be addressed upstream.

Assuming the updated OMW functions from the wordnet.py PR nltk/nltk#2899:

import nltk
from nltk.corpus import wordnet31 as wn
print(f"Wordnet v. {wn.get_version()}\n")

Wordnet v. 3.1

print(wn.langs())
dict_keys(['eng', 'als', 'arb', 'bul', 'cmn', 'qcn', 'dan', 'ell', 'eng_eng', 'fas', 'fin', 'fra', 'heb', 'hrv', 'isl', 'ita', 'ita_iwn', 'jpn', 'cat', 'eus', 'glg', 'spa', 'ind', 'zsm', 'nld', 'nno', 'nob', 'pol', 'por', 'ron', 'lit', 'slk', 'slv', 'swe', 'tha'])

ss = wn.synset('dog.n.01')
for lg in ["isl","ita_iwn","ron","slk"]:
    print(f"{lg} lemmas:{ss.lemmas(lang=lg)}")

isl lemmas:[Lemma('dog.n.01.rakki'), Lemma('dog.n.01.seppi'), Lemma('dog.n.01.hundur'), Lemma('dog.n.01.hvutti')]
ita_iwn lemmas:[Lemma('dog.n.01.cane')]
slk lemmas:[Lemma('dog.n.01.pes'), Lemma('dog.n.01.pes_domáci_(Canis_familiaris)')]
ron lemmas:[Lemma('dog.n.01.câine')]

I agree that the licensing terms of the Farsi Wordnet could be clearer:
print(wn.license(lang='fas'))

mortaza montazery mortaza.gh@gmail.com 9 Jan
to Francis

This version of Persian WordNet is the result of my research on
automatic construction of WordNet. It's free and you can use it
everywhere you want.

But here again, upstream found these terms acceptable. Otherwise, you could consider having this language removed upstream.

Concerning cwn/: it contains the same open-source license as PWN, so where is the problem?

The cldr/ and wikt/ subdirectories do not contain any wn-data-lang.tab file, so they are just ignored by the wordnet module. However, the wordnets extracted from Wiktionary in particular seem very interesting, and fine to include as experimental data for the more curious users.

In conclusion, it is indeed possible to manually remove some subdirectories, but these do no harm, as they are correctly handled (or ignored) by the wordnet.py PR nltk/nltk#2899.

I propose to eventually remove "eng", "en30", "en31", "pwn31" (and maybe "fas"), but to keep "cwn", "cldr" and "wikt".

@fcbond (Contributor) commented Dec 2, 2021 via email

@goodmami commented Dec 2, 2021

However, these directories are part of the recent release 1.4 of omw-data (https://github.com/omwn/omw-data), so they were deemed good-to-go by the upstream data provider.

When Francis and I created the release on GitHub, GitHub automatically included the full repository source as .zip and .tar.gz files attached to the release (I'm not sure how to prevent these files from being created). Please don't take those files to be the contents of the release. Instead look at the wordnets specifically listed here: https://github.com/omwn/omw-data/releases/tag/v1.4.

Looking at the specific details, the four languages mentioned as not included in this package (isl, iwn, slk and rom) are indeed there, as shown in the NLTK output below. It appears that "rom" is called "ron" in OMW-data, and if this is a problem, it should be addressed upstream.

Sorry, my mistake regarding rom; ron is correct. Regarding the presence of these languages, perhaps you are using your development version of nltk_data to inspect this, because I don't see them:

>>> from nltk.corpus import wordnet as wn
>>> wn.langs()
['eng', 'als', 'arb', 'bul', 'cat', 'cmn', 'dan', 'ell', 'eus', 'fas', 'fin', 'fra', 'glg', 'heb', 'hrv', 'ind', 'ita', 'jpn', 'nld', 'nno', 'nob', 'pol', 'por', 'qcn', 'slv', 'spa', 'swe', 'tha', 'zsm']
>>> wn.synsets('plantă', lang='ron')
Traceback (most recent call last):
...
nltk.corpus.reader.wordnet.WordNetError: Language is not supported.

I also downloaded the omw.zip file from this repository and I don't see the four languages in the archive.

The cldr/ and wikt/ subdirectories do not contain any wn-data-lang.tab
file, so they are just ignored by the wordnet module. However, the
wordnets extracted from Wiktionary in particular seem very interesting,
and fine to include as experimental data for the more curious users.

I am fine with that.

Ok, but if they are just ignored, then what's the benefit? It will increase the disk/bandwidth requirements for omw.zip and users won't be able to see the data unless they dig around their nltk_data folder. Why not just direct those curious users to the omw-data repository? However, I do see .tab files under cldr/ and wikt/, so I'm a bit confused by your statement that there are none.

@ekaf (Contributor, Author) commented Dec 2, 2021

@goodmami, there are tab files in cldr and wikt, but wordnet.py ignores them because they are named with a different pattern than wn-data-*.tab. The Wiktionary data seems interesting, @fcbond is "fine with that", and 26 MB is not an overwhelming amount nowadays. But I agree that removing these files would yield a leaner package. So, if you prefer, I could remove cldr and wikt from the proposed nltk_data package, in addition to the following:

pwn31 has licensing issues and should be removed; eng, en30 and en31 are
pointless to include. 'fas' and 'cwn' should be removed because of
quality issues.
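For illustration, a minimal sketch of the wn-data-*.tab filename point above, assuming a glob-based loader; the actual wordnet.py logic may differ, and the wikt file name in the comment below is a hypothetical placeholder:

from glob import glob

def provided_languages(omw_root):
    # Collect language codes only from files matching the wn-data-*.tab pattern.
    langs = []
    for path in glob(f"{omw_root}/*/wn-data-*.tab"):
        filename = path.rsplit("/", 1)[1]             # e.g. "wn-data-isl.tab"
        langs.append(filename[len("wn-data-"):-len(".tab")])
    return langs

# A file such as omw/wikt/wikt-fra.tab never matches "wn-data-*.tab",
# so its language never shows up in the result and the data is ignored.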

@goodmami commented Dec 2, 2021

there are tab files in cldr and wikt, but wordnet.py ignores them because they are named with a different pattern than wn-data-*.tab

Ok thanks, I see the issue now.

So, if you prefer, I could remove cldr and wikt from the proposed nltk_data package, in addition to the following: ...

The extra bits wouldn't bother me as much if they were usable by the NLTK. I think the filename issue is easily resolved with some changes to the NLTK code, but otherwise I would prefer to exclude the data if they are simply dead weight in the package. A second issue is that the Wiktionary and CLDR files are not in fact part of the OMW 1.4 release (they were not packaged as WN-LMF files). It might make sense to put them in a separate package, e.g., omw-extra. But I'll leave the issue of their inclusion up to you and @fcbond.

@ekaf (Contributor, Author) commented Dec 3, 2021

Thanks @goodmami and @fcbond! I have updated the package according to all your suggestions.

Here are new statistics, using OEWN 2021 and the updated wordnet.py (nltk/nltk#2899):

import nltk
from nltk.corpus import wordnet2021 as wn
print(f"Wordnet v. {wn.get_version()}\n")

Wordnet v. 2021

for lg in sorted(wn.langs()):
    print(f"{lg}: {len(list(wn.words(lang=lg)))} words")

als: 5988 words
arb: 17785 words
bul: 6720 words
cat: 46531 words
cmn: 61533 words
dan: 4468 words
ell: 18225 words
eng: 151770 words
eus: 26240 words
fin: 129839 words
fra: 55351 words
glg: 23124 words
heb: 5325 words
hrv: 29008 words
ind: 36954 words
isl: 11504 words
ita: 41855 words
ita_iwn: 19221 words
jpn: 91964 words
lit: 11395 words
nld: 43077 words
nno: 3387 words
nob: 4186 words
pol: 45387 words
por: 54071 words
ron: 49987 words
slk: 29150 words
slv: 40230 words
spa: 36681 words
swe: 5824 words
tha: 82504 words
zsm: 33932 words

@ekaf ekaf closed this Dec 3, 2021
@ekaf (Contributor, Author) commented Dec 3, 2021

This allows interesting comparisons with the Wn module. Having two different modules that accomplish the same tasks could make verification and debugging easier.
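A rough sketch of such a cross-check, assuming goodmami's standalone wn package (pip install wn) and that its download index provides an "omw:1.4" identifier; both assumptions are outside this PR:

import wn

wn.download("omw:1.4")                # fetch the OMW 1.4 collection
ron = wn.Wordnet(lang="ron")          # Romanian wordnet from that collection
for word in ron.words("câine"):
    print(word.lemma(), [ss.id for ss in word.synsets()])

# The output can then be compared with NLTK's
# wn.synset('dog.n.01').lemmas(lang='ron') shown earlier.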

@ekaf ekaf reopened this Dec 3, 2021
@goodmami commented Dec 3, 2021

Thanks, @ekaf, that looks better. A couple other things I noticed, in comparison to the current package distributed by the NLTK:

  • The NLTK does not distribute the Python and Perl scripts, log files, etc. originally used in creating the tab files. Rather than excluding certain files, it might be easier to only allow certain files, namely README, LICENSE, citation.bib, and wn-data-....tab (see the sketch after this list).
  • The NLTK currently has mcr split into separate directories (cat, eus, glg, spa), and similarly for msa (ind, zsm). It looks like the code is able to find the appropriate files regardless, but maybe you want to keep things consistent? In that case, the new slk folder (with lit and slk) should be treated the same.
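A minimal sketch of that allowlist approach, assuming a wns/-style source tree and a target package directory; the paths in the usage comment are illustrative:

import re
import shutil
from pathlib import Path

ALLOWED = {"README", "LICENSE", "citation.bib"}
TAB_PATTERN = re.compile(r"wn-data-\w+\.tab$")

def copy_allowed(src: Path, dst: Path) -> None:
    # Copy only the user-facing files; scripts, logs, etc. are skipped.
    for path in src.rglob("*"):
        if path.is_file() and (path.name in ALLOWED or TAB_PATTERN.match(path.name)):
            target = dst / path.relative_to(src)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)

# copy_allowed(Path("wns"), Path("release/nltk/omw"))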

@ekaf (Contributor, Author) commented Dec 3, 2021

@goodmami it sounds like you are looking at the old omw package instead of the new one proposed in this PR.

@goodmami commented Dec 3, 2021

I was comparing the zip file in your fork (https://raw.githubusercontent.com/ekaf/nltk_data/omw14/packages/corpora/omw.zip) to what I currently have in ~/nltk_data/corpora/omw/. For example (omw14 below is from the omw.zip in your fork):

$ ls omw14/arb/
arb2tab.py  awnCopyright.pdf  citation.bib  LICENSE  README  wn-data-arb.tab  wn-nodia-arb.tab
$ ls nltk_data/corpora/omw/arb/
citation.bib  LICENSE  README  wn-data-arb.tab

Please note that the wns/ directory in the omw-data repository is not a ready-to-go dataset, but rather it contains the source files used in creating the OMW 1.4 release (which for us is the XML files), as well as the historical artifacts of the creation of those tab files. My suggestion is to not include anything but citation.bib, LICENSE, README, and wn-data-....tab, as the other files are irrelevant for NLTK users and were not included in the initial and current NLTK package.

I'm ambivalent about splitting out the directories with multiple .tab files, like mcr. They were split previously, but it's not a big deal to remain together if they can still be loaded in the NLTK. If they are split, the metadata (LICENSE, README, citation.bib) should be copied to each subdirectory, as is done currently:

$ ls omw14/mcr/
citation.bib  LICENSE  mcr2tab.py  wn-data-cat.tab  wn-data-eus.tab  wn-data-glg.tab  wn-data-spa.tab
$ ls nltk_data/corpora/omw/cat/
citation.bib  LICENSE  wn-data-cat.tab
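A sketch of that split-and-copy step, under the layout shown in the listings above (the helper name and paths are illustrative):

import shutil
from pathlib import Path

def split_project(project_dir: Path, out_root: Path) -> None:
    # Give each wn-data-*.tab its own language directory, duplicating the metadata.
    metadata = [name for name in ("citation.bib", "LICENSE", "README")
                if (project_dir / name).exists()]
    for tab in project_dir.glob("wn-data-*.tab"):
        lang = tab.stem[len("wn-data-"):]        # e.g. "cat" from wn-data-cat.tab
        lang_dir = out_root / lang
        lang_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(tab, lang_dir / tab.name)
        for name in metadata:
            shutil.copy2(project_dir / name, lang_dir / name)

# split_project(Path("wns/mcr"), Path("omw"))   # yields cat/, eus/, glg/, spa/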

@ekaf (Contributor, Author) commented Dec 4, 2021

@goodmami, I don't understand why you mention splitting up the mcr folder. This was done previously, but it is not split in the new omw14 package, since the package is identical to the 'wns' folder in the original 1.4 release of OMW-data (https://github.com/omwn/omw-data/archive/refs/tags/v1.4.zip), except that the agreed problematic subdirectories have been removed.

As you write, "wns" is the source directory, so it is just what we need for an open-source project. The intent of this PR is not to duplicate the LMF release, but to give users of NLTK's current omw package access to the updated OMW data.

Concerning the two supplementary files in the 'arb' folder, they really are no big deal, and might interest some users. Otherwise, why did you include them in your OMW 1.4 package? Maybe you could consider releasing an upstream package that better conforms with your wishes.

@fcbond (Contributor) commented Dec 4, 2021

G'day,

The scripts for making the wordnets are there for people who want to make more (or extract more), not the average nltk user. I don't think they should be packaged up as data. People who want to see them should be downloading the whole omw-data repository.

Here is my (out of date) script for packaging the NLTK release which just copies the files I consider relevant, and does a little clean up of the data. It does not expect two wordnets for the same language, so loses an Italian wordnet, ...

I stopped maintaining it, as I think my time is better spent getting the new version of OMW working, but I agree it is good to keep the current version as up-to-date as possible, so thanks!

The URL for the project should change to https://omwn.org/, as compling will be retired soon. I plan to start making the new site useful from next week.

#! /bin/bash

RELEASE=release/nltk
OMW="$RELEASE"/omw

echo Copying files to nltkdata.zip
rm -rf "$RELEASE"

mkdir -p "$OMW"
cp wns/citation.bib "$OMW"/. 
#cp www/LICENSE "$OMW"/.
cat wns/eng/citation.bib >> "$OMW"/citation.bib

for f in `ls wns/*/wn-data*.tab` 
do
    drsn=`dirname $f`
    proj=`basename $drsn`
    file=`basename $f .tab`
     ### EXCLUDING WIKT, CLDR and English Data
    if [[ $proj != wikt &&   $proj != cldr && $proj != eng ]]
    then
	### strip the "wn-data-" prefix to leave the language code
	lang=${file:8}
	#echo $proj $lang $file
	### copy stuff
	echo adding ${lang} from ${proj} to zip
	echo making dir
 	mkdir -p "$OMW"/${lang}
	echo copy data
	### copy and keep only first line
	head -1 ${f} > "$OMW"/${lang}/${file}.tab
	### copy lemmas, sort and keep only unique
	grep -P "^[^\t]+\t[^\t]*lemma\t[^\t]+$" ${f}  | sort -u  \
	    >> "$OMW"/${lang}/${file}.tab
	cp wns/$proj/LICENSE "$OMW"/${lang}/.
	cp wns/$proj/citation.bib  "$OMW"/${lang}/.
	cat wns/$proj/citation.bib  >> "$OMW"/citation.bib
	### some projects name the file "Readme", so try both spellings
	cp wns/$proj/README "$OMW"/${lang}/README
	cp wns/$proj/Readme "$OMW"/${lang}/README
    fi
done

wns=`ls "$OMW"/*/wn-data*.tab | cut -d/  -f 3`
numwns=`ls "$OMW"/*/wn-data*.tab | cut -d/  -f 3 | wc -l`
today=`date --rfc-3339=date`
cat <<EOT >> "$OMW"/README
This data is formatted by the Open Multilingual Wordnet Project

to be used by NLTK.

Please cite us if you find the aggregation useful (see citation.bib)
and email us if you have any suggestions.

Francis Bond (bond@ieee.org)
${today}


EOT
echo "${numwns} languages covered (and we assume you have English):" >> "$OMW"/README
echo  "${wns} "  >> "$OMW"/README
echo  >> "$OMW"/README

pushd "$RELEASE"
zip -r9 omw.zip omw/*
#rm -rf omw
popd

echo "omw.zip created in $RELEASE"

@goodmami commented Dec 4, 2021

...since this is just identical with the 'wns' folder in the original 1.4 release of OMW-data (https://github.com/omwn/omw-data/archive/refs/tags/v1.4.zip)...

It seems there is still some confusion about what the 1.4 release is. As I mentioned above, the .zip file you're looking at is not the release. It is simply an archive, automatically created by GitHub, of the state of the repository when we created the release. I would remove that file from the release page if I could. Since I cannot, I have now updated the text on the release page to make this more clear.

Maybe you could consider releasing an upstream package that better conforms with your wishes.

We have. It contains the WN-LMF files 😉 https://github.com/omwn/omw-data/releases/download/v1.4/omw-1.4.tar.xz

Again, thanks for your efforts to update the OMW data in the NLTK. Hopefully it's now more clear which files we consider user-facing and which are just meant for OMW maintainers. It sounds like Francis and I should update the omw-data project README, as well.

@ekaf (Contributor, Author) commented Dec 4, 2021

Thanks for your packaging script @fcbond: it was very useful as a starting point. However, it doesn't handle multiple wordnets for the same language. For this it is necessary to differentiate between language and project names, so the new directory structure is more future-proof. Also, some wordnets now include definition and example triples. So I rebuilt omw.zip using an adapted script.

Thanks for your patience @goodmami!

Edited: corrected a bug in the packaging script that duplicated the first line in each wordnet. Here is the new script:
Omw2nltk.sh.txt
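For illustration only (not the attached script): a sketch of the row filtering described above, which writes the header exactly once and also keeps the new definition and example triples. The "def"/"exe" relation names and the field layout are assumptions:

def filter_rows(in_path, out_path):
    keep = ("lemma", "def", "exe")               # assumed relation suffixes
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        dst.write(next(src))                     # header line, written once
        rows = set()
        for line in src:
            fields = line.rstrip("\n").split("\t")
            # rows look roughly like: offset-pos <TAB> lang:relation <TAB> value
            if len(fields) > 2 and fields[1].endswith(keep):
                rows.add(line)
        dst.writelines(sorted(rows))             # sort and deduplicate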

@fcbond (Contributor) commented Dec 6, 2021

Great, glad my script helped @ekaf.

@stevenbird (Member) commented:
@ekaf, @fcbond, @goodmami – it looks like you have consensus now... am I good to merge this?

@ekaf (Contributor, Author) commented Dec 6, 2021

@stevenbird, I also think it looks OK now.

@fcbond (Contributor) commented Dec 6, 2021 via email

@goodmami commented Dec 6, 2021

I inspected the latest omw.zip from @ekaf's fork, and it looks good to me!

Thanks, all!

@stevenbird stevenbird merged commit 8d83662 into nltk:gh-pages Dec 7, 2021
@stevenbird (Member) commented:
Thanks @ekaf, @fcbond, @goodmami!
