
Support both iso639-3 codes and BCP-47 language tags #3060

Merged (39 commits) on Dec 7, 2022

Conversation

@ekaf (Contributor) commented Oct 5, 2022

Fix #3058 by adding a function to obtain language names, using their respective iso639-3 language codes.

This PR adds a langname.py module in the nltk directory. I plan to use it with wordnet.py, to obtain the names of the multilingual wordnets from OMW.

>>> from nltk.langnames import langname
>>> for code in ["cat", "wln"]:
...     print(f"{code}: {langname(code)}")
cat: Catalan
wln: Walloon

The language codes were obtained from https://iso639-3.sil.org/code_tables/download_tables
and since it is not clear whether this use is authorized, it seems safer to ask them:

For any questions about whether a particular use is covered by these guidelines, contact the Registration Authority at ISO639-3@sil.org.

@ekaf (Contributor, Author) commented Oct 5, 2022

@tomaarsen, curiously, CI fails only with Python 3.9 on Ubuntu, but succeeds on the other platforms. The failure seems unrelated to this PR, since all the errors concern nltk/parse/corenlp.py:

grep -i err 'Python 3.9 on ubuntu-latest/9_Run pytest.txt'

2022-10-05T11:48:57.8556549Z ==================================== ERRORS ====================================
2022-10-05T11:48:57.8557014Z _______________ ERROR at setup of TestTokenizerAPI.test_tokenize _______________
2022-10-05T11:48:57.8597206Z E       OSError: [Errno 98] Address already in use
2022-10-05T11:48:57.8599517Z nltk/parse/corenlp.py:32: OSError
2022-10-05T11:48:57.8601953Z _______________ ERROR at setup of TestTaggerAPI.test_ner_tagger ________________
2022-10-05T11:48:57.8635756Z E       OSError: [Errno 98] Address already in use
2022-10-05T11:48:57.8636117Z nltk/parse/corenlp.py:32: OSError
2022-10-05T11:48:57.8657887Z _______________ ERROR at setup of TestTaggerAPI.test_pos_tagger ________________
2022-10-05T11:48:57.8739007Z E       OSError: [Errno 98] Address already in use
2022-10-05T11:48:57.8741508Z nltk/parse/corenlp.py:32: OSError
2022-10-05T11:48:57.8744008Z ___________ ERROR at setup of TestTaggerAPI.test_unexpected_tagtype ____________
2022-10-05T11:48:57.8766761Z E       OSError: [Errno 98] Address already in use
2022-10-05T11:48:57.8769194Z nltk/parse/corenlp.py:32: OSError
2022-10-05T11:48:57.8769611Z ____________ ERROR at setup of TestParserAPI.test_dependency_parser ____________
2022-10-05T11:48:57.8823049Z E       OSError: [Errno 98] Address already in use
2022-10-05T11:48:57.8881010Z nltk/parse/corenlp.py:32: OSError
2022-10-05T11:48:57.8886635Z __________________ ERROR at setup of TestParserAPI.test_parse __________________
2022-10-05T11:48:57.8976601Z E       OSError: [Errno 98] Address already in use
2022-10-05T11:48:57.8981665Z nltk/parse/corenlp.py:32: OSError
2022-10-05T11:48:57.9228352Z = 697 passed, 26 skipped, 9 xfailed, 18 warnings, 6 errors in 127.83s (0:02:07) =
2022-10-05T11:48:58.0065390Z ##[error]Process completed with exit code 1.



@tomaarsen (Member) commented:

I'm noticing the test failures. Of Windows, Mac, and Ubuntu, only the Ubuntu test setup downloads third-party tools, including CoreNLP. The Windows and Mac suites simply skip these tests if the relevant jar files are not present.

The issue seems to indicate that either:

  • The CoreNLPServer instance does not always terminate correctly, causing later tests to fail when they try to claim port 9000.
  • The CI runner runs tests in parallel in such a way that two separate CoreNLPServer instances are started on port 9000 at the same time.

I'll be looking into this issue, and otherwise I'll simply set these 6 tests to always be skipped. The test failures can be ignored in this PR.

@ekaf (Contributor, Author) commented Oct 5, 2022

@tomaarsen, ok, but it still seems mysterious that Ubuntu fails only with Python 3.9 and succeeds with the other Python versions.

@tomaarsen (Member) commented:

They fail arbitrarily. Initially, they failed for all 4 versions. Then, I restarted the failing ones, and 3 of the 4 passed, leaving only the Python 3.9 one to fail.

@ekaf (Contributor, Author) commented Oct 5, 2022

Some language codes used by the Extended OMW were retired in the latest version of the iso639-3 standard. This PR now also supports the retired codes, so we can get a language name for all the wordnets in OMW-1.4 plus the Extended OMW (1222 wordnets, covering slightly fewer languages):


from nltk.corpus import wordnet as wn
from nltk.langnames import langname

wn.add_omw()
wn.add_exomw()

for lang in wn.langs():
    code = lang.split("_")[0]
    print(f"{lang}: {langname(code)}")

eng: English
als: Tosk Albanian
arb: Standard Arabic
bul: Bulgarian
cmn: Mandarin Chinese
[...]
zro_wikt: Záparo
zsm_wikt: Standard Malay
zul_wikt: Zulu
zun_wikt: Zuni
zza_wikt: Zaza

@stevenbird (Member) commented:

I think generic functionality like language identity belongs in a top-level module.

@ekaf (Contributor, Author) commented Oct 6, 2022

The hardest part may be to obtain approval from iso639-3.sil.org.
I can ask them, unless somebody else wants to.

@ekaf (Contributor, Author) commented Oct 7, 2022

An alternative could be to ask users to download the data from iso639-3.sil.org, but this seems more cumbersome, since it would require checking at each call that the functionality is available. Any suggestions?

@tomaarsen (Member) commented:

This functionality seems like a convenience function that someone should be able to import quickly and run a language code through. If the user first had to download a file from iso639-3.sil.org, it would be considerably easier for them to just use their search engine of choice to find out what the language code means.
So I would rather not go that route.

@ekaf (Contributor, Author) commented Oct 7, 2022

This use of the iso639-3 data seems to conform to the registration authority's guidelines. Here's their response:

From: Registrar ISO639-3 iso639-3@sil.org
Date: Fri, 7 Oct 2022 11:50:45 -0500
Subject: Re: Terms of use

Dear Eric,

Your use described in this proposal appears to conform to our terms of use
https://iso639-3.sil.org/code_tables/download_tables#termsofuse. You may
request to be informed
https://iso639-3.sil.org/code_changes/requesting_notification_of_changes
about updates to the download tables, as we do not have an API, and there
are updates to the tables posted annually. In addition, we anticipate that
changes be approved on a more frequent basis starting in the next year.

Kind regards,

Janell

@ekaf (Contributor, Author) commented Oct 11, 2022

Parenthesized text in the language names should not be discarded, because the resulting duplicate names would make the iso639name dictionary non-injective, breaking the bijectivity of the mapping. The same problem arises if the retired codes are added to the same dictionary.

The problem is fixed now, by keeping the retired codes in a separate iso639retired dictionary, and by keeping the parenthesized text.
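To make the failure mode concrete, here is a small self-contained sketch (not code from this PR) using entries that appear in this discussion: "zlm" is the current iso639-3 code for "Malay (individual language)", "mly" its retired predecessor, and "msa" the Malay macrolanguage.

```python
iso639name = {
    "zlm": "Malay (individual language)",
    "msa": "Malay (macrolanguage)",
}
iso639retired = {"mly": "Malay (individual language)"}  # kept apart

# Keeping full names and keeping retired codes separate preserves a bijection:
inverse = {name: code for code, name in iso639name.items()}
assert len(inverse) == len(iso639name)  # invertible, round-trip safe

# Stripping parenthesized text collapses both names to "Malay":
stripped = {code: name.split(" (")[0] for code, name in iso639name.items()}
assert len(set(stripped.values())) < len(stripped)  # no longer injective

# Merging retired codes into the same dict has the same effect, since
# "zlm" and "mly" now share one name:
merged = {**iso639name, **iso639retired}
inverse_merged = {name: code for code, name in merged.items()}
assert len(inverse_merged) < len(merged)  # an entry was silently lost
```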

@ekaf (Contributor, Author) commented Oct 12, 2022

Although each dictionary here is a bijective mapping, the wrapper functions, which combine a main dictionary with one for retired codes, deviate from bijectivity in cases where the code for a particular language was retired and replaced by a new code for the same language.
In these cases, the langcode() wrapper function returns only the new iso639-3 code, not the retired one:

import nltk.langnames as lgn

for (ret_code, ret_name) in lgn.iso639retired.items():
    code = lgn.langcode(ret_name)
    if ret_code != code:
        print(f"{ret_code}:{code}:{ret_name}")

fri:fry:Western Frisian
cit:ctg:Chittagonian
flm:cfm:Falam Chin
nhs:npl:Southeastern Puebla Nahuatl
aiz:aiw:Aari
azr:adz:Adzera
bcx:pmf:Pamona
bii:bzi:Bisu
blu:hnj:Hmong Njua
fiz:izr:Izere
kds:lhi:Lahu Shi
mdo:gso:Southwest Gbaya
nky:kix:Khiamniungan Naga
ork:okv:Orokaiva
rjb:rjs:Rajbanshi
suf:tpf:Tarpia
suh:sxb:Suba
mly:zlm:Malay (individual language)
muw:unr:Mundari
xst:stv:Silt'e
scc:srp:Serbian
scr:hrv:Croatian
agp:prf:Paranan
rmr:rmq:Caló
sul:sgd:Surigaonon
wgw:wgb:Wagawaga
bjq:bzc:Southern Betsimisaraka Malagasy
nbf:nxq:Naxi
baz:tvu:Tunen
kpp:jkp:Paku Karen
wiw:wgu:Wirangu
yen:ynq:Yendang
daf:dnj:Dan
djl:dze:Djiwarli
nlr:nrk:Ngarla
wit:wnw:Wintu
yiy:yyr:Yir Yoront
sap:spn:Sanapaná
duj:dwu:Dhuwal
kjf:klj:Khalaj
kxu:uki:Kui (India)
sdm:sdq:Semandang
gji:gyz:Geji
lno:lgo:Lango (South Sudan)
wya:wyn:Wyandot

@ekaf (Contributor, Author) commented Oct 28, 2022

@tomaarsen, I noticed that I had committed some unwanted files, so I reset this branch, which may have accidentally overwritten your upgrade of pyupgrade (I am not too sure of this). Anyway, the error seems the same as before:

2022-10-28T10:38:43.9604962Z Traceback (most recent call last):
2022-10-28T10:38:43.9605731Z   File "/home/runner/.cache/pre-commit/repo825n37x5/py_env-python3.10/bin/pyupgrade", line 5, in <module>
2022-10-28T10:38:43.9606077Z     from pyupgrade._main import main
2022-10-28T10:38:43.9606652Z   File "/home/runner/.cache/pre-commit/repo825n37x5/py_env-python3.10/lib/python3.10/site-packages/pyupgrade/_main.py", line 30, in <module>
2022-10-28T10:38:43.9607042Z     from pyupgrade._data import FUNCS
2022-10-28T10:38:43.9607610Z   File "/home/runner/.cache/pre-commit/repo825n37x5/py_env-python3.10/lib/python3.10/site-packages/pyupgrade/_data.py", line 126, in <module>
2022-10-28T10:38:43.9607974Z     _import_plugins()
2022-10-28T10:38:43.9608557Z   File "/home/runner/.cache/pre-commit/repo825n37x5/py_env-python3.10/lib/python3.10/site-packages/pyupgrade/_data.py", line 123, in _import_plugins
2022-10-28T10:38:43.9609015Z     __import__(name, fromlist=['_trash'])
2022-10-28T10:38:43.9609601Z   File "/home/runner/.cache/pre-commit/repo825n37x5/py_env-python3.10/lib/python3.10/site-packages/pyupgrade/_plugins/pep584.py", line 5, in <module>
2022-10-28T10:38:43.9610006Z     from tokenize_rt import List
2022-10-28T10:38:43.9610587Z ImportError: cannot import name 'List' from 'tokenize_rt' (/home/runner/.cache/pre-commit/repo825n37x5/py_env-python3.10/lib/python3.10/site-packages/tokenize_rt.py)
2022-10-28T10:38:43.9610895Z 
2022-10-28T10:39:06.1911460Z black....................................................................Passed
2022-10-28T10:39:07.6552814Z isort....................................................................Passed
2022-10-28T10:39:07.6773891Z ##[error]The process '/opt/hostedtoolcache/Python/3.10.8/x64/bin/pre-commit' failed with exit code 1.

@ekaf (Contributor, Author) commented Nov 3, 2022

Added a bcp47 CorpusReader, which needs the new bcp47 nltk_data package (nltk/nltk_data#191).

ISO 639-3 is now supported in langnames.py through bcp47, so all doctests fail if the bcp47 data package is missing.

@ekaf changed the title from "(Draft) Add support for iso639-3 language codes" to "Support both iso639-3 codes and BCP-47 language tags" on Nov 3, 2022
@github-actions bot added the corpus label on Dec 6, 2022
@tomaarsen (Member) left a comment

@ekaf

I've made some minor changes and one fix. The tests pass for me locally, so I'm nearing the point where I'm ready to merge, although some changes still need to be made at nltk/nltk_data#191.

That said, I do have some questions; see my comments. This is looking good, though! Glad we went the NLTK data route.

Review comments on nltk/corpus/reader/bcp47.py (resolved)
@ekaf (Contributor, Author) commented Dec 7, 2022

Thanks for all your changes @tomaarsen, and in particular for catching the nasty confusion between 'tag' and 'subtag'.
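For context, a BCP-47 tag is a sequence of hyphen-separated subtags (language, optional script, optional region, and more), so "tag" names the whole string while "subtag" names one component, which is exactly the confusion mentioned above. A minimal, illustrative parser (not the reader added in this PR) might look like:

```python
def split_subtags(tag):
    """Split a BCP-47 tag into a dict of its leading subtags.

    Minimal illustration: handles only the language subtag, an optional
    4-letter script, and an optional 2-letter region; real BCP-47 parsing
    (variants, extensions, private use) is considerably more involved.
    """
    parts = tag.split("-")
    result = {"language": parts[0].lower()}
    rest = parts[1:]
    if rest and len(rest[0]) == 4 and rest[0].isalpha():
        result["script"] = rest[0].title()  # scripts are title-cased, e.g. Hans
        rest = rest[1:]
    if rest and len(rest[0]) == 2 and rest[0].isalpha():
        result["region"] = rest[0].upper()  # regions are upper-cased, e.g. CN
    return result

print(split_subtags("zh-Hans-CN"))
# -> {'language': 'zh', 'script': 'Hans', 'region': 'CN'}
```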

@github-actions bot removed the wordnet label on Dec 7, 2022
@tomaarsen merged commit f019fbe into nltk:develop on Dec 7, 2022
@tomaarsen (Member) commented:

Wonderful! Thank you for this!

@ekaf deleted the langnames branch on December 8, 2022

Successfully merging this pull request may close these issues.

Translating between language names and iso-639 codes
3 participants