
Support both iso639-3 codes and BCP-47 language tags #3060

Merged (39 commits) on Dec 7, 2022

Conversation

@ekaf (Contributor) commented Oct 5, 2022

Fix #3058 by adding a function to obtain language names, using their respective iso639-3 language codes.

This PR adds a langname.py module in the nltk directory. I plan to use it with wordnet.py, to obtain the names of the multilingual wordnets from OMW.

>>> from nltk.langnames import langname
>>> for code in ["cat", "wln"]:
...     print(f"{code}: {langname(code)}")
cat: Catalan
wln: Walloon

The language codes were obtained from https://iso639-3.sil.org/code_tables/download_tables
and since it is not clear whether this use is authorized, it seems safer to ask them:

For any questions about whether a particular use is covered by these guidelines, contact the Registration Authority at ISO639-3@sil.org.

@ekaf (Contributor, Author) commented Oct 5, 2022

@tomaarsen, curiously, CI fails only with Python 3.9 on Ubuntu, but succeeds on the other platforms. The failure seems unrelated to this PR, since all the errors concern nltk/parse/corenlp.py:

grep -i err 'Python 3.9 on ubuntu-latest/9_Run pytest.txt'

2022-10-05T11:48:57.8556549Z ==================================== ERRORS ====================================
2022-10-05T11:48:57.8557014Z _______________ ERROR at setup of TestTokenizerAPI.test_tokenize _______________
2022-10-05T11:48:57.8597206Z E       OSError: [Errno 98] Address already in use
2022-10-05T11:48:57.8599517Z nltk/parse/corenlp.py:32: OSError
2022-10-05T11:48:57.8601953Z _______________ ERROR at setup of TestTaggerAPI.test_ner_tagger ________________
2022-10-05T11:48:57.8635756Z E       OSError: [Errno 98] Address already in use
2022-10-05T11:48:57.8636117Z nltk/parse/corenlp.py:32: OSError
2022-10-05T11:48:57.8657887Z _______________ ERROR at setup of TestTaggerAPI.test_pos_tagger ________________
2022-10-05T11:48:57.8739007Z E       OSError: [Errno 98] Address already in use
2022-10-05T11:48:57.8741508Z nltk/parse/corenlp.py:32: OSError
2022-10-05T11:48:57.8744008Z ___________ ERROR at setup of TestTaggerAPI.test_unexpected_tagtype ____________
2022-10-05T11:48:57.8766761Z E       OSError: [Errno 98] Address already in use
2022-10-05T11:48:57.8769194Z nltk/parse/corenlp.py:32: OSError
2022-10-05T11:48:57.8769611Z ____________ ERROR at setup of TestParserAPI.test_dependency_parser ____________
2022-10-05T11:48:57.8823049Z E       OSError: [Errno 98] Address already in use
2022-10-05T11:48:57.8881010Z nltk/parse/corenlp.py:32: OSError
2022-10-05T11:48:57.8886635Z __________________ ERROR at setup of TestParserAPI.test_parse __________________
2022-10-05T11:48:57.8976601Z E       OSError: [Errno 98] Address already in use
2022-10-05T11:48:57.8981665Z nltk/parse/corenlp.py:32: OSError
2022-10-05T11:48:57.9228352Z = 697 passed, 26 skipped, 9 xfailed, 18 warnings, 6 errors in 127.83s (0:02:07) =
2022-10-05T11:48:58.0065390Z ##[error]Process completed with exit code 1.



@tomaarsen (Member) commented:

I'm noticing the test failures. Of Windows, Mac, and Ubuntu, only the Ubuntu test setup downloads third-party tools, including CoreNLP. The Windows and Mac suites simply skip these tests if the relevant jar files are not present.

The issue seems to indicate that either:

  • The CoreNLPServer instance does not always terminate correctly, causing later tests to fail when they try to claim port 9000.
  • The CI runner runs tests in parallel in such a way that two separate CoreNLPServer instances are started on port 9000 at the same time.

I'll be looking into this issue, and otherwise I'll simply set these 6 tests to always be skipped. The test failures can be ignored in this PR.

@ekaf (Contributor, Author) commented Oct 5, 2022

@tomaarsen, ok, but it still seems mysterious that Ubuntu fails only with Python 3.9 and succeeds with the other Python versions.

@tomaarsen (Member) commented:

They fail arbitrarily. Initially, they failed for all 4 versions. Then, I restarted the failing ones, and 3 of the 4 passed, leaving only the Python 3.9 one to fail.

@ekaf (Contributor, Author) commented Oct 5, 2022

Some language codes used by the Extended OMW were retired in the latest version of the iso639-3 standard. This PR now also supports the retired codes, so we can get a language name for all the wordnets in OMW-1.4 plus the Extended OMW (1222 wordnets, covering slightly fewer languages):


from nltk.corpus import wordnet as wn
from nltk.langnames import langname

wn.add_omw()
wn.add_exomw()

for lang in wn.langs():
    code = lang.split("_")[0]
    print(f"{lang}: {langname(code)}")

eng: English
als: Tosk Albanian
arb: Standard Arabic
bul: Bulgarian
cmn: Mandarin Chinese
[...]
zro_wikt: Záparo
zsm_wikt: Standard Malay
zul_wikt: Zulu
zun_wikt: Zuni
zza_wikt: Zaza

@stevenbird (Member) commented:

I think generic functionality like language identity belongs in a top-level module.

@ekaf (Contributor, Author) commented Oct 6, 2022

The hardest part may be to obtain approval from iso639-3.sil.org.
I can ask them, unless somebody else wants to.

@ekaf (Contributor, Author) commented Oct 7, 2022

An alternative could be to ask users to download the data from iso639-3.sil.org, but this seems more cumbersome, since it would require checking at each call that the functionality is available. Any suggestions?

@tomaarsen (Member) commented:

This functionality seems like a convenience function that someone should be able to import quickly and run a language code through. If the user first had to download a file from iso639-3.sil.org, it would be considerably easier for them to just use their search engine of choice to find out what the language code means.
So I would rather not go that route.

@ekaf (Contributor, Author) commented Oct 7, 2022

This use of the iso639-3 data seems to conform to the registration authority's guidelines. Here's their response:

From: Registrar ISO639-3 iso639-3@sil.org
Date: Fri, 7 Oct 2022 11:50:45 -0500
Subject: Re: Terms of use

Dear Eric,

Your use described in this proposal appears to conform to our terms of use
https://iso639-3.sil.org/code_tables/download_tables#termsofuse. You may
request to be informed
https://iso639-3.sil.org/code_changes/requesting_notification_of_changes
about updates to the download tables, as we do not have an API, and there
are updates to the tables posted annually. In addition, we anticipate that
changes be approved on a more frequent basis starting in the next year.

Kind regards,

Janell

@ekaf (Contributor, Author) commented Oct 11, 2022

Parenthesized text in the language names should not be discarded, because the resulting duplicate names would make the iso639name dictionary non-injective, breaking the bijectivity of the mapping. The same problem arises if the retired codes are added to the same dictionary.

The problem is fixed now, by keeping the retired codes in a separate iso639retired dictionary, and by keeping the parenthesized text.
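To make the failure mode concrete, here is a small self-contained sketch (not code from this PR) using entries that appear in this discussion: "zlm" is the current iso639-3 code for "Malay (individual language)", "mly" its retired predecessor, and "msa" the Malay macrolanguage.

```python
iso639name = {
    "zlm": "Malay (individual language)",
    "msa": "Malay (macrolanguage)",
}
iso639retired = {"mly": "Malay (individual language)"}  # kept apart

# Keeping full names and keeping retired codes separate preserves a bijection:
inverse = {name: code for code, name in iso639name.items()}
assert len(inverse) == len(iso639name)  # invertible, round-trip safe

# Stripping parenthesized text collapses both names to "Malay":
stripped = {code: name.split(" (")[0] for code, name in iso639name.items()}
assert len(set(stripped.values())) < len(stripped)  # no longer injective

# Merging retired codes into the same dict has the same effect, since
# "zlm" and "mly" now share one name:
merged = {**iso639name, **iso639retired}
inverse_merged = {name: code for code, name in merged.items()}
assert len(inverse_merged) < len(merged)  # an entry was silently lost
```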

@ekaf (Contributor, Author) commented Oct 12, 2022

Although each dictionary here is a bijective mapping, the wrapper functions, which combine a main dictionary with one for retired codes, deviate from bijectivity in cases where the code for a particular language was retired and replaced by a new code for the same language.
In these cases, the langcode() wrapper function returns only the new iso639-3 code, not the retired one:

import nltk.langnames as lgn

for (ret_code, ret_name) in lgn.iso639retired.items():
    code = lgn.langcode(ret_name)
    if ret_code != code:
        print(f"{ret_code}:{code}:{ret_name}")

fri:fry:Western Frisian
cit:ctg:Chittagonian
flm:cfm:Falam Chin
nhs:npl:Southeastern Puebla Nahuatl
aiz:aiw:Aari
azr:adz:Adzera
bcx:pmf:Pamona
bii:bzi:Bisu
blu:hnj:Hmong Njua
fiz:izr:Izere
kds:lhi:Lahu Shi
mdo:gso:Southwest Gbaya
nky:kix:Khiamniungan Naga
ork:okv:Orokaiva
rjb:rjs:Rajbanshi
suf:tpf:Tarpia
suh:sxb:Suba
mly:zlm:Malay (individual language)
muw:unr:Mundari
xst:stv:Silt'e
scc:srp:Serbian
scr:hrv:Croatian
agp:prf:Paranan
rmr:rmq:Caló
sul:sgd:Surigaonon
wgw:wgb:Wagawaga
bjq:bzc:Southern Betsimisaraka Malagasy
nbf:nxq:Naxi
baz:tvu:Tunen
kpp:jkp:Paku Karen
wiw:wgu:Wirangu
yen:ynq:Yendang
daf:dnj:Dan
djl:dze:Djiwarli
nlr:nrk:Ngarla
wit:wnw:Wintu
yiy:yyr:Yir Yoront
sap:spn:Sanapaná
duj:dwu:Dhuwal
kjf:klj:Khalaj
kxu:uki:Kui (India)
sdm:sdq:Semandang
gji:gyz:Geji
lno:lgo:Lango (South Sudan)
wya:wyn:Wyandot

@ekaf (Contributor, Author) commented Oct 28, 2022

@tomaarsen, I noticed that I had committed some unwanted files, so I reset this branch, which may have accidentally overwritten your upgrade of pyupgrade (I am not too sure of this). Anyway, the error seems the same as before:

2022-10-28T10:38:43.9604962Z Traceback (most recent call last):
2022-10-28T10:38:43.9605731Z   File "/home/runner/.cache/pre-commit/repo825n37x5/py_env-python3.10/bin/pyupgrade", line 5, in <module>
2022-10-28T10:38:43.9606077Z     from pyupgrade._main import main
2022-10-28T10:38:43.9606652Z   File "/home/runner/.cache/pre-commit/repo825n37x5/py_env-python3.10/lib/python3.10/site-packages/pyupgrade/_main.py", line 30, in <module>
2022-10-28T10:38:43.9607042Z     from pyupgrade._data import FUNCS
2022-10-28T10:38:43.9607610Z   File "/home/runner/.cache/pre-commit/repo825n37x5/py_env-python3.10/lib/python3.10/site-packages/pyupgrade/_data.py", line 126, in <module>
2022-10-28T10:38:43.9607974Z     _import_plugins()
2022-10-28T10:38:43.9608557Z   File "/home/runner/.cache/pre-commit/repo825n37x5/py_env-python3.10/lib/python3.10/site-packages/pyupgrade/_data.py", line 123, in _import_plugins
2022-10-28T10:38:43.9609015Z     __import__(name, fromlist=['_trash'])
2022-10-28T10:38:43.9609601Z   File "/home/runner/.cache/pre-commit/repo825n37x5/py_env-python3.10/lib/python3.10/site-packages/pyupgrade/_plugins/pep584.py", line 5, in <module>
2022-10-28T10:38:43.9610006Z     from tokenize_rt import List
2022-10-28T10:38:43.9610587Z ImportError: cannot import name 'List' from 'tokenize_rt' (/home/runner/.cache/pre-commit/repo825n37x5/py_env-python3.10/lib/python3.10/site-packages/tokenize_rt.py)
2022-10-28T10:38:43.9610895Z 
2022-10-28T10:39:06.1911460Z black....................................................................Passed
2022-10-28T10:39:07.6552814Z isort....................................................................Passed
2022-10-28T10:39:07.6773891Z ##[error]The process '/opt/hostedtoolcache/Python/3.10.8/x64/bin/pre-commit' failed with exit code 1.

@ekaf (Contributor, Author) commented Nov 3, 2022

Added a bcp47 CorpusReader, which needs the new bcp47 nltk_data package (nltk/nltk_data#191).

ISO 639-3 is now supported in langnames.py through bcp47, so all doctests fail if the bcp47 data package is missing.

@ekaf changed the title from "(Draft) Add support for iso639-3 language codes" to "Support both iso639-3 codes and BCP-47 language tags" on Nov 3, 2022
@github-actions bot added the corpus label on Dec 6, 2022
@tomaarsen (Member) left a comment

@ekaf

I've made some minor changes and one fix. The tests pass for me locally, so I'm nearing the point where I'm ready to merge, although some changes still need to be made at nltk/nltk_data#191.

That said, I do have some questions; see my comments. This is looking good, though! Glad we went the NLTK data route.

Review comments on nltk/corpus/reader/bcp47.py (resolved)
@ekaf (Contributor, Author) commented Dec 7, 2022

Thanks for all your changes @tomaarsen, and in particular for catching the nasty confusion between 'tag' and 'subtag'.
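For context, a BCP-47 tag is a sequence of hyphen-separated subtags (language, optional script, optional region, and more), so "tag" names the whole string while "subtag" names one component, which is exactly the confusion mentioned above. A minimal, illustrative parser (not the reader added in this PR) might look like:

```python
def split_subtags(tag):
    """Split a BCP-47 tag into a dict of its leading subtags.

    Minimal illustration: handles only the language subtag, an optional
    4-letter script, and an optional 2-letter region; real BCP-47 parsing
    (variants, extensions, private use) is considerably more involved.
    """
    parts = tag.split("-")
    result = {"language": parts[0].lower()}
    rest = parts[1:]
    if rest and len(rest[0]) == 4 and rest[0].isalpha():
        result["script"] = rest[0].title()  # scripts are title-cased, e.g. Hans
        rest = rest[1:]
    if rest and len(rest[0]) == 2 and rest[0].isalpha():
        result["region"] = rest[0].upper()  # regions are upper-cased, e.g. CN
    return result

print(split_subtags("zh-Hans-CN"))
# -> {'language': 'zh', 'script': 'Hans', 'region': 'CN'}
```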

@github-actions bot removed the wordnet label on Dec 7, 2022
@tomaarsen merged commit f019fbe into nltk:develop on Dec 7, 2022
@tomaarsen (Member) commented:

Wonderful! Thank you for this!

@ekaf deleted the langnames branch on December 8, 2022

Successfully merging this pull request may close these issues.

Translating between language names and iso-639 codes
3 participants