Support both iso639-3 codes and BCP-47 language tags #3060
Conversation
@tomaarsen, curiously, CI only fails with Python 3.9 on Ubuntu, but succeeds on the other platforms. The failure seems unrelated to this PR, since all the errors concern nltk/parse/corenlp.py:
I'm noticing the test failures. Out of Windows, Mac and Ubuntu, only the Ubuntu test setup prepares the download of third-party tools, including CoreNLP. The Windows and Mac suites simply skip these tests if it is discovered that the relevant jar files are not present. The issue seems to indicate that either:
I'll be looking into this issue, and otherwise I'll simply set these 6 tests to always be skipped. The test failures can be ignored in this PR.
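For illustration, a conditional skip of the CoreNLP-dependent tests might look roughly like the sketch below. This is only an assumption about how it could be done with pytest, not the actual NLTK test code, and the jar path shown is hypothetical.

```python
import os

import pytest

# Hypothetical location of the CoreNLP jar; the real suite discovers it differently.
CORENLP_JAR = os.environ.get("CORENLP_JAR", "stanford-corenlp.jar")

@pytest.mark.skipif(
    not os.path.exists(CORENLP_JAR),
    reason="CoreNLP jar not available; skipping CoreNLP-dependent tests",
)
def test_corenlp_dependency_parse():
    ...  # the CoreNLP tests themselves are unchanged
```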
@tomaarsen, OK, but it still seems mysterious that Ubuntu only fails with Python 3.9 and succeeds with the other Python versions.
They fail arbitrarily. Initially, they failed for all 4 versions. Then, I restarted the failing ones, and 3 of the 4 passed, leaving only the Python 3.9 one to fail.
Some language codes used by the Extended OMW were retired in the latest version of the iso639-3 standard. Now, this PR also supports the retired codes, so we can get a language name for all the wordnets in OMW-1.4 plus the Extended OMW (1222 wordnets, slightly fewer languages).
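For illustration, such a lookup over the installed wordnets could look like the minimal sketch below, assuming the bcp47 and omw-1.4 data packages have been downloaded; the printed names are only indicative.

```python
# Minimal sketch: print a language name for every installed OMW wordnet.
# Assumes the required data packages are present, e.g. via
#   import nltk; nltk.download("bcp47"); nltk.download("omw-1.4")
from nltk.corpus import wordnet as wn
from nltk.langnames import langname

for code in sorted(wn.langs()):           # iso639-3 codes of the available wordnets
    print(f"{code}: {langname(code)}")    # e.g. "fra: French"
```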
I think generic functionality like language identity belongs in a top-level module.
The hardest part may be to obtain approval from iso639-3.sil.org.
An alternative could be to ask users to download the data from iso639-3.sil.org, but this seems more cumbersome, since it would require checking at each call that the functionality is available. Any suggestions?
This functionality seems like a convenience feature: something a user should be able to import quickly and run a language code through. If the user first has to download a file from iso639-3.sil.org, it will be considerably easier for them to just use their search engine of choice to find out what the language code means.
This use of the iso639-3 data seems to conform to the registration authority's guidelines. Here's their response:
Parenthesized text in the language names should not be discarded, because distinct codes would then map to the same name, making the iso639name dictionary non-injective and breaking the bijectivity of the mapping. The same problem arises if the retired codes are added to the same dictionary. This is now fixed by keeping the retired codes in a separate iso639retired dictionary and by keeping the parenthesized text.
Although each dictionary here is a bijective mapping, the wrapper functions, which combine a main dictionary with one for retired codes, deviate from bijectivity in the cases where the code for a particular language is retired and replaced by a new code for the same language.
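To make that point concrete, here is a purely illustrative sketch (hypothetical data using private-use codes, not the actual dictionaries in langnames.py): each table is bijective on its own, but the combined lookup is no longer injective once a retired code and its replacement name the same language.

```python
# Hypothetical data only: "qaa"/"qab"/"qac" are private-use iso639-3 codes,
# and the retired/replacement relationship shown here is invented.
iso639name = {"qaa": "Examplish", "qab": "Otherish"}   # active codes
iso639retired = {"qac": "Examplish"}                   # retired code, replaced by "qaa"

def langname_sketch(code):
    """Return a language name, consulting the active table first."""
    return iso639name.get(code) or iso639retired.get(code)

# Both dictionaries are bijections on their own, but the combined lookup
# maps two codes to the same name, so no unique inverse exists:
assert langname_sketch("qaa") == langname_sketch("qac") == "Examplish"
```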
@tomaarsen, I noticed that I had committed some unwanted files, so I reset this branch, which may have accidentally overwritten your pyupgrade update (I am not sure about this). Anyway, the error seems the same as before:
Added a bcp47 CorpusReader, which needs the new bcp47 data package. ISO 639-3 is now supported in langnames.py through bcp47, so all doctests fail if the bcp47 data package is missing.
Now it throws e.g.

```
...\nltk_3060.py:9: UserWarning: Shortening 'smo' to 'sm'
  print(f"{lang}: {langname(code)}")
```

rather than

```
...\nltk\langnames.py:64: UserWarning: Shortening zha to za
  warn(f"Shortening {code} to {code2}")
```
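Presumably the difference comes from the warning's stack level: issuing the warning with a higher stacklevel attributes it to the caller's line rather than to langnames.py itself. A minimal sketch of that mechanism (not the actual module code):

```python
import warnings

def shorten(code, code2):
    # With stacklevel=2 the warning is reported at the caller's file and
    # line (e.g. nltk_3060.py:9) rather than inside the function that warns.
    warnings.warn(f"Shortening {code!r} to {code2!r}", stacklevel=2)
```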
I've made some minor changes and one fix. The tests pass for me locally, so I'm nearing the point where I'm ready to merge, although there are still some changes that need to be made at nltk/nltk_data#191.
That said, I do have some questions; see my comments. This is looking good, though! Glad we went with the NLTK data route.
Thanks for all your changes, @tomaarsen, and in particular for catching the nasty confusion between 'tag' and 'subtag'.
Wonderful! Thank you for this!
Fix #3058 by adding a function to obtain language names from their respective iso639-3 language codes.
This PR adds a langnames.py module in the nltk directory. I plan to use it with wordnet.py to obtain the names of the multilingual wordnets from OMW.
The language codes were obtained from https://iso639-3.sil.org/code_tables/download_tables
and since it is not clear whether this use is authorized, it seems safer to ask them: