[RFC][Intl] Exclude legacy languages? #33165

Closed
ro0NL opened this issue Aug 14, 2019 · 12 comments
Labels
Intl RFC RFC = Request For Comments (proposals about features that you want to be discussed)

Comments

@ro0NL
Contributor

ro0NL commented Aug 14, 2019

Symfony version(s) affected: 3.4

When we generate the list of language names for the Intl component we rely on https://github.com/unicode-org/icu/blob/master/icu4c/source/data/lang/en.txt

This provides many translations, but is not the authoritative code list as-is.

See e.g. no and sh, which are both marked legacy in the metadata file.

The problem comes with ISO vs. ICU. ISO qualifies e.g. no as a macrolanguage (https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) and it seems valid today: https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=no, but sh is not: https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=sh

Not sure what to do :) Issue for now, to move forward in #33140

For now this seems to cause the lists (alpha2 vs. alpha3) to vary.

@TerjeBr

TerjeBr commented Aug 14, 2019

I am from Norway, and I hold the strong opinion that the deprecation of "no" and/or "nor" as a macrolanguage is a bug.

@ro0NL
Contributor Author

ro0NL commented Aug 14, 2019

We can also wait a bit to see what happens in the next ICU update, though I'm not aware of any release dates.

Getting the data consistent upstream (assuming ISO is right) would also be ideal; however, I'm not sure where it actually comes from, since ICU in turn merges from CLDR data.

@TerjeBr

TerjeBr commented Aug 14, 2019

I think the only issue here is that sh is deprecated according to the official source https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=sh but is still not deprecated in the icu4c source.

Since this causes a problem for us, with alpha2 codes not having the same list of languages as alpha3 codes, I think we should follow the official source and just add our own patch to the data.

@ro0NL
Contributor Author

ro0NL commented Aug 14, 2019

> I think the only issue here is that sh is deprecated according to the official source https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=sh but is still not deprecated in the icu4c source.

sh is marked legacy in ICU also https://github.com/unicode-org/icu/blob/master/icu4c/source/data/misc/metadata.txt#L1234

this is consistent with ISO, but no is inconsistent (e.g. it's not deprecated on ISO's side).

another issue is ICU points nor to nb

no{
    reason{"legacy"}
    replacement{"nb"}
}
nob{
    reason{"overlong"}
    replacement{"nb"}
}
nor{
    reason{"overlong"}
    replacement{"nb"}
}

so this would still cause different lists (alpha2=nb vs alpha3=nor,nob), which seems weird i agree 😅
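To make the mismatch concrete, here is a minimal Python sketch that parses the three entries quoted above and follows each `replacement`. The regex-based parser and the names `parse_aliases` and `resolve` are illustrative assumptions, not ICU or Symfony APIs; the real `metadata.txt` is a larger resource bundle, and this only handles the flat shape shown here.

```python
import re

# The three alias entries quoted above from ICU's metadata.txt.
METADATA = '''
no{
    reason{"legacy"}
    replacement{"nb"}
}
nob{
    reason{"overlong"}
    replacement{"nb"}
}
nor{
    reason{"overlong"}
    replacement{"nb"}
}
'''

# Matches one `code{ reason{"..."} replacement{"..."} }` entry.
ENTRY = re.compile(
    r'(\w+)\{\s*reason\{"(\w+)"\}\s*replacement\{"(\w+)"\}\s*\}'
)


def parse_aliases(text):
    """Return {code: {"reason": ..., "replacement": ...}} for each entry."""
    return {
        code: {"reason": reason, "replacement": replacement}
        for code, reason, replacement in ENTRY.findall(text)
    }


def resolve(code, aliases):
    """Follow replacements until a code without an alias entry is reached."""
    seen = set()
    while code in aliases and code not in seen:
        seen.add(code)  # guard against accidental alias cycles
        code = aliases[code]["replacement"]
    return code


aliases = parse_aliases(METADATA)
for code in ("no", "nob", "nor"):
    print(code, aliases[code]["reason"], "->", resolve(code, aliases))
```

All three codes resolve to nb, so following ICU's aliases as-is leaves nb as the only alpha2 entry, while nor and nob both exist on the alpha3 side.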

@TerjeBr

TerjeBr commented Aug 14, 2019

Yes, so the solution is clear to me then:

  • Drop sh because it is legacy.
  • Make no point to nor and nor point to no

I think we will have to wait for a very long time for this to get sorted in the upstream, so I think we should patch the data after we read it from the ICU source.
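A rough sketch of what such a post-read patch could look like, in Python for illustration only (the Intl data generator itself is PHP). The input dictionaries, `DROPPED`, `ALPHA2_OVERRIDES`, and `patch` are all hypothetical names and sample values, not the generator's actual structures.

```python
# Hypothetical input, standing in for what the generator reads from ICU.
NAMES = {
    "nb": "Norwegian Bokmål",
    "no": "Norwegian",
    "sh": "Serbo-Croatian",
    "sr": "Serbian",
}
ALPHA2_TO_ALPHA3 = {"nb": "nob", "sh": "hbs", "sr": "srp"}

# The local patch proposed above: drop sh, and pair no with nor.
DROPPED = {"sh"}
ALPHA2_OVERRIDES = {"no": "nor"}


def patch(names, alpha2_to_alpha3):
    """Apply the local overrides on top of the data read from ICU."""
    patched_names = {c: n for c, n in names.items() if c not in DROPPED}
    mapping = {
        a2: a3 for a2, a3 in alpha2_to_alpha3.items() if a2 not in DROPPED
    }
    mapping.update(ALPHA2_OVERRIDES)
    # Build the reverse direction from the same mapping so that
    # both lists stay in sync (no -> nor and nor -> no).
    reverse = {a3: a2 for a2, a3 in mapping.items()}
    return patched_names, mapping, reverse


names, to_alpha3, to_alpha2 = patch(NAMES, ALPHA2_TO_ALPHA3)
print(sorted(names))
print(to_alpha3["no"], to_alpha2["nor"])
```

Deriving the alpha3-to-alpha2 direction from the patched alpha2-to-alpha3 map is one way to keep the two lists from drifting apart again.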

@TerjeBr

TerjeBr commented Aug 14, 2019

Any reference to sh in the locales (list of translated language names) should be treated as if it was sr_Latn

@ro0NL
Contributor Author

ro0NL commented Aug 14, 2019

i'd like to see the impacted data if we skip all reason{"legacy"}, and whitelist as needed (e.g. no is not legacy and maps to nor, as verified by you). So yes, curious if there are more similar cases :)
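One way to inspect the impact is to split the name list into kept and skipped codes. The sketch below is in Python for illustration; `REASONS` holds only the reasons mentioned in this thread, and `NAMES`, `WHITELIST`, and `split_legacy` are hypothetical sample data and names.

```python
# Alias reasons as discussed in this thread (not the full ICU metadata).
REASONS = {"no": "legacy", "sh": "legacy", "nor": "overlong", "nob": "overlong"}

# Hypothetical slice of the generated language names.
NAMES = {
    "nb": "Norwegian Bokmål",
    "no": "Norwegian",
    "sh": "Serbo-Croatian",
    "sr": "Serbian",
}

# Codes we keep even though ICU marks them legacy.
WHITELIST = {"no"}


def split_legacy(names, reasons, whitelist=frozenset()):
    """Return (kept, skipped), where skipped holds non-whitelisted legacy codes."""
    kept, skipped = {}, {}
    for code, name in names.items():
        if reasons.get(code) == "legacy" and code not in whitelist:
            skipped[code] = name
        else:
            kept[code] = name
    return kept, skipped


kept, skipped = split_legacy(NAMES, REASONS, WHITELIST)
print("kept:", sorted(kept))
print("skipped:", skipped)
```

Running the same split with an empty whitelist lists every legacy code, which is a quick way to spot further cases like no.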

here's another interesting case 😓 https://github.com/unicode-org/icu/blob/master/icu4c/source/data/lang/en.txt#L77 which is also not ISO :)

> Any reference to sh in the locales (list of translated language names) should be treated as if it was sr_Latn

sr_Latn is already included on its own, and during lookup we resolve aliases. So yes, I agree: if we decide to exclude legacy languages, we should also exclude locales bound to those languages from the list

@TerjeBr

TerjeBr commented Aug 14, 2019

I think "Serbo-Croatian" is very much a live language and language name. It is just that it should be referenced as sr_Latn instead of sh. It is like the other localized language names that do not have their own two-letter or three-letter ISO code, but are bound to a more specific locale.

@TerjeBr

TerjeBr commented Aug 14, 2019

My point was that if an ISO code is marked as legacy (like sh), it is the code that is legacy and not necessarily the language name it points to. The fact that we have a locale bound to that language that is not marked as legacy shows that the language itself is not legacy.

@TerjeBr

TerjeBr commented Aug 14, 2019

Yes, I agree that "American Sign Language" should not be there.

@ro0NL
Contributor Author

ro0NL commented Aug 14, 2019

> My point was that if an ISO code is marked as legacy (like sh), it is the code that is legacy and not necessarily the language name it points to. That we have a locale bound to that language that is not marked as legacy testifies to that the language itself is not legacy.

not sure I understand that, or the actual concern raised

my point is that the locale list currently includes aliases (I consider that duplication):

"sh": "Serbo-Croatian",
"sh_BA": "Serbo-Croatian (Bosnia & Herzegovina)",

whereas ICU tells us to use sr(_Latn) instead:

"sr": "Serbian",
"sr_BA": "Serbian (Bosnia & Herzegovina)",

"sr_Latn": "Serbian (Latin)",
"sr_Latn_BA": "Serbian (Latin, Bosnia & Herzegovina)",
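Excluding locales bound to a legacy language could be sketched as follows, in Python for illustration. The entries are the ones listed above; `LEGACY_LANGUAGES`, `language_of`, and `without_legacy` are hypothetical names, and the sketch assumes the language is always the first underscore-separated subtag.

```python
# Locale names as listed above.
LOCALES = {
    "sh": "Serbo-Croatian",
    "sh_BA": "Serbo-Croatian (Bosnia & Herzegovina)",
    "sr": "Serbian",
    "sr_BA": "Serbian (Bosnia & Herzegovina)",
    "sr_Latn": "Serbian (Latin)",
    "sr_Latn_BA": "Serbian (Latin, Bosnia & Herzegovina)",
}

# Languages we decided to exclude because ICU marks them legacy.
LEGACY_LANGUAGES = {"sh"}


def language_of(locale):
    """Return the language subtag, e.g. 'sr' for 'sr_Latn_BA'."""
    return locale.split("_", 1)[0]


def without_legacy(locales, legacy_languages):
    """Drop every locale whose language subtag is a legacy language."""
    return {
        locale: name
        for locale, name in locales.items()
        if language_of(locale) not in legacy_languages
    }


filtered = without_legacy(LOCALES, LEGACY_LANGUAGES)
print(sorted(filtered))
```

Filtering on the language subtag removes sh and sh_BA together, while the sr and sr_Latn entries that ICU points to remain in the list.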

@xabbuh xabbuh added Intl RFC RFC = Request For Comments (proposals about features that you want to be discussed) labels Aug 14, 2019
@TerjeBr

TerjeBr commented Aug 14, 2019

According to https://en.wikipedia.org/wiki/Serbo-Croatian

ISO classification

Since the year 2000, the ISO classification does not recognize Serbo-Croatian as an individual language. Originally included, it has been removed from the ISO 639-1 and ISO 639-2 standards,[130] and consequently redefined as a "macrolanguage", a book-keeping device in the ISO 639-3 standard.[131]

If we want to keep it as a macrolanguage, the alpha3 code can be taken from ISO 639-3
https://iso639-3.sil.org/code/hbs

After reading https://en.wikipedia.org/wiki/Serbo-Croatian I guess you are right, "Serbo-Croatian" as a language itself is what has been marked as legacy.


3 participants