[RFC][Intl] Exclude legacy languages? #33165

ro0NL · 2019-08-14T10:12:07Z

Symfony version(s) affected: 3.4

When we generate the list of language names for the Intl component we rely on https://github.com/unicode-org/icu/blob/master/icu4c/source/data/lang/en.txt

This provides many translations, but is not the authoritative code list as-is.

See e.g. no and sh, which are both marked legacy in the metadata file.

The problem comes with ISO vs. ICU. ISO qualifies e.g. no as a macrolanguage (https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) and it seems valid today: https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=no, but sh is not: https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=sh

Not sure what to do :) Issue for now, to move forward in #33140

For now this seems to cause the lists (alpha2 vs. alpha3) to vary.

TerjeBr · 2019-08-14T10:43:28Z

I am from Norway, and I hold the strong opinion that the deprecation of "no" and/or "nor" as a macrolanguage is a bug.

ro0NL · 2019-08-14T11:15:40Z

We can also wait a bit to see what happens on the next ICU update, though I'm not aware of any release dates.

Getting the data consistent upstream (assuming ISO is right) would also be ideal, but I'm not sure where it actually comes from, since ICU in turn merges from CLDR data.

TerjeBr · 2019-08-14T11:16:40Z

I think the only issue here is that sh is deprecated according to the official source https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=sh but is still not deprecated in the icu4c source.

Since this causes a problem for us, with alpha2 codes not having the same list of languages as alpha3 codes, I think we should follow the official source and just add our own patch to the data.

ro0NL · 2019-08-14T11:34:14Z

> I think the only issue here is that sh is deprecated according to the official source https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=sh but is still not deprecated in the icu4c source.

sh is marked legacy in ICU also https://github.com/unicode-org/icu/blob/master/icu4c/source/data/misc/metadata.txt#L1234

this is consistent with ISO, but no is inconsistent (e.g. it's not deprecated on ISO's side).

another issue is that ICU points nor to nb:

no{
    reason{"legacy"}
    replacement{"nb"}
}
nob{
    reason{"overlong"}
    replacement{"nb"}
}
nor{
    reason{"overlong"}
    replacement{"nb"}
}

so this would still cause different lists (alpha2=nb vs alpha3=nor,nob), which seems weird i agree 😅
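To illustrate the mismatch, here is a minimal Python sketch that inverts the three alias entries quoted above. The dictionary layout and the function names are assumptions made for illustration; they are not how Symfony or ICU represent this data.

```python
# Alias entries copied from the ICU metadata excerpt above.
# The dict layout is an assumption made for this sketch.
ALIASES = {
    "no": {"reason": "legacy", "replacement": "nb"},
    "nob": {"reason": "overlong", "replacement": "nb"},
    "nor": {"reason": "overlong", "replacement": "nb"},
}


def alpha3_to_alpha2(aliases):
    """Map each three-letter 'overlong' alias to its two-letter replacement."""
    return {
        code: entry["replacement"]
        for code, entry in aliases.items()
        if len(code) == 3 and entry["reason"] == "overlong"
    }


def alpha2_to_alpha3(aliases):
    """Invert the mapping, collecting every alpha3 code per alpha2 code."""
    inverse = {}
    for alpha3, alpha2 in sorted(alpha3_to_alpha2(aliases).items()):
        inverse.setdefault(alpha2, []).append(alpha3)
    return inverse


print(alpha2_to_alpha3(ALIASES))  # {'nb': ['nob', 'nor']}
```

With the upstream data as-is, nb ends up with two alpha3 candidates, while no has none at all.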

TerjeBr · 2019-08-14T11:46:43Z

Yes, so the solution is clear to me then:

Drop sh because it is legacy.
Make no point to nor and nor point to no

I think we will have to wait for a very long time for this to get sorted in the upstream, so I think we should patch the data after we read it from the ICU source.
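A minimal sketch of what such a post-read patch might look like, assuming the alias metadata has already been parsed into a dictionary. The names `OVERRIDES` and `patch_aliases` are hypothetical and not part of Symfony or ICU; the `sr_Latn` replacement for sh follows what is said elsewhere in this thread.

```python
# Alias metadata as it might look after parsing ICU's metadata.txt.
# The structure is an assumption made for this sketch.
ICU_ALIASES = {
    "no": {"reason": "legacy", "replacement": "nb"},
    "nor": {"reason": "overlong", "replacement": "nb"},
    "nob": {"reason": "overlong", "replacement": "nb"},
    "sh": {"reason": "legacy", "replacement": "sr_Latn"},
}

# Hypothetical local overrides: None removes an alias entry so the code is
# treated as a regular language; a dict replaces the upstream entry.
OVERRIDES = {
    "no": None,
    "nor": {"reason": "overlong", "replacement": "no"},
}


def patch_aliases(aliases, overrides):
    """Return a patched copy of the alias table; the input is left untouched."""
    patched = {code: dict(entry) for code, entry in aliases.items()}
    for code, entry in overrides.items():
        if entry is None:
            patched.pop(code, None)
        else:
            patched[code] = dict(entry)
    return patched


patched = patch_aliases(ICU_ALIASES, OVERRIDES)
```

After the patch, no is no longer an alias, nor resolves to no, and sh keeps its upstream legacy entry so it can still be dropped.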

TerjeBr · 2019-08-14T11:54:24Z

Any reference to sh in the locales (list of translated language names) should be treated as if it was sr_Latn

ro0NL · 2019-08-14T12:02:01Z

i'd like to see the impacted data if we skip all reason{"legacy"}, and whitelist as needed (e.g. no is not legacy and maps to nor, as verified by you). So yes, curious if there are more similar cases :)
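One way to inspect the impacted data could look like the following Python sketch. It splits a language list into kept and dropped entries based on the legacy reason, with a whitelist for exceptions. The sample names, the dictionary layout and the function name `split_legacy` are assumptions for illustration only.

```python
# Small illustrative sample; the real lists are far longer.
LANGUAGE_NAMES = {
    "nb": "Norwegian Bokmål",
    "no": "Norwegian",
    "sh": "Serbo-Croatian",
    "sr": "Serbian",
}

# Assumed shape of the parsed alias metadata.
ALIASES = {
    "no": {"reason": "legacy", "replacement": "nb"},
    "sh": {"reason": "legacy", "replacement": "sr_Latn"},
}

# Codes kept even though upstream marks them as legacy.
WHITELIST = {"no"}


def split_legacy(names, aliases, whitelist=frozenset()):
    """Return (kept, dropped) so the dropped entries can be reviewed."""
    kept, dropped = {}, {}
    for code, name in names.items():
        entry = aliases.get(code)
        is_legacy = entry is not None and entry["reason"] == "legacy"
        if is_legacy and code not in whitelist:
            dropped[code] = name
        else:
            kept[code] = name
    return kept, dropped


kept, dropped = split_legacy(LANGUAGE_NAMES, ALIASES, WHITELIST)
print(sorted(dropped))  # ['sh']
```

Running the same function with an empty whitelist lists every code that a blanket "skip legacy" rule would remove.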

here's another interesting case 😓 https://github.com/unicode-org/icu/blob/master/icu4c/source/data/lang/en.txt#L77 which is also not ISO :)

> Any reference to sh in the locales (list of translated language names) should be treated as if it was sr_Latn

sr_Latn is already included by itself, and during lookup we resolve aliases. So yes, I agree: if we decide to exclude legacy languages, we should also exclude the locales bound to those languages from the list.
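As a rough sketch of that exclusion, assuming locale identifiers always start with the language code followed by an underscore-separated suffix: the sample entries are a small excerpt of locales/en.json, and the function name is hypothetical.

```python
# Excerpt of locale names; keys start with the language code.
LOCALES = {
    "sh": "Serbo-Croatian",
    "sh_BA": "Serbo-Croatian (Bosnia & Herzegovina)",
    "sr": "Serbian",
    "sr_BA": "Serbian (Bosnia & Herzegovina)",
    "sr_Latn": "Serbian (Latin)",
    "sr_Latn_BA": "Serbian (Latin, Bosnia & Herzegovina)",
}

# Languages that were excluded as legacy (assumed input).
LEGACY_LANGUAGES = {"sh"}


def exclude_legacy_locales(locales, legacy_languages):
    """Drop every locale whose language subtag is a legacy language."""
    return {
        locale: name
        for locale, name in locales.items()
        if locale.split("_", 1)[0] not in legacy_languages
    }


remaining = exclude_legacy_locales(LOCALES, LEGACY_LANGUAGES)
print(sorted(remaining))  # ['sr', 'sr_BA', 'sr_Latn', 'sr_Latn_BA']
```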

TerjeBr · 2019-08-14T12:07:07Z

I think "Serbo-Croatian" is very much a live language and language name. It is just that it should be referenced as sr_Latn instead of sh. It is like the other localized language names that do not have their own two-letter or three-letter ISO code, but are bound to a more specialized locale.

TerjeBr · 2019-08-14T12:12:57Z

My point was that if an ISO code is marked as legacy (like sh), it is the code that is legacy and not necessarily the language name it points to. The fact that we have a locale bound to that language that is not marked as legacy shows that the language itself is not legacy.

TerjeBr · 2019-08-14T12:18:12Z

Yes, I agree that "American Sign Language" should not be there.

ro0NL · 2019-08-14T12:28:52Z

> My point was that if an ISO code is marked as legacy (like sh), it is the code that is legacy and not necessarily the language name it points to. That we have a locale bound to that language that is not marked as legacy testifies to that the language itself is not legacy.

Not sure I understand that, or the actual concern raised.

My point is that the locale list currently includes aliases (I consider that duplication):

symfony/src/Symfony/Component/Intl/Resources/data/locales/en.json, lines 489 to 490 in 0bdf10a:

    "sh": "Serbo-Croatian",
    "sh_BA": "Serbo-Croatian (Bosnia & Herzegovina)",

whereas ICU tells us to use sr(_Latn) instead:

symfony/src/Symfony/Component/Intl/Resources/data/locales/en.json, lines 507 to 508 in 0bdf10a:

    "sr": "Serbian",
    "sr_BA": "Serbian (Bosnia & Herzegovina)",

symfony/src/Symfony/Component/Intl/Resources/data/locales/en.json, lines 513 to 514 in 0bdf10a:

    "sr_Latn": "Serbian (Latin)",
    "sr_Latn_BA": "Serbian (Latin, Bosnia & Herzegovina)",

TerjeBr · 2019-08-14T13:00:45Z

According to https://en.wikipedia.org/wiki/Serbo-Croatian:

> ISO classification
>
> Since the year 2000, the ISO classification does not recognize Serbo-Croatian as an individual language. Originally included, it has been removed from the ISO 639-1 and ISO 639-2 standards,[130] and consequently redefined as a "macrolanguage", a book-keeping device in the ISO 639-3 standard.[131]

If we want to keep it as a macrolanguage, the alpha3 code can be taken from ISO 639-3:
https://iso639-3.sil.org/code/hbs

After reading https://en.wikipedia.org/wiki/Serbo-Croatian I guess you are right: "Serbo-Croatian" as a language itself is what has been marked as legacy.

ro0NL mentioned this issue Aug 14, 2019

[Intl] Full alpha3 language support #33140

Merged

xabbuh added the Intl and RFC labels Aug 14, 2019

ro0NL mentioned this issue Mar 20, 2020

[Intl] getAlpha3Code() OutOfBoundsException for Norwegian #36145

Closed

ro0NL closed this as completed Oct 4, 2020
