Consider supporting retrieval of the language preference list from the system #3990

Open
hsivonen opened this issue Sep 1, 2023 · 19 comments
Labels
C-locale Component: Locale identifiers, BCP47 good first issue Good for newcomers S-large Size: A few weeks (larger feature, major refactoring) U-ecma402 User: ECMA-402 compatibility

Comments

@hsivonen
Member

hsivonen commented Sep 1, 2023

Consider providing functionality (with std, not with no_std) for retrieving the user's system-level preference list of languages as ICU4X locales.

On Windows, Gecko prefers https://learn.microsoft.com/en-us/uwp/api/windows.system.userprofile.globalizationpreferences.languages?view=winrt-22621#windows-system-userprofile-globalizationpreferences-languages and adds region with likely subtags if the system gives a language only.
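As a rough illustration of the "add region with likely subtags" step, here is a minimal sketch. The function name `add_likely_region` and the tiny hard-coded table are assumptions made for this example; a real implementation would presumably use full CLDR likely-subtags data rather than a handful of entries.

```rust
/// Adds a region to a language-only tag using a tiny, hard-coded table.
/// The table is illustrative only; real data would come from CLDR likely subtags.
fn add_likely_region(tag: &str) -> String {
    // If the tag already has more than a language subtag, leave it alone.
    if tag.contains('-') {
        return tag.to_string();
    }
    let region = match tag {
        "en" => Some("US"),
        "sv" => Some("SE"),
        "ja" => Some("JP"),
        "fi" => Some("FI"),
        // Unknown language: no region is guessed.
        _ => None,
    };
    match region {
        Some(r) => format!("{tag}-{r}"),
        None => tag.to_string(),
    }
}
```

With this table, a language-only `sv` becomes `sv-SE`, while a tag that already carries more subtags, such as `en-GB`, is returned unchanged.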

On Mac, Gecko uses https://developer.apple.com/documentation/corefoundation/1542887-cflocalecopypreferredlanguages

On Android, Gecko prefers https://developer.android.com/reference/android/os/LocaleList#getDefault() . Not sure if it's practical to call a Java method, even a static one, deep within Rust code when the Rust code isn't responsible for the whole app's JNI setup.

On Gtk, Gecko delegates to ICU4C, which AFAICT calls setlocale(LC_MESSAGES, NULL); and performs fixup. It appears (note the author of the answer) that it's OK to call glibc setlocale to read (not actually set) a value, and nothing else in the process actually sets the value, either, so it's constant for the lifetime of the process. Obviously, this code path retrieves only one locale.

(Note: Gecko already has non-ICU4C C++ code for this (except on Gtk), so whereas #3059 is deliberately U-gecko-tagged, I'm filing this as a general U-ecma402 courtesy without the usual implication that everything U-ecma402 is implicitly U-gecko.)

(Note 2: ECMA-402 default locale isn't a preference list and is implied to have data available for it across all the ECMA-402 objects, so implementing ECMA-402 on top of what's suggested above would involve further filtering.)
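The "further filtering" in Note 2 could look roughly like the sketch below: walk the preference list and take the first entry for which data is available. The name `pick_default` and the use of plain strings instead of locale types are assumptions for illustration; exact string matching stands in for whatever matching an actual implementation would use.

```rust
/// Picks the first preferred locale for which data is available.
/// Matching is by exact string equality, which is a simplification.
fn pick_default<'a>(preferences: &[&'a str], available: &[&str]) -> Option<&'a str> {
    preferences
        .iter()
        .copied()
        .find(|candidate| available.contains(candidate))
}
```

Returning `Option` leaves the choice of a last-resort locale to the caller when nothing in the preference list has data.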

@hsivonen hsivonen added C-locale Component: Locale identifiers, BCP47 U-ecma402 User: ECMA-402 compatibility labels Sep 1, 2023
@robertbastian
Member

Related: #3059

@sffc sffc added the needs-approval One or more stakeholders need to approve proposal label Sep 21, 2023
@sffc
Member

sffc commented Sep 21, 2023

Is this in scope? Seeking feedback from:

@Manishearth
Member

I think this is in scope as a utils crate we publish separately, perhaps with icu_locid integration. I don't consider this a priority.

@zbraniecki
Member

I would like to consider this out of scope of ICU. I would name such a crate env_i18n or something similar, and I'd be happy to suggest we maintain it and ensure that it uses ICU primitives, but I don't like the idea of ICU having hooks into env/os. I think separation of concerns is valuable.

@sffc
Member

sffc commented Sep 21, 2023

Based on feedback from the i18n unconference at RustConf, devs want libraries that "just work" and integrating nicely with the operating system is part of that. We are in a decent position to write this type of code. I don't know where exactly this code lands, but if it lands in the icu4x repository, it should probably be under utils.

@zbraniecki
Member

Ok, I'm comfortable with this as long as it's explicit.

@sffc sffc removed the needs-approval One or more stakeholders need to approve proposal label Sep 21, 2023
@sffc sffc added S-large Size: A few weeks (larger feature, major refactoring) good first issue Good for newcomers labels Oct 5, 2023
@sffc sffc added this to the Priority Backlog ⟨P3⟩ milestone Oct 5, 2023
@sffc
Member

sffc commented Oct 5, 2023

This is likely a "good first issue" because the API surface is small: mostly a function that returns the system locale as an icu::locid::Locale, which works on all platforms according to what @hsivonen posted above.

@sffc
Member

sffc commented Feb 5, 2024

Comment from @VorpalBlade in #4580:

Hi,

I'm looking into options for localising one of my Rust command line programs, and I'm a bit confused about icu4x: How do you actually get from a system locale to the proper ICU4X settings? E.g. consider something like this mixed POSIX locale (what I actually use):

$ locale
LANG=en_GB.UTF-8
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC=sv_SE.UTF-8
LC_TIME=sv_SE.UTF-8
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY=sv_SE.UTF-8
LC_MESSAGES="en_GB.UTF-8"
LC_PAPER=sv_SE.UTF-8
LC_NAME=sv_SE.UTF-8
LC_ADDRESS=sv_SE.UTF-8
LC_TELEPHONE=sv_SE.UTF-8
LC_MEASUREMENT=sv_SE.UTF-8
LC_IDENTIFICATION=sv_SE.UTF-8
LC_ALL=

In which ICU crate is the logic to resolve this (and whatever Windows and Mac OS use, I'm not familiar with programming on those platforms) to the proper locale to use in ICU4X for each feature?

(The reason I use a mixed locale like this, as do many other people I know, is that message translations tend to be poor and translated error messages are ungooglable. But I still want Swedish dates, 24-hour time, decimal comma, etc.)

@ashu26jha
Contributor

Here are my findings so far:

  1. There is a windows-rs crate which we will use to call APIs in this fashion. Docs

  2. On Mac we need to use FFI, as we need to call C functions. We may need to use objc, as I am not sure it would be possible through Diplomat.

  3. On Ubuntu, we have two options to choose from:
    a. Use std::env, but there is a slight chance that the variable is not set in the environment, and this could panic if the result is unwrapped
    b. We could run a command from Rust which will return the locale.
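For option (a), a missing variable does not have to panic: `std::env::var` returns a `Result`, so the unset case can be handled explicitly. The sketch below is a minimal illustration using only `std`; the helper names `read_locale_var` and `first_locale_var` are hypothetical.

```rust
use std::env;

/// Reads an environment variable, treating "unset", "empty" and "not valid
/// Unicode" all as `None` instead of panicking.
fn read_locale_var(name: &str) -> Option<String> {
    // `env::var` reports a missing variable as `Err` rather than panicking.
    env::var(name).ok().filter(|value| !value.is_empty())
}

/// Tries the variables in the given order and returns the first usable value.
fn first_locale_var(names: &[&str]) -> Option<String> {
    names.iter().find_map(|name| read_locale_var(name))
}
```

A caller could then write something like `first_locale_var(&["LC_ALL", "LC_MESSAGES", "LANG"])` and fall back to a default when it returns `None`.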

Thoughts regarding this @ everyone in this thread?

@VorpalBlade

VorpalBlade commented Mar 20, 2024

@ashu26jha I'm only going to comment on Linux / non-Mac *nix, since that is the only thing I'm even remotely qualified to talk about:

It shouldn't be too hard to implement the resolution logic for *nix in pure Rust after reading the environment variables (that may or may not be set). It is after all standardised by POSIX. There is a really long section of the POSIX standard on locales: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html A lot of that is not needed here as it is about a definition language for locales.

In this case see https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html#tag_08 (section 8.2) for locale resolution order. On modern systems I believe the fallback in practice when nothing is set is C.UTF-8, it used to be just the C locale (ASCII or indeterminate encoding I believe).

The values of locale categories shall be determined by a precedence order; the first condition met below determines the value:

  1. If the LC_ALL environment variable is defined and is not null, the value of LC_ALL shall be used.

  2. If the LC_* environment variable (LC_COLLATE, LC_CTYPE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, LC_TIME) is defined and is not null, the value of the environment variable shall be used to initialize the category that corresponds to the environment variable.

  3. If the LANG environment variable is defined and is not null, the value of the LANG environment variable shall be used.

  4. If the LANG environment variable is not set or is set to the empty string, the implementation-defined default locale shall be used.

If the locale value is "C" or "POSIX", the POSIX locale shall be used and the standard utilities behave in accordance with the rules in POSIX Locale for the associated category.
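The precedence rules quoted above can be sketched in a few lines of pure Rust. This is only an illustration under stated assumptions: the function name `resolve_category` is made up, and the environment is passed in as a `lookup` closure so the logic can be exercised without touching the real process environment. Returning `None` corresponds to rule 4, where the implementation-defined default applies.

```rust
/// Resolves one locale category following the POSIX precedence:
/// LC_ALL first, then the category's own variable, then LANG.
/// Empty values count as "not set". `None` means the
/// implementation-defined default should be used by the caller.
fn resolve_category<F>(category: &str, lookup: F) -> Option<String>
where
    F: Fn(&str) -> Option<String>,
{
    for name in ["LC_ALL", category, "LANG"] {
        match lookup(name) {
            Some(value) if !value.is_empty() => return Some(value),
            // Unset or empty: fall through to the next variable.
            _ => {}
        }
    }
    None
}
```

Against the real environment this could be called as `resolve_category("LC_TIME", |name| std::env::var(name).ok())`. With the mixed locale shown earlier in the thread, `LC_TIME` would resolve to `sv_SE.UTF-8` and `LC_MESSAGES` to `en_GB.UTF-8`.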

At least glibc also has some additional categories (LC_PAPER for example). But I'm on my phone so I can't easily check which other ones.

You could also consider trying to interpret locale definitions from POSIX (e.g. what the names of the months are etc), but I'm under the impression that ICU4X would probably prefer to use their own mapping from sv_SE and sv_FI etc to "januari" etc. So I believe for the purposes of ICU4X what you need is 1) resolution order logic 2) mapping locale names (if they don't match exactly).
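For the name-mapping part, a first step might be splitting a locale name into its pieces. The sketch below assumes the common `language[_territory][.codeset][@modifier]` shape used by names like `sv_SE.UTF-8`; POSIX leaves the exact format to the implementation, so this is an assumption, and the names `PosixLocaleName` and `parse_posix_locale_name` are hypothetical.

```rust
/// The pieces of a locale name such as `sv_SE.UTF-8@euro`.
#[derive(Debug, PartialEq)]
struct PosixLocaleName<'a> {
    language: &'a str,
    territory: Option<&'a str>,
    codeset: Option<&'a str>,
    modifier: Option<&'a str>,
}

/// Splits a locale name, assuming `language[_territory][.codeset][@modifier]`.
/// No validation is done; "C" simply comes back as a language of "C".
fn parse_posix_locale_name(name: &str) -> PosixLocaleName<'_> {
    // Split off "@modifier" first, then ".codeset", then "_territory".
    let (rest, modifier) = match name.split_once('@') {
        Some((rest, modifier)) => (rest, Some(modifier)),
        None => (name, None),
    };
    let (rest, codeset) = match rest.split_once('.') {
        Some((rest, codeset)) => (rest, Some(codeset)),
        None => (rest, None),
    };
    let (language, territory) = match rest.split_once('_') {
        Some((language, territory)) => (language, Some(territory)),
        None => (rest, None),
    };
    PosixLocaleName { language, territory, codeset, modifier }
}
```

The language and territory could then be handed to whatever builds the ICU4X locale, while the codeset is typically irrelevant and the modifier needs case-by-case handling.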

@JMoogs

JMoogs commented Mar 21, 2024

Hello,

I'm hoping to tackle this issue for GSoC, and in my research, I believe I've found reasonable retrieval methods for each OS:

  • On Windows, as suggested above, the windows crate offers Rust bindings to the Windows API, and the GlobalizationPreferences struct and its methods can be used. Alternatively, windows-sys can be used if preferred.

  • On Mac and iOS, the locale is accessible through the core-foundation-sys crate using CFLocaleCopyPreferredLanguages. Unfortunately, core-foundation doesn't seem to have a safe wrapper for this as of yet.

  • On Linux, if the environment variables are unset, libc can be used to call libc::setlocale(libc::LC_*, ptr::null()). On my NixOS install, this returns "C".

  • Finally, on Android, I believe we can use the __system_property_get method from libc (relevant documentation), with persist.sys.locale as the first port of call (relevant C++ code).

I feel that writing the FFI bindings from scratch for Windows and Mac should be relatively trivial if we wish to reduce dependencies, though this will of course involve more unsafe code.

PS: I feel this issue is quite detached from the rest of the crate. Could someone point me towards some relevant tasks I can attempt, to gain some familiarity with the crate?

@VorpalBlade

@JMoogs I don't believe that would work on Unix, due to rust-lang/rust#27970

Basically it is unsound to call C functions that read the environment. If you can read the environment variables from Rust instead it should be fine (as std has a lock internally).

@DemiMarie

@VorpalBlade I think the proper fix is on the Rust side, by having the setter functions panic!() if multiple threads are running.

@VorpalBlade

@VorpalBlade I think the proper fix is on the Rust side, by having the setter functions panic!() if multiple threads are running.

That would be great, but unfortunately not much has happened in recent years with that bug. :(

@JMoogs

JMoogs commented Mar 21, 2024

@JMoogs I don't believe that would work on Unix, due to rust-lang/rust#27970

Basically it is unsound to call C functions that read the environment. If you can read the environment variables from Rust instead it should be fine (as std has a lock internally).

I was under the impression that a locale could be set without having an associated environment variable. It seems this isn't the case, so a pure Rust implementation should work on Linux.

@VorpalBlade

In glibc, setlocale is documented as MT-Unsafe at least, so that is worth considering.

@ashu26jha
Contributor

ashu26jha commented Mar 23, 2024

@VorpalBlade I went through your links. They are a good resource, but I happened to find one corner case:

If the locale value begins with a slash, it shall be interpreted as the pathname of a file that was created in the output format used by the localedef utility; see OUTPUT FILES under localedef. Referencing such a pathname shall result in that locale being used for the indicated category.

I think we need to think about this case, we could have an enum which looks something like this:

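use std::path::PathBuf;

// A locale value is either a locale name or the path of a localedef output file.
enum Locale {
    Name(String),
    Path(PathBuf),
}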

I personally feel the crux of this feature is not getting the locales but making sure they map correctly (locale names, if they don't match), which will require thorough testing.
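The slash-prefixed corner case could be detected before any name mapping happens. The following is a minimal sketch of that classification step; the names `LocaleValue` and `classify_locale_value` are hypothetical and chosen to avoid clashing with the ICU4X `Locale` type.

```rust
use std::path::PathBuf;

/// A raw locale value as found in the environment.
#[derive(Debug, PartialEq)]
enum LocaleValue {
    /// A locale name such as `en_GB.UTF-8`.
    Name(String),
    /// A pathname of a file produced by the `localedef` utility.
    Path(PathBuf),
}

/// Classifies a raw value: per the POSIX text quoted above, a value that
/// begins with a slash is a pathname, anything else is treated as a name.
fn classify_locale_value(value: &str) -> LocaleValue {
    if value.starts_with('/') {
        LocaleValue::Path(PathBuf::from(value))
    } else {
        LocaleValue::Name(value.to_string())
    }
}
```

What to do with the `Path` variant is a separate question; one option would be to treat it as unmappable and fall through to the next source.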

@ashu26jha
Contributor

ashu26jha commented Mar 24, 2024

To keep the proposed crate modular, I think the following should be the workflow:

  1. Get the system's locale as a string using FFI or std (we may need to use other crates if this is too hard or unsafe to implement ourselves)
  2. Build a converter which takes these strings as input and returns a Locale

The converter is needed because it provides a common, standardized layer, which makes it easier to add coverage for more operating systems.

This converter needs to take care of the cases where we don't have a direct mapping, e.g.:

let locale: Locale = locale!("C");

The above code will fail to compile, because "C" is not a valid BCP 47 language subtag, so we need to build a mapping to handle these corner cases.
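A converter along these lines might special-case the names that have no direct mapping before doing any generic rewriting. The sketch below works on plain strings only; the function name `to_bcp47_string` is hypothetical, and mapping "C" and "POSIX" to `und` is an assumption made for illustration, not an established ICU4X behaviour. The resulting string would still need to be parsed and validated as a `Locale` afterwards.

```rust
/// Converts a raw system locale string into a BCP 47-style tag string.
/// Special names without a direct mapping become "und" (an assumption).
fn to_bcp47_string(raw: &str) -> String {
    // Drop the codeset and modifier, e.g. "en_GB.UTF-8" -> "en_GB".
    let base = raw.split(&['.', '@'][..]).next().unwrap_or("");
    match base {
        // Names with no direct mapping; "und" is a placeholder choice.
        "" | "C" | "POSIX" => "und".to_string(),
        // POSIX-style names use '_' where BCP 47 uses '-'.
        other => other.replace('_', "-"),
    }
}
```

Keeping this step string-to-string means each platform backend only has to produce a raw string, and the special-case table lives in one place.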

@ashu26jha
Contributor

As highlighted by @hsivonen for android:

Not sure if it's practical to call a Java method, even a static one, deep within Rust code when the Rust code isn't responsible for the whole app's JNI setup.

We could introduce a C/C++ layer between Java and Rust. Directly handling JNI from Rust is not the best way forward. Most of the overhead would be handled by this layer: it would retrieve the results from the JNI call and convert them into a format suitable for C/C++ (and ultimately for Rust).


9 participants