New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reimplement idna on top of ICU4X #923
base: main
Are you sure you want to change the base?
Conversation
I think you need to explicitly add a dependency for the icu crate, instead of using a relative path |
idna/Cargo.toml
Outdated
unicode-bidi = { version = "0.3.10", default-features = false, features = ["hardcoded-data"] } | ||
unicode-normalization = { version = "0.1.22", default-features = false } | ||
icu_normalizer = { path = "../../icu4x/components/normalizer", features = ["compiled_data"] } | ||
icu_properties = { path = "../../icu4x/components/properties", features = ["compiled_data"] } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
icu_properties = { path = "../../icu4x/components/properties", features = ["compiled_data"] } | |
icu_properties = { version = "1.4.0",features = ["compiled_data"] } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to add a direct dependency on the icu crates
Yeah, the dependency declaration will change when this becomes a non-draft. |
Using the demo https://github.com/hsivonen/urldemo (strip=true, lto=true, opt_level="z") and binaryen wasm-opt -Oz on the result, I get 215085 bytes with this patch and 310986 without, so this should not only improve performance but should also make (Wasm at least) binary size smaller. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As someone who spent a bunch of time optimizing the idna crate a few years ago, cool to see more speedups here! Here's a bunch of stylistic suggestions, which could be applied more generally to a bunch of the code that was rewritten here.
…or that does not check hyphens in positions 3 and 4
…s required) Since other changes in this changeset require a semver break anyway, this change takes a semver break in the case of `default-features = false` in order to avoid a future semver break if in the future a need to add a bring-your-own-data (using `icu_provider`) constructor for `Uts46` shows up.
Since these changes require a semver increment anyway, I took the opportunity to add a currently-required From my perspective, this PR is now done expect for changing the ICU4X dependencies to point to crates.io once unicode-org/icu4x#4712 has landed and been published to crates.io. Leaving this PR in the draft state until then, but review is welcome before changing to non-draft. |
Opening as draft PR to enable early feedback while the dependency remains unlanded in ICU4X.
The motivation of reformulating the
idna
crate on top of ICU4X is to be able to move Firefox's IDNA handling to use ICU4X (instead of the current combination of ICU4C and very old code). The ICU4X normalizer is faster thanunicode-normalization
and the ICU4X normalizer represents UTS 46 data as a normalization as opposed to representing it separately like theidna
crate currently does.The benchmarks in the
idna
crate itself show this PR to result in faster performance. This is also more correct than the old code: I removed skipping of the ContextJ tests from the harness that runs the UTS 46 test suite.See the added README for removed capabilities. I searched for GitHub for public code using the
idna
crate, and I believe the removals to be mostly not need action from the ecosystem and to be tolerable when they do.For projects that use ICU4X for normalization (or collation), this change has the benefit of deduplicating data across normalization and IDNA handling. There is the ecosystem risk of causing projects that use
unicode-normalization
for normalization in ways other than as a dependency ofidna
to end up with more data. One way to mitigate that (already preliminarily discussed with the maintainer) would be to introduce a cargo feature tounicode-normalization
that would delegate theunicode-normalization
internals to ICU4X (better performance, more crates in the dependency tree).Not properly investigated yet: Binary size impact.