Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reimplement idna on top of ICU4X #923

Draft
wants to merge 32 commits into
base: main
Choose a base branch
from
Draft

Reimplement idna on top of ICU4X #923

wants to merge 32 commits into from

Conversation

hsivonen
Copy link
Contributor

@hsivonen hsivonen commented Apr 9, 2024

Opening as draft PR to enable early feedback while the dependency remains unlanded in ICU4X.

The motivation of reformulating the idna crate on top of ICU4X is to be able to move Firefox's IDNA handling to use ICU4X (instead of the current combination of ICU4C and very old code). The ICU4X normalizer is faster than unicode-normalization and the ICU4X normalizer represents UTS 46 data as a normalization as opposed to representing it separately like the idna crate currently does.

The benchmarks in the idna crate itself show this PR to result in faster performance. This is also more correct than the old code: I removed skipping of the ContextJ tests from the harness that runs the UTS 46 test suite.

See the added README for removed capabilities. I searched for GitHub for public code using the idna crate, and I believe the removals to be mostly not need action from the ecosystem and to be tolerable when they do.

For projects that use ICU4X for normalization (or collation), this change has the benefit of deduplicating data across normalization and IDNA handling. There is the ecosystem risk of causing projects that use unicode-normalization for normalization in ways other than as a dependency of idna to end up with more data. One way to mitigate that (already preliminarily discussed with the maintainer) would be to introduce a cargo feature to unicode-normalization that would delegate the unicode-normalization internals to ICU4X (better performance, more crates in the dependency tree).

Not properly investigated yet: Binary size impact.

@valenting
Copy link
Collaborator

I think you need to explicitly add a dependency for the icu crate, instead of using a relative path

idna/Cargo.toml Outdated
unicode-bidi = { version = "0.3.10", default-features = false, features = ["hardcoded-data"] }
unicode-normalization = { version = "0.1.22", default-features = false }
icu_normalizer = { path = "../../icu4x/components/normalizer", features = ["compiled_data"] }
icu_properties = { path = "../../icu4x/components/properties", features = ["compiled_data"] }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
icu_properties = { path = "../../icu4x/components/properties", features = ["compiled_data"] }
icu_properties = { version = "1.4.0",features = ["compiled_data"] }

Copy link
Collaborator

@valenting valenting left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to add a direct dependency on the icu crates

@hsivonen
Copy link
Contributor Author

Yeah, the dependency declaration will change when this becomes a non-draft.

@hsivonen
Copy link
Contributor Author

hsivonen commented Apr 24, 2024

Using the demo https://github.com/hsivonen/urldemo (strip=true, lto=true, opt_level="z") and binaryen wasm-opt -Oz on the result, I get 215085 bytes with this patch and 310986 without, so this should not only improve performance but should also make (Wasm at least) binary size smaller.

@hsivonen hsivonen mentioned this pull request Apr 24, 2024
Copy link
Contributor

@djc djc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As someone who spent a bunch of time optimizing the idna crate a few years ago, cool to see more speedups here! Here's a bunch of stylistic suggestions, which could be applied more generally to a bunch of the code that was rewritten here.

idna/src/deprecated.rs Show resolved Hide resolved
idna/src/deprecated.rs Show resolved Hide resolved
idna/src/deprecated.rs Outdated Show resolved Hide resolved
idna/src/deprecated.rs Outdated Show resolved Hide resolved
idna/src/deprecated.rs Show resolved Hide resolved
idna/src/punycode.rs Show resolved Hide resolved
idna/src/punycode.rs Outdated Show resolved Hide resolved
idna/src/punycode.rs Show resolved Hide resolved
…or that does not check hyphens in positions 3 and 4
…s required)

Since other changes in this changeset require a semver break anyway, this
change takes a semver break in the case of `default-features = false` in
order to avoid a future semver break if in the future a need to add a
bring-your-own-data (using `icu_provider`) constructor for `Uts46`
shows up.
@hsivonen
Copy link
Contributor Author

hsivonen commented May 3, 2024

Since these changes require a semver increment anyway, I took the opportunity to add a currently-required compiled_data feature in order to future-proof against having to take a semver break if a use case for dynamic data loading using the ICU4X provider shows up. (CC @sffc )

From my perspective, this PR is now done expect for changing the ICU4X dependencies to point to crates.io once unicode-org/icu4x#4712 has landed and been published to crates.io. Leaving this PR in the draft state until then, but review is welcome before changing to non-draft.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants