Reimplement idna on top of ICU4X #923

hsivonen · 2024-04-09T16:17:25Z

Opening as draft PR to enable early feedback while the dependency remains unlanded in ICU4X.

The motivation for reimplementing the idna crate on top of ICU4X is to let Firefox move its IDNA handling to ICU4X (instead of the current combination of ICU4C and very old code). The ICU4X normalizer is faster than unicode-normalization, and it represents UTS 46 data as a normalization rather than as separate data, which is how the idna crate currently represents it.

The benchmarks in the idna crate itself show that this PR improves performance. It is also more correct than the old code: I removed the skipping of the ContextJ tests from the harness that runs the UTS 46 test suite.

See the added README for removed capabilities. I searched GitHub for public code using the idna crate, and I believe the removals mostly require no action from the ecosystem and are tolerable when they do.

For projects that use ICU4X for normalization (or collation), this change has the benefit of deduplicating data across normalization and IDNA handling. There is the ecosystem risk of causing projects that use unicode-normalization for normalization in ways other than as a dependency of idna to end up with more data. One way to mitigate that (already preliminarily discussed with the maintainer) would be to introduce a cargo feature to unicode-normalization that would delegate the unicode-normalization internals to ICU4X (better performance, more crates in the dependency tree).

Not properly investigated yet: Binary size impact.

valenting · 2024-04-11T08:06:23Z

I think you need to explicitly add a dependency for the icu crate, instead of using a relative path

valenting · 2024-04-12T10:47:29Z

idna/Cargo.toml

-unicode-bidi = { version = "0.3.10", default-features = false, features = ["hardcoded-data"] }
-unicode-normalization = { version = "0.1.22", default-features = false }
+icu_normalizer = { path = "../../icu4x/components/normalizer", features = ["compiled_data"] }
+icu_properties = { path = "../../icu4x/components/properties", features = ["compiled_data"] }


Suggested change

-icu_properties = { path = "../../icu4x/components/properties", features = ["compiled_data"] }
+icu_properties = { version = "1.4.0", features = ["compiled_data"] }

valenting

We need to add a direct dependency on the icu crates

hsivonen · 2024-04-18T10:41:23Z

Yeah, the dependency declaration will change when this becomes a non-draft.

hsivonen · 2024-04-24T12:45:29Z

Using the demo https://github.com/hsivonen/urldemo (strip=true, lto=true, opt_level="z") and binaryen wasm-opt -Oz on the result, I get 215085 bytes with this patch and 310986 without, so this should not only improve performance but should also make (Wasm at least) binary size smaller.

djc

As someone who spent a bunch of time optimizing the idna crate a few years ago, cool to see more speedups here! Here's a bunch of stylistic suggestions, which could be applied more generally to a bunch of the code that was rewritten here.

idna/src/deprecated.rs

idna/src/punycode.rs

…s required) Since other changes in this changeset require a semver break anyway, this change takes a semver break in the case of `default-features = false` in order to avoid a future semver break if in the future a need to add a bring-your-own-data (using `icu_provider`) constructor for `Uts46` shows up.

hsivonen · 2024-05-03T09:13:59Z

Since these changes require a semver increment anyway, I took the opportunity to add a currently-required compiled_data feature. This future-proofs against having to take another semver break if a use case for dynamic data loading using the ICU4X provider shows up. (CC @sffc)

From my perspective, this PR is now done except for changing the ICU4X dependencies to point to crates.io once unicode-org/icu4x#4712 has landed and been published to crates.io. I am leaving this PR in the draft state until then, but review is welcome before it changes to non-draft.

hsivonen added 14 commits March 20, 2024 17:43

Reimplement idna on top of ICU4X

a8977a8

Add an even faster lower-case ASCII letter path to avoid regressing performance

09765af

Comments and verify_dns_length tweak

7e929ce

Parametrize internal vs. external Punycode caller; restore external API behavior

f413387

Add bench for to_ascii on an already-Punycode name

71c03b9

Avoid re-encoding Punycode when possible

9af00cb

Pass through the input slice in many more cases

dc8f301

Add testing for the simultaneous mode

41e0192

Omit the invalid domain character check on the url side

41f2107

Document that Punycode labels must result in non-ASCII

4d7d41a

Rename files called uts46.rs to deprecated.rs

98ca752

Rename uts46bis to uts46

4bbabe9

Tweak docs

7dc0082

Avoid useless copying and useless UTF-8 decode

f8eb96e
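
Several commits in this group ("Pass through the input slice in many more cases", "Avoid useless copying and useless UTF-8 decode") are about returning the input unchanged when no processing is needed. The general idea can be sketched with `std::borrow::Cow`. This is only an illustration under assumptions: the function name `passthrough_or_lowercase` is hypothetical, the check is simplified, and ASCII lowercasing stands in for the real UTS 46 processing. It is not the crate's code.

```rust
use std::borrow::Cow;

/// Hypothetical helper, not part of the idna crate.
/// Borrows the input when it is already in a trivially normalized form
/// (lower-case ASCII letters, digits, hyphens and dots) and allocates
/// only when some work is needed.
fn passthrough_or_lowercase(input: &str) -> Cow<'_, str> {
    let already_normalized = input
        .bytes()
        .all(|b| matches!(b, b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.'));
    if already_normalized {
        // Fast path: no copy, the caller gets the original slice back.
        Cow::Borrowed(input)
    } else {
        // Placeholder for real processing: ASCII lowercasing only.
        Cow::Owned(input.to_ascii_lowercase())
    }
}

fn main() {
    let fast = passthrough_or_lowercase("example.com");
    let slow = passthrough_or_lowercase("Example.COM");
    assert!(matches!(fast, Cow::Borrowed(_)));
    assert!(matches!(slow, Cow::Owned(_)));
    assert_eq!(slow, "example.com");
}
```

The benefit of this shape is that callers whose input is already normalized pay for a scan but not for an allocation.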

valenting reviewed Apr 12, 2024

valenting requested changes Apr 12, 2024

hsivonen added 5 commits April 15, 2024 14:29

Use inline(never) to optimize binary size

eb6e3d5

Split CheckHyphens into a separate concern from the ASCII deny list

ce3d4d1

Make the ASCII deny list customizable

6672161

Better docs and top-level functions

90fe4b3

Parameter for VerifyDNSLength

50381ff
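
For readers unfamiliar with the VerifyDnsLength parameter named in the commit above: UTS 46 describes it as requiring the domain name, excluding the root label and its dot, to be 1 to 253 long, and each label to be 1 to 63 long. A minimal sketch of that rule follows, assuming the input is the already-converted ASCII form. The function name `passes_verify_dns_length` is hypothetical and the crate's own implementation may differ in details.

```rust
/// Hypothetical helper, not part of the idna crate.
/// Checks the UTS 46 VerifyDnsLength rule on an ASCII domain name.
fn passes_verify_dns_length(ascii_domain: &str) -> bool {
    // Exclude the root label and its dot (a single trailing '.').
    let domain = ascii_domain.strip_suffix('.').unwrap_or(ascii_domain);
    // Whole name: 1 to 253 bytes.
    if domain.is_empty() || domain.len() > 253 {
        return false;
    }
    // Each label: 1 to 63 bytes.
    domain
        .split('.')
        .all(|label| (1..=63).contains(&label.len()))
}

fn main() {
    assert!(passes_verify_dns_length("example.com"));
    assert!(passes_verify_dns_length("example.com."));
    assert!(!passes_verify_dns_length(""));
    assert!(!passes_verify_dns_length("a..b"));
    assert!(!passes_verify_dns_length(&"a".repeat(64)));
}
```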

hsivonen added 6 commits April 18, 2024 14:11

Restore support for transitional processing to minimize breakage

8268c5a

In the deprecated API, use empty deny list with use_std3_ascii_rules=false

999bef4

Tweak docs

b277c85

Docs, rename AsciiDenyList::WHATWG to ::URL, tweak top-level functions

980348c

Use idna crate top-level function in the url crate to dogfood the top-level function

4efd589

Add a Usage section to the README

da6cf50

hsivonen mentioned this pull request Apr 24, 2024

WASM file size #557

Open

djc reviewed Apr 24, 2024

hsivonen added 7 commits April 26, 2024 14:32

Add an early return to map_transitional for readability

d938024

Document internal vs. external Punycode caller differences

679edb9

Per discussion with Valentin, revert deprecated API to the old behavior that does not check hyphens in positions 3 and 4

4f605c9

Add comments about not fixing deprecated API

bbf4308

Merge branch 'main' into icu4x

e842dae

Add a comment explaining FailFast in deprecated.rs

6690c49
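
The commits above refer to CheckHyphens and to hyphens "in positions 3 and 4". As background, UTS 46 states that with CheckHyphens a label must not contain U+002D HYPHEN-MINUS in both the third and fourth positions, and must neither begin nor end with one. A minimal sketch of that rule is below. The function name `passes_check_hyphens` is hypothetical, and this is not how the crate structures the check.

```rust
/// Hypothetical helper, not part of the idna crate.
/// Checks the UTS 46 CheckHyphens rule on a single label.
fn passes_check_hyphens(label: &str) -> bool {
    let mut chars = label.chars();
    let first = chars.next();
    let _second = chars.next();
    let third = chars.next();
    let fourth = chars.next();

    // Hyphens in both the third and fourth positions are disallowed.
    let hyphens_in_3_and_4 = third == Some('-') && fourth == Some('-');
    // A label must neither begin nor end with a hyphen.
    let leading = first == Some('-');
    let trailing = label.ends_with('-');

    !(hyphens_in_3_and_4 || leading || trailing)
}

fn main() {
    assert!(passes_check_hyphens("example"));
    assert!(passes_check_hyphens("a-b"));
    assert!(!passes_check_hyphens("-example"));
    assert!(!passes_check_hyphens("example-"));
    assert!(!passes_check_hyphens("ab--cd"));
}
```

Per the commit titles, the deprecated API was reverted to its old behavior of not checking positions 3 and 4, so a check like the `hyphens_in_3_and_4` part would not apply there.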
