Reimplement idna on top of ICU4X #923

Open: wants to merge 40 commits into base: main
Changes from 14 commits

Commits:
a8977a8
Reimplement idna on top of ICU4X
hsivonen Feb 14, 2024
09765af
Add an even faster lower-case ASCII letter path to avoid regressing p…
hsivonen Mar 20, 2024
7e929ce
Comments and verify_dns_length tweak
hsivonen Mar 21, 2024
f413387
Parametrize internal vs. external Punycode caller; restore external A…
hsivonen Mar 21, 2024
71c03b9
Add bench for to_ascii on an already-Punycode name
hsivonen Mar 21, 2024
9af00cb
Avoid re-encoding Punycode when possible
hsivonen Mar 21, 2024
dc8f301
Pass through the input slice in many more cases
hsivonen Mar 21, 2024
41e0192
Add testing for the simultaneous mode
hsivonen Mar 21, 2024
41f2107
Omit the invalid domain character check on the url side
hsivonen Mar 21, 2024
4d7d41a
Document that Punycode labels must result in non-ASCII
hsivonen Mar 21, 2024
98ca752
Rename files called uts46.rs to deprecated.rs
hsivonen Mar 21, 2024
4bbabe9
Rename uts46bis to uts46
hsivonen Mar 21, 2024
7dc0082
Tweak docs
hsivonen Mar 21, 2024
f8eb96e
Avoid useless copying and useless UTF-8 decode
hsivonen Apr 11, 2024
eb6e3d5
Use inline(never) to optimize binary size
hsivonen Apr 15, 2024
ce3d4d1
Split CheckHyphens into a separate concern form the ASCII deny list
hsivonen Apr 16, 2024
6672161
Make the ASCII deny list customizable
hsivonen Apr 18, 2024
90fe4b3
Better docs and top-level functions
hsivonen Apr 18, 2024
50381ff
Parameter for VerifyDNSLength
hsivonen Apr 18, 2024
8268c5a
Restore support for transitional processing to minimize breakage
hsivonen Apr 18, 2024
999bef4
In the deprecated API, use empty deny list with use_std3_ascii_rules=…
hsivonen Apr 18, 2024
b277c85
Tweak docs
hsivonen Apr 18, 2024
980348c
Docs, rename AsciiDenyList::WHATWG to ::URL, tweak top-level functions
hsivonen Apr 19, 2024
4efd589
Use idna crate top-level function in the url crate to dogfood the top…
hsivonen Apr 22, 2024
da6cf50
Add an Usage section to the README
hsivonen Apr 24, 2024
d938024
Add an early return to map_transitional for readability
hsivonen Apr 26, 2024
679edb9
Document internal vs. external Punycode caller differences
hsivonen Apr 26, 2024
4f605c9
Per discussion with Valentin, revert deprecated API to the old behavi…
hsivonen May 3, 2024
bbf4308
Add comments about not fixing deprecated API
hsivonen May 3, 2024
e842dae
Merge branch 'main' into icu4x
hsivonen May 3, 2024
6690c49
Add a comment explaining FailFast in deprecated.rs
hsivonen May 3, 2024
38cedad
For future-proofing, add compiled_data cargo feature (currently alway…
hsivonen May 3, 2024
52137e7
Remove remark about spec violation by making root dot permissibility …
hsivonen May 20, 2024
081f44b
Clarify README about IDNA 2003/2008
hsivonen May 20, 2024
aaa7a40
Add a historical remark to the README
hsivonen May 20, 2024
8b03034
Fix typo
hsivonen May 20, 2024
c8a4bd3
Depend on crates.io versions of icu_normalizer and icu_properties
hsivonen May 23, 2024
be3db8e
Address clippy lints
hsivonen May 23, 2024
6020673
Update versions
hsivonen May 23, 2024
245c514
Increment dependency versions
hsivonen May 29, 2024
11 changes: 8 additions & 3 deletions idna/Cargo.toml
@@ -15,7 +15,7 @@ doctest = false

[features]
default = ["std"]
std = ["alloc", "unicode-bidi/std", "unicode-normalization/std"]
std = ["alloc"]
alloc = []

[[test]]
@@ -25,15 +25,20 @@ harness = false
[[test]]
name = "unit"

[[test]]
name = "unitbis"

[dev-dependencies]
assert_matches = "1.3"
bencher = "0.1"
tester = "0.9"
serde_json = "1.0"

[dependencies]
unicode-bidi = { version = "0.3.10", default-features = false, features = ["hardcoded-data"] }
unicode-normalization = { version = "0.1.22", default-features = false }
icu_normalizer = { path = "../../icu4x/components/normalizer", features = ["compiled_data"] }
icu_properties = { path = "../../icu4x/components/properties", features = ["compiled_data"] }
utf8_iter = "1.0.4"
smallvec = { version = "1.13.1", features = ["const_generics"]}

[[bench]]
name = "all"
34 changes: 34 additions & 0 deletions idna/README.md
@@ -0,0 +1,34 @@
# `idna`

IDNA library for Rust implementing [UTS 46: Unicode IDNA Compatibility Processing](https://www.unicode.org/reports/tr46/) as parametrized by the [WHATWG URL Standard](https://url.spec.whatwg.org/#idna).

## What it does

* An implementation of the non-transitional mode of UTS 46 is provided, both with STD3 rules and WHATWG rules.
* A callback mechanism is provided for pluggable logic for deciding if a label is deemed potentially too misleading to render as Unicode in a user interface.
* Errors are marked as U+FFFD REPLACEMENT CHARACTERs in Unicode output so that locations of errors may be illustrated to the user.
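
The error-marking idea in the last bullet can be illustrated with a minimal, self-contained sketch. This is not the crate's implementation: `mark_errors` and its tiny deny list are hypothetical, chosen only to show how replacing offending characters with U+FFFD keeps the error location visible while the `Result` signals that the output is in error.

```rust
/// Hypothetical helper for illustration only; not part of the `idna` API.
/// Replaces each character from a small, made-up deny list with U+FFFD
/// and reports whether any replacement happened.
fn mark_errors(domain: &str) -> (String, Result<(), ()>) {
    // Assumption: a tiny illustrative deny list, not the real STD3 or WHATWG list.
    const DENIED: [char; 3] = [' ', '#', '/'];
    let mut had_error = false;
    let marked: String = domain
        .chars()
        .map(|c| {
            if DENIED.contains(&c) {
                had_error = true;
                '\u{FFFD}' // REPLACEMENT CHARACTER marks the error position
            } else {
                c
            }
        })
        .collect();
    (marked, if had_error { Err(()) } else { Ok(()) })
}
```

A caller could render the first tuple item to show where the problem is, and would consult the second item before using the string anywhere else.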

## What it does not do

* There is no default/sample policy provided for the callback mechanism mentioned above.
* Earlier variants of IDNA (2003, 2008) are not implemented—only UTS 46.
* The transitional mode is not supported. The transition is considered to be over: The transitional mode is deprecated in the UTS 46 specification, and the three major browser engines use non-transitional processing.
* There is no API for categorizing errors beyond there being an error.
* Checks that are configurable in UTS 46 but that the WHATWG URL Standard always sets a particular way (regardless of the _beStrict_ flag in the URL Standard) cannot be configured.
* The _UseSTD3ASCIIRules_ and _CheckHyphens_ flags cannot be set individually: they are bundled into one setting.
* There is no support for a caller-provided ASCII deny list (there is only the choice between STD3 and WHATWG deny lists).

## Known spec violations

* The `verify_dns_length` behavior that this crate implements allows a trailing dot in the input as required by the UTS 46 test suite despite the UTS 46 spec saying that this isn't allowed.
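
As a rough illustration of the trailing-dot allowance, here is a minimal sketch of a _VerifyDNSLength_-style check. The function name `verify_dns_length_sketch` is hypothetical and the body is an assumption based on the usual limits (1 to 253 bytes for the name excluding the root dot, 1 to 63 bytes per label); the crate's actual `verify_dns_length` may differ in details.

```rust
/// Hypothetical sketch of a VerifyDNSLength-style check that tolerates
/// a single trailing dot; not the crate's actual `verify_dns_length`.
fn verify_dns_length_sketch(domain: &str) -> bool {
    // Assumption: one trailing dot (the root label) is ignored for length purposes.
    let name = domain.strip_suffix('.').unwrap_or(domain);
    // The whole name, excluding the root dot, must be 1..=253 bytes long.
    if name.is_empty() || name.len() > 253 {
        return false;
    }
    // Every remaining label must be 1..=63 bytes long.
    name.split('.')
        .all(|label| !label.is_empty() && label.len() <= 63)
}
```

Under these assumptions `example.com.` is accepted, while `example..com` is rejected because it contains an empty label.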

## Breaking changes since 0.5.0

* Transitional processing is no longer supported. Attempting to enable it panics immediately.
* IDNA 2008 rules are no longer supported. Attempting to enable them panics immediately.
* Setting `check_hyphens` and `use_std3_ascii_rules` to different values is no longer supported. Attempting conversion with such a configuration panics.
* `check_hyphens` now performs the full _CheckHyphens_ check, including rejecting the hyphen in the third and fourth position in a label.
* `domain_to_ascii_strict` now performs the _CheckHyphens_ check (matching previous documentation).
* When `use_std3_ascii_rules` is `false` the [forbidden domain code point](https://url.spec.whatwg.org/#forbidden-domain-code-point) ASCII deny list from the WHATWG URL Standard is now enforced.
* The `Idna::to_ascii_inner` method has been removed. It didn't make sense as a public method, since callers were unable to figure out if there were errors. (A GitHub search found no callers for this method.)
* Punycode labels whose decoding does not yield any non-ASCII characters are now treated as being in error.
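
To make the full _CheckHyphens_ check mentioned above more concrete, here is a minimal sketch for a single label. `passes_check_hyphens` is a hypothetical name, and the sketch follows the UTS 46 wording (no leading or trailing hyphen, and no hyphens in both the third and fourth positions). It ignores how `xn--` Punycode labels are handled and is not the crate's implementation.

```rust
/// Hypothetical sketch of the UTS 46 CheckHyphens rule for a single label;
/// not the crate's implementation.
fn passes_check_hyphens(label: &str) -> bool {
    // A label must not begin or end with a hyphen.
    if label.starts_with('-') || label.ends_with('-') {
        return false;
    }
    // A label must not have hyphens in both the third and fourth positions.
    let mut chars = label.chars().skip(2);
    !(chars.next() == Some('-') && chars.next() == Some('-'))
}
```

A label shaped like `r3---sn-example` (an illustrative name with hyphens in the third and fourth positions) fails this check, which is the kind of real-world name that the stricter configuration rejects.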
7 changes: 7 additions & 0 deletions idna/benches/all.rs
@@ -11,6 +11,12 @@ fn to_unicode_puny_label(bench: &mut Bencher) {
bench.iter(|| config.to_unicode(black_box(encoded)));
}

fn to_ascii_already_puny_label(bench: &mut Bencher) {
let encoded = "abc.xn--mgbcm";
let config = Config::default();
bench.iter(|| config.to_ascii(black_box(encoded)));
}

fn to_unicode_ascii(bench: &mut Bencher) {
let encoded = "example.com";
let config = Config::default();
@@ -47,6 +53,7 @@ benchmark_group!(
to_unicode_ascii,
to_unicode_merged_label,
to_ascii_puny_label,
to_ascii_already_puny_label,
to_ascii_simple,
to_ascii_merged,
);
8,727 changes: 0 additions & 8,727 deletions idna/src/IdnaMappingTable.txt

This file was deleted.

189 changes: 189 additions & 0 deletions idna/src/deprecated.rs
@@ -0,0 +1,189 @@
// Copyright 2013-2014 The rust-url developers.
//
// Licensed under the Apache License, Version 2.0 <LICENSE-APACHE or
// http://www.apache.org/licenses/LICENSE-2.0> or the MIT license
// <LICENSE-MIT or http://opensource.org/licenses/MIT>, at your
// option. This file may not be copied, modified, or distributed
// except according to those terms.

//! [*Unicode IDNA Compatibility Processing*
//! (Unicode Technical Standard #46)](http://www.unicode.org/reports/tr46/)

#![allow(deprecated)]

use alloc::string::String;

use crate::uts46::*;
use crate::Errors;

/// Deprecated. Use the crate-top-level functions or [`Uts46`].
#[derive(Default)]
#[deprecated]
pub struct Idna {
config: Config,
}

impl Idna {
pub fn new(config: Config) -> Self {
Self { config }
}

/// [UTS 46 ToASCII](http://www.unicode.org/reports/tr46/#ToASCII)
#[allow(clippy::wrong_self_convention)]
pub fn to_ascii(&mut self, domain: &str, out: &mut String) -> Result<(), Errors> {
match Uts46::new().process(
domain.as_bytes(),
self.config.strictness(),
ErrorPolicy::FailFast,
|_, _, _| false,
out,
None,
) {
Ok(ProcessingSuccess::Passthrough) => {
if self.config.verify_dns_length && !verify_dns_length(domain) {
return Err(crate::Errors::default());
}
out.push_str(domain);
Ok(())
}
Ok(ProcessingSuccess::WroteToSink) => {
if self.config.verify_dns_length && !verify_dns_length(out) {
return Err(crate::Errors::default());
}
Ok(())
}
Err(ProcessingError::ValidityError) => Err(crate::Errors::default()),
Err(ProcessingError::SinkError) => unreachable!(),
}
}

/// [UTS 46 ToUnicode](http://www.unicode.org/reports/tr46/#ToUnicode)
#[allow(clippy::wrong_self_convention)]
pub fn to_unicode(&mut self, domain: &str, out: &mut String) -> Result<(), Errors> {
match Uts46::new().process(
domain.as_bytes(),
self.config.strictness(),
ErrorPolicy::MarkErrors,
|_, _, _| true,
out,
None,
) {
Ok(ProcessingSuccess::Passthrough) => {
out.push_str(domain);
Ok(())
}
Ok(ProcessingSuccess::WroteToSink) => Ok(()),
Err(ProcessingError::ValidityError) => Err(crate::Errors::default()),
Err(ProcessingError::SinkError) => unreachable!(),
}
}
}

/// Deprecated configuration API.
#[derive(Clone, Copy)]
#[must_use]
#[deprecated]
pub struct Config {
use_std3_ascii_rules: bool,
verify_dns_length: bool,
check_hyphens: bool,
}

/// The defaults are those of _beStrict=false_ in the [WHATWG URL Standard](https://url.spec.whatwg.org/#idna)
impl Default for Config {
fn default() -> Self {
Config {
use_std3_ascii_rules: false,
check_hyphens: false,
// Only use for to_ascii, not to_unicode
verify_dns_length: false,
}
}
}

impl Config {
/// Whether to enforce STD3 or WHATWG URL Standard ASCII deny list.
///
/// `true` for STD3, `false` for WHATWG.
///
/// Note that `true` rejects pseudo-hosts used by various TXT record-based protocols.
///
/// Must be set to the same value as [`Config::check_hyphens`].
#[inline]
pub fn use_std3_ascii_rules(mut self, value: bool) -> Self {
self.use_std3_ascii_rules = value;
self
}

/// Obsolete method retained to ease migration. The argument must be `false`.
///
/// # Panics
///
/// If the argument is `true`.
#[inline]
#[allow(unused_mut)]
pub fn transitional_processing(mut self, value: bool) -> Self {
assert!(!value, "Transitional processing is no longer supported");
self
}

/// Whether the _VerifyDNSLength_ operation should be performed
/// by `to_ascii`.
#[inline]
pub fn verify_dns_length(mut self, value: bool) -> Self {
self.verify_dns_length = value;
self
}

/// Whether to enforce IETF rules for hyphen placement.
///
/// `true` to deny hyphens in the first, last, third, and fourth
/// position of a label. `false` to not enforce.
///
/// Note that `true` rejects real-world names, including YouTube CDN nodes
/// and some GitHub user pages.
///
/// Must be set to the same value as [`Config::use_std3_ascii_rules`].
#[inline]
pub fn check_hyphens(mut self, value: bool) -> Self {
self.check_hyphens = value;
self
}

/// Obsolete method retained to ease migration. The argument must be `false`.
///
/// # Panics
///
/// If the argument is `true`.
#[inline]
#[allow(unused_mut)]
pub fn use_idna_2008_rules(mut self, value: bool) -> Self {
assert!(!value, "IDNA 2008 rules are no longer supported");
self
}

/// Compute strictness
fn strictness(&self) -> Strictness {
assert_eq!(self.check_hyphens, self.use_std3_ascii_rules, "Setting check_hyphens and use_std3_ascii_rules to different values is no longer supported");
if self.use_std3_ascii_rules {
Strictness::Std3ConformanceChecker
} else {
Strictness::WhatwgUserAgent
}
}

/// [UTS 46 ToASCII](http://www.unicode.org/reports/tr46/#ToASCII)
pub fn to_ascii(self, domain: &str) -> Result<String, Errors> {
let mut result = String::with_capacity(domain.len());
let mut codec = Idna::new(self);
codec.to_ascii(domain, &mut result).map(|()| result)
}

/// [UTS 46 ToUnicode](http://www.unicode.org/reports/tr46/#ToUnicode)
pub fn to_unicode(self, domain: &str) -> (String, Result<(), Errors>) {
let mut codec = Idna::new(self);
let mut out = String::with_capacity(domain.len());
let result = codec.to_unicode(domain, &mut out);
(out, result)
}
}
93 changes: 78 additions & 15 deletions idna/src/lib.rs
@@ -46,41 +46,104 @@ compile_error!("the `alloc` feature must be enabled");
#[macro_use]
extern crate assert_matches;

use alloc::borrow::Cow;
use alloc::string::String;
use uts46::Uts46;

pub mod punycode;
mod uts46;
mod deprecated;
pub mod uts46;

pub use crate::uts46::{Config, Errors, Idna};
#[allow(deprecated)]
pub use crate::deprecated::{Config, Idna};

/// The [domain to ASCII](https://url.spec.whatwg.org/#concept-domain-to-ascii) algorithm.
/// Type indicating that there were errors during UTS #46 processing.
#[derive(Default, Debug)]
#[non_exhaustive]
pub struct Errors {}

impl From<Errors> for Result<(), Errors> {
fn from(e: Errors) -> Result<(), Errors> {
Err(e)
}
}

#[cfg(feature = "std")]
impl std::error::Error for Errors {}

impl core::fmt::Display for Errors {
fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result {
core::fmt::Debug::fmt(self, f)
}
}

/// The [domain to ASCII](https://url.spec.whatwg.org/#concept-domain-to-ascii) algorithm;
/// version returning a `Cow`.
///
/// Return the ASCII representation of a domain name,
/// normalizing characters (upper-case to lower-case and other kinds of equivalence)
/// and using Punycode as necessary.
///
/// This process may fail.
pub fn domain_to_ascii(domain: &str) -> Result<String, uts46::Errors> {
Config::default().to_ascii(domain)
pub fn domain_to_ascii_cow<'a>(domain: &'a str) -> Result<Cow<'a, str>, Errors> {
Uts46::new().to_ascii(domain.as_bytes(), uts46::Strictness::WhatwgUserAgent)
}

/// The [domain to ASCII](https://url.spec.whatwg.org/#concept-domain-to-ascii) algorithm;
/// version returning `String`. See also [`domain_to_ascii_cow`].
///
/// Return the ASCII representation of a domain name,
/// normalizing characters (upper-case to lower-case and other kinds of equivalence)
/// and using Punycode as necessary.
///
/// This process may fail.
pub fn domain_to_ascii(domain: &str) -> Result<String, Errors> {
domain_to_ascii_cow(domain).map(|cow| cow.into_owned())
}

/// The [domain to ASCII](https://url.spec.whatwg.org/#concept-domain-to-ascii) algorithm,
/// with the `beStrict` flag set.
pub fn domain_to_ascii_strict(domain: &str) -> Result<String, uts46::Errors> {
Config::default()
.use_std3_ascii_rules(true)
.verify_dns_length(true)
.to_ascii(domain)
///
/// Note that this rejects various real-world names including:
/// * YouTube CDN nodes
/// * Some GitHub user pages
/// * Pseudo-hosts used by various TXT record-based protocols.
pub fn domain_to_ascii_strict(domain: &str) -> Result<String, Errors> {
Uts46::new()
.to_ascii(
domain.as_bytes(),
uts46::Strictness::Std3ConformanceChecker,
)
.map(|cow| cow.into_owned())
}

/// The [domain to Unicode](https://url.spec.whatwg.org/#concept-domain-to-unicode) algorithm;
/// version returning a `Cow`.
///
/// Return the Unicode representation of a domain name,
/// normalizing characters (upper-case to lower-case and other kinds of equivalence)
/// and decoding Punycode as necessary.
///
/// If the second item of the tuple indicates an error, the first item of the tuple
/// denotes errors using the REPLACEMENT CHARACTERs in order to be able to illustrate
/// errors to the user. When the second item of the return tuple signals an error,
/// the first item of the tuple must not be used in a network protocol.
pub fn domain_to_unicode_cow<'a>(domain: &'a str) -> (Cow<'a, str>, Result<(), Errors>) {
Uts46::new().to_unicode(domain.as_bytes(), uts46::Strictness::WhatwgUserAgent)
}

/// The [domain to Unicode](https://url.spec.whatwg.org/#concept-domain-to-unicode) algorithm.
/// The [domain to Unicode](https://url.spec.whatwg.org/#concept-domain-to-unicode) algorithm;
/// version returning `String`. See also [`domain_to_unicode_cow`].
///
/// Return the Unicode representation of a domain name,
/// normalizing characters (upper-case to lower-case and other kinds of equivalence)
/// and decoding Punycode as necessary.
///
/// This may indicate [syntax violations](https://url.spec.whatwg.org/#syntax-violation)
/// but always returns a string for the mapped domain.
pub fn domain_to_unicode(domain: &str) -> (String, Result<(), uts46::Errors>) {
Config::default().to_unicode(domain)
/// If the second item of the tuple indicates an error, the first item of the tuple
/// denotes errors using the REPLACEMENT CHARACTERs in order to be able to illustrate
/// errors to the user. When the second item of the return tuple signals an error,
/// the first item of the tuple must not be used in a network protocol.
pub fn domain_to_unicode(domain: &str) -> (String, Result<(), Errors>) {
let (cow, result) = domain_to_unicode_cow(domain);
(cow.into_owned(), result)
}
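
The `_cow` variants above return `Cow<'a, str>`, which lets input that needs no changes be handed back without allocating. The following minimal sketch shows that general pattern using only the standard library. `ascii_lowercase_cow` is a hypothetical function that merely lowercases ASCII; it is an assumption-laden stand-in and does not perform UTS 46 processing.

```rust
use std::borrow::Cow;

/// Hypothetical illustration of the pass-through pattern; not the crate's code.
/// Borrows the input when it is already lower-case, allocates otherwise.
fn ascii_lowercase_cow(input: &str) -> Cow<'_, str> {
    if input.bytes().any(|b| b.is_ascii_uppercase()) {
        // Something needs to change, so a new String is allocated.
        Cow::Owned(input.to_ascii_lowercase())
    } else {
        // Nothing to change: return the caller's slice as-is.
        Cow::Borrowed(input)
    }
}
```

Callers that need an owned `String` regardless can call `into_owned()` on the result, which is what the `String`-returning wrappers in this diff do.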