IDNA / UTS #46 "should" requirements (Bidi and Joiners) #110

SimonSapin · 2016-04-03T21:57:22Z

https://url.spec.whatwg.org/#idna refers (through the “Unicode ToAscii” and “Unicode ToUnicode” algorithms) to http://www.unicode.org/reports/tr46/#Processing and relies on the error flag.

The following steps, performed in order, successively alter the input domain_name string and then output it as a converted Unicode string, plus a flag to indicate whether there was an error.

This in turn refers to Section 4.1 http://www.unicode.org/reports/tr46/#Validity_Criteria which has a series of “must” requirements. For example:

The label must be in Unicode Normalization Form NFC.

This section also has a subsection 4.1.2 http://www.unicode.org/reports/tr46/#Right_to_Left_Scripts

In addition, the label should meet the requirements for right-to-left characters specified in the Right-to-Left Scripts document of [IDNA2008], and for the CONTEXTJ requirements in the Protocol document of [IDNA2008]. It is strongly recommended that Unicode Technical Report #36, Unicode Security Considerations [UTR36] and Unicode Technical Standard #39, Unicode Security Mechanisms [UTS39] be consulted for information on dealing with confusables, and for characters that should be excluded from identifiers. Note that the recommended exclusions are a superset of those in [IDNA2008].

Note “should” (emphasis added) and “strongly recommended” rather than “must”.

If the URL Standard is to define interoperable algorithms, I think it needs to define which of the requirements in Section 4.1.2 set the error flag.

(Related: servo/rust-url#179)

annevk · 2016-04-04T07:27:08Z

Interesting, yeah, I don't think that should be enforced. We should probably make this configurable in the IDNA standard.

I noticed the line you quoted mentioned transitional processing. It seems Gecko is successfully shipping non-transitional processing these days. Perhaps Servo can do so too? And maybe the URL Standard should start requiring that? It's still rather contentious whether it's a good idea though...

SimonSapin · 2016-04-04T08:56:38Z

I noticed the line you quoted mentioned transitional processing. It seems Gecko is successfully shipping non-transitional processing these days. Perhaps Servo can do so too? And maybe the URL Standard should start requiring that? It's still rather contentious whether it's a good idea though...

What do other browsers do?

CC @valenting

annevk · 2016-04-04T08:58:19Z

@SimonSapin all other browsers do transitional as far as I know. See https://bugzilla.mozilla.org/show_bug.cgi?id=1218179 and https://bugzilla.mozilla.org/show_bug.cgi?id=1255188 for details.

https://bugs.webkit.org/show_bug.cgi?id=144194

Reviewed by Darin Adler.

Source/WebCore:

Use uidna_nameToASCII instead of the deprecated uidna_IDNToASCII. It uses IDN2008 instead of IDN2003, and it uses UTS #46 when used with a UIDNA opened with uidna_openUTS46. This follows https://url.spec.whatwg.org/#concept-domain-to-ascii except we do not use Transitional_Processing to prevent homograph attacks on German domain names with "ß" and "ss" in them. These are now treated as separate domains. Firefox also doesn't use Transitional_Processing. Chrome and the current specification use Transitional_Processing, but whatwg/url#110 might change the spec.

In addition, http://unicode.org/reports/tr46/ says: "implementations are encouraged to apply the Bidi and ContextJ validity criteria"

Bidi checks prevent domain names with bidirectional text, such as Latin and Hebrew characters in the same domain. Chrome and Firefox do this.

ContextJ checks prevent code points such as U+200D, which is a zero-width joiner which users would not see when looking at the domain name. Firefox currently enables ContextJ checks and it is suggested by UTS #46, so we'll do it.

ContextO checks, which we do not use and neither does any other browser nor the spec, would fail if a domain contains code points such as U+30FB, which looks somewhat like a dot. We can investigate enabling these checks later.

Covered by new API tests and rebased LayoutTests. The new API tests verify that we do not use transitional processing, that we do apply the Bidi and ContextJ checks, but not ContextO checks.

* platform/URLParser.cpp:
(WebCore::URLParser::domainToASCII):
(WebCore::URLParser::internationalDomainNameTranscoder):
* platform/URLParser.h:
* platform/mac/WebCoreNSURLExtras.mm:
(WebCore::mapHostNameWithRange):

Tools:

* TestWebKitAPI/Tests/WebCore/URLParser.cpp:
(TestWebKitAPI::TEST_F):
Add some tests from http://unicode.org/faq/idn.html verifying that we follow UTS46's deviations from IDN2008. Add some tests based on https://tools.ietf.org/html/rfc5893 verifying that we check for bidirectional text. Add a test based on https://tools.ietf.org/html/rfc5892 verifying that we do not do ContextO check. Add a test for U+321D and U+321E which have particularly interesting punycode encodings. We match Firefox here now. Also add a test from http://www.unicode.org/reports/tr46/#IDNAComparison verifying we are not using IDN2003. We should consider importing all of http://www.unicode.org/Public/idna/9.0.0/IdnaTest.txt as URL domain tests.

LayoutTests:

* fast/encoding/idn-security.html:
Move some characters with changed IDN encodings to inside the check for old ICU.
* fast/url/idna2003-expected.txt:
* fast/url/idna2008-expected.txt:
Update expected results. We are now more compliant with IDN2008.

git-svn-id: http://svn.webkit.org/repository/webkit/trunk@208902 268f45cc-cd09-0410-ab3c-d52691b4dbfc
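To make the "ß" versus "ss" distinction from the commit message concrete, here is a minimal Python sketch using only the standard library. It is an illustration, not what WebKit or ICU does: Python's built-in `idna` codec implements IDNA2003, which happens to map "ß" to "ss" the way Transitional_Processing does, while the built-in `punycode` codec applies no mapping at all and so keeps "ß", as Nontransitional_Processing would. The helper names are hypothetical, and no normalization or validity checks are performed.

```python
def to_ascii_idna2003(domain: str) -> str:
    """Encode a whole domain with Python's built-in IDNA2003 codec.

    The IDNA2003 mapping step folds "ß" to "ss", so the distinction is lost.
    """
    return domain.encode("idna").decode("ascii")


def label_to_ascii_keep_eszett(label: str) -> str:
    """Hypothetical helper: Punycode-encode one label without any mapping.

    Assumption: the label is already lowercase and NFC. Real UTS #46
    processing does much more (mapping, normalization, validity checks).
    """
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


if __name__ == "__main__":
    # "ß" is folded to "ss": same host as fass.de
    print(to_ascii_idna2003("faß.de"))
    # "ß" is preserved: a different host from fass.de
    print(label_to_ascii_keep_eszett("faß") + ".de")
```

The first call yields `fass.de` and the second yields `xn--fa-hia.de`, which is why the two processing modes can resolve the same typed name to different hosts.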

achristensen07 · 2016-11-19T05:22:15Z

WebKit just switched to non-transitional processing and added tests verifying that we do Bidi checks and ContextJ checks. We don't do ContextO checks because nobody else does yet. See https://bugs.webkit.org/show_bug.cgi?id=144194

terinjokes · 2017-03-08T04:09:02Z

Just an update on user-agent support from a random user who happened to Googlewhack to this thread (edit: sorry, I seem to have gone a tad off topic):

✅ As mentioned by @achristensen07, WebKit seems to be using some form of non-transitional processing. It looks like this has been picked up by Safari Technical Preview as well.
✅ Firefox has landed this in stable at some point in 2016.
🔴 Chrome's most recent ticket 505262 is currently WONTFIX and they might reconsider at some point in 2017. (see update below)
🔴 Edge's ticket 6818768 hasn't had an update in a year now. (see update below)

In a quick check of development environments I've used in the last 24 hours (transparent support via the language's HTTP interfaces is untested):

✅ Node.js has been using non-transitional IDN since at least 0.10.41 (the oldest I have installed).

🔶 PHP has supported this since 5.4, though IDN2003 is still the (deprecated) default. One probably wants:

idn_to_ascii('gießen.xx', IDNA_NONTRANSITIONAL_TO_ASCII | IDNA_CHECK_BIDI | IDNA_CHECK_CONTEXTJ, INTL_IDNA_VARIANT_UTS46);

✅ In Go, golang.org/x/net/idna does non-transitional processing. Interestingly, the github.com/miekg/dns/idn package also does non-transitional processing for the inputs I've tested, despite the documentation saying it implements IDN2003.

annevk · 2017-03-08T06:17:18Z

There are more recent tickets filed by myself: https://bugs.chromium.org/p/chromium/issues/detail?id=694157 and https://developer.microsoft.com/en-us/microsoft-edge/platform/issues/11009037/.

terinjokes · 2017-03-08T07:01:16Z

@annevk Thanks! I missed those while Googling.

annevk · 2017-03-08T08:04:04Z

See also #263 for issues with Python that may well exist in other implementations. Interoperability issues for everyone.

annevk · 2017-05-18T09:35:15Z

FWIW, I'm pretty sure CONTEXTJ must be false, otherwise 👩‍⚕️ cannot be represented whereas that works fine in user agents. (As seen in http://www.unicode.org/reports/tr46/tr46-18.html#Validity_Criteria.)
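A rough Python sketch of why a ZWJ emoji sequence trips a CONTEXTJ check, assuming the ZERO WIDTH JOINER rule from RFC 5892 Appendix A.2 (U+200D is only allowed directly after a character whose Canonical_Combining_Class is Virama). The function name is hypothetical, and the separate rule for U+200C ZERO WIDTH NON-JOINER is deliberately left out.

```python
import unicodedata

ZWJ = "\u200d"
VIRAMA_CCC = 9  # Canonical_Combining_Class value assigned to Virama characters


def zwj_rule_ok(label: str) -> bool:
    """Return True if every U+200D in the label directly follows a Virama.

    Sketch of the ZWJ contextual rule only; U+200C (ZWNJ) has its own,
    more involved rule that is not implemented here.
    """
    for i, ch in enumerate(label):
        if ch != ZWJ:
            continue
        # A ZWJ at the start of the label has nothing before it, so it fails.
        if i == 0 or unicodedata.combining(label[i - 1]) != VIRAMA_CCC:
            return False
    return True


if __name__ == "__main__":
    woman_health_worker = "\U0001F469\u200d\u2695\ufe0f"  # WOMAN, ZWJ, STAFF OF AESCULAPIUS, VS16
    devanagari = "\u0915\u094d\u200d\u0937"  # KA, VIRAMA, ZWJ, SSA
    print(zwj_rule_ok(woman_health_worker))
    print(zwj_rule_ok(devanagari))
```

The emoji sequence fails because the character before the joiner is U+1F469, which has combining class 0, while the Devanagari sequence passes because the joiner follows U+094D DEVANAGARI SIGN VIRAMA.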

annevk · 2017-05-18T10:12:02Z

Hmm, maybe that's wrong. Safari definitely doesn't seem to do the same thing as Firefox for CONTEXTJ though per https://trac.webkit.org/changeset/208902/webkit it should? @achristensen07 any insights? I kinda think we should allow CONTEXTJ if we allow emojis for subdomains. Banning a subset of emojis seems a little weird.

annevk · 2017-05-19T13:04:26Z

http://www.unicode.org/reports/tr46/proposed.html#Processing has these now as input flags so they're no longer "should" requirements. My limited testing shows CheckBidi should be true. For CheckJoiners the results are unclear. Input welcome.
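For readers wondering what CheckBidi amounts to per label, here is a hedged Python sketch of the six conditions of the Bidi rule in RFC 5893 Section 2, using `unicodedata.bidirectional` from the standard library. The function name is hypothetical, and two simplifications are assumptions of this sketch: it checks a single label unconditionally (UTS #46 applies the rule only when the domain is a Bidi domain name), and it treats an empty label as failing.

```python
import unicodedata

# Bidi classes permitted in right-to-left and left-to-right labels.
RTL_ALLOWED = {"R", "AL", "AN", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}
LTR_ALLOWED = {"L", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}


def bidi_rule_ok(label: str) -> bool:
    """Check one label against the six Bidi rule conditions (sketch)."""
    if not label:
        return False  # choice made by this sketch, not taken from the RFC
    classes = [unicodedata.bidirectional(ch) for ch in label]

    # Condition 1: the first character must be L, R or AL.
    first = classes[0]
    if first not in ("L", "R", "AL"):
        return False

    # Find the last character that is not a trailing NSM. The loop always
    # stops because classes[0] is L, R or AL at this point.
    end = len(classes)
    while classes[end - 1] == "NSM":
        end -= 1
    last = classes[end - 1]

    if first in ("R", "AL"):
        # Conditions 2-4: right-to-left label.
        if not set(classes) <= RTL_ALLOWED:
            return False
        if last not in ("R", "AL", "EN", "AN"):
            return False
        return not ("EN" in classes and "AN" in classes)

    # Conditions 5-6: left-to-right label.
    if not set(classes) <= LTR_ALLOWED:
        return False
    return last in ("L", "EN")


if __name__ == "__main__":
    for candidate in ["\u05d0\u05d1\u05d2", "abc", "a\u05d0", "\u05d0a", "1\u05d0"]:
        print(repr(candidate), bidi_rule_ok(candidate))
```

Labels that mix Latin and Hebrew letters fail in either order, which matches the "bidirectional text" cases the WebKit commit message says it tests for.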

annevk · 2017-05-19T13:05:52Z

And in case it wasn't clear, Nontransitional_Processing has been used in the URL Standard since #239.

Fixes #53 and fixes #267 by no longer breaking on hyphens in the 3rd and 4th position of a domain label. Rejecting such labels is known to break YouTube: r3---sn-2gb7ln7k.googlevideo.com. This is done by setting the proposed CheckHyphens flag to false. Fixes #110 by clarifying that BIDI and CONTEXTJ checks are to be done by setting the proposed CheckBidi and CheckJoiners flags to true. Follow-up #313 is filed to remove the proposed bits once Unicode is updated.

Tests: web-platform-tests/wpt#5976.
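The hyphen handling described above can be sketched in a few lines of Python. This assumes the CheckHyphens conditions as worded in UTS #46 (no U+002D in both the third and fourth positions, and none at the start or end of a label); the function names and the `check_hyphens` parameter are hypothetical stand-ins for the flag.

```python
def hyphen_checks_ok(label: str) -> bool:
    """Apply the two hyphen conditions to a single label (sketch)."""
    # Third and fourth positions are indexes 2 and 3 when counting from zero.
    if label[2:4] == "--":
        return False
    return not (label.startswith("-") or label.endswith("-"))


def label_passes(label: str, check_hyphens: bool) -> bool:
    """With the flag off, the hyphen conditions are skipped entirely."""
    return (not check_hyphens) or hyphen_checks_ok(label)


if __name__ == "__main__":
    youtube_label = "r3---sn-2gb7ln7k"
    print(label_passes(youtube_label, check_hyphens=True))
    print(label_passes(youtube_label, check_hyphens=False))
```

With the flag on, the googlevideo.com label is rejected because its third and fourth characters are both hyphens; with the flag off, as the change sets it, the label is accepted.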

annevk · 2017-05-24T11:16:16Z

It seems that most user agents enforce CheckJoiners if I don't check the more problematic emoji case. So I'll go with that.

SimonSapin mentioned this issue Apr 3, 2016

Fails to parse already punycoded emoji servo/rust-url#179

Closed

annevk added the topic: parser label Dec 20, 2016

annevk mentioned this issue Jan 31, 2017

IDNA2008 #223

Closed

annevk mentioned this issue Feb 9, 2017

IDNA Nontransitional_Processing #239

Closed

annevk added the topic: idna label Feb 10, 2017

annevk changed the title ~~IDNA / UTS #46 "should" requirements~~ IDNA / UTS #46 "should" requirements (Bidi and Joiners) May 19, 2017

annevk mentioned this issue May 20, 2017

Address several IDNA issues #309

Merged

domenic closed this as completed in dc9d831 Jun 1, 2017

swankjesse mentioned this issue Jan 12, 2022

Upgrade from IDN2003 to UTS #46 Non-transitional square/okhttp#7008

Closed
