Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IDNA / UTS #46 "should" requirements (Bidi and Joiners) #110

Closed
SimonSapin opened this issue Apr 3, 2016 · 13 comments
Closed

IDNA / UTS #46 "should" requirements (Bidi and Joiners) #110

SimonSapin opened this issue Apr 3, 2016 · 13 comments

Comments

@SimonSapin
Copy link
Contributor

https://url.spec.whatwg.org/#idna refers (through the “Unicode ToAscii” and “Unicode ToUnicode” algorithms) to http://www.unicode.org/reports/tr46/#Processing and rely on the error flag.

The following steps, performed in order, successively alter the input domain_name string and then output it as a converted Unicode string, plus a flag to indicate whether there was an error.

This it turns refers to Section 4.1 http://www.unicode.org/reports/tr46/#Validity_Criteria which has a series of “must” requirements. For example:

The label must be in Unicode Normalization Form NFC.

This section also has a subsection 4.1.2 http://www.unicode.org/reports/tr46/#Right_to_Left_Scripts

In addition, the label should meet the requirements for right-to-left characters specified in the Right-to-Left Scripts document of [IDNA2008], and for the CONTEXTJ requirements in the Protocol document of [IDNA2008]. It is strongly recommended that Unicode Technical Report #36, Unicode Security Considerations [UTR36] and Unicode Technical Standard #39, Unicode Security Mechanisms [UTS39] be consulted for information on dealing with confusables, and for characters that should be excluded from identifiers. Note that the recommended exclusions are a superset of those in [IDNA2008].

Note “should” (emphasis added) and “strongly recommended” rather than “must”.

If the URL Standard is to define interoperable algorithms, I think it needs to define in which requirements Section 4.1.2 sets the error flag.

(Related: servo/rust-url#179)

@annevk
Copy link
Member

annevk commented Apr 4, 2016

Interesting, yeah, I don't think that should be enforced. We should probably make this configurable in the IDNA standard.

I noticed the line you quoted mentioned transitional processing. It seems Gecko is successfully shipping non-transitional processing these days. Perhaps Servo can do so too? And maybe the URL Standard should start requiring that? It's still rather contentious whether it's a good idea though...

@SimonSapin
Copy link
Contributor Author

I noticed the line you quoted mentioned transitional processing. It seems Gecko is successfully shipping non-transitional processing these days. Perhaps Servo can do so too? And maybe the URL Standard should start requiring that? It's still rather contentious whether it's a good idea though...

What do other browsers do?

CC @valenting

@annevk
Copy link
Member

annevk commented Apr 4, 2016

@SimonSapin all other browsers do transitional as far as I know. See https://bugzilla.mozilla.org/show_bug.cgi?id=1218179 and https://bugzilla.mozilla.org/show_bug.cgi?id=1255188 for details.

kisg pushed a commit to paul99/webkit-mips that referenced this issue Nov 19, 2016
https://bugs.webkit.org/show_bug.cgi?id=144194

Reviewed by Darin Adler.

Source/WebCore:

Use uidna_nameToASCII instead of the deprecated uidna_IDNToASCII.
It uses IDN2008 instead of IDN2003, and it uses UTF #46 when used with a UIDNA opened with uidna_openUTS46.
This follows https://url.spec.whatwg.org/#concept-domain-to-ascii except we do not use Transitional_Processing
to prevent homograph attacks on german domain names with "ß" and "ss" in them.  These are now treated as separate domains.
Firefox also doesn't use Transitional_Processing. Chrome and the current specification use Transitional_processing,
but whatwg/url#110 might change the spec.
        
In addition, http://unicode.org/reports/tr46/ says:
"implementations are encouraged to apply the Bidi and ContextJ validity criteria"
Bidi checks prevent domain names with bidirectional text, such as latin and hebrew characters in the same domain.  Chrome and Firefox do this.

ContextJ checks prevent code points such as U+200D, which is a zero-width joiner which users would not see when looking at the domain name.
Firefox currently enables ContextJ checks and it is suggested by UTS #46, so we'll do it.

ContextO checks, which we do not use and neither does any other browser nor the spec, would fail if a domain contains code points such as U+30FB,
which looks somewhat like a dot.  We can investigate enabling these checks later.

Covered by new API tests and rebased LayoutTests.
The new API tests verify that we do not use transitional processing, that we do apply the Bidi and ContextJ checks, but not ContextO checks.

* platform/URLParser.cpp:
(WebCore::URLParser::domainToASCII):
(WebCore::URLParser::internationalDomainNameTranscoder):
* platform/URLParser.h:
* platform/mac/WebCoreNSURLExtras.mm:
(WebCore::mapHostNameWithRange):

Tools:

* TestWebKitAPI/Tests/WebCore/URLParser.cpp:
(TestWebKitAPI::TEST_F):
Add some tests from http://unicode.org/faq/idn.html verifying that we follow UTS46's deviations from IDN2008.
Add some tests based on https://tools.ietf.org/html/rfc5893 verifying that we check for bidirectional text.
Add a test based on https://tools.ietf.org/html/rfc5892 verifying that we do not do ContextO check.
Add a test for U+321D and U+321E which have particularly interesting punycode encodings.  We match Firefox here now.
Also add a test from http://www.unicode.org/reports/tr46/#IDNAComparison verifying we are not using IDN2003.
We should consider importing all of http://www.unicode.org/Public/idna/9.0.0/IdnaTest.txt as URL domain tests.

LayoutTests:

* fast/encoding/idn-security.html:
Move some characters with changed IDN encodings to inside the check for old ICU.
* fast/url/idna2003-expected.txt:
* fast/url/idna2008-expected.txt:
Update expected results.  We are now more compliant with IDN2008.



git-svn-id: http://svn.webkit.org/repository/webkit/trunk@208902 268f45cc-cd09-0410-ab3c-d52691b4dbfc
@achristensen07
Copy link
Collaborator

WebKit just switched to non-transitional processing and added tests verifying that we do Bidi checks and ContextJ checks. We don't do ContextO checks because nobody else does yet. See https://bugs.webkit.org/show_bug.cgi?id=144194

@terinjokes
Copy link

terinjokes commented Mar 8, 2017

Just an update on user-agent support from a random user who happened to Googlewhack to this thread (edit: sorry, I seem to have gone a tad off topic):

  • ✅ As mentioned by @achristensen07, WebKit seems to be using some form of non-transitional processing. It looks like this has been picked up by Safari Technical Preview as well.
  • ✅ Firefox has landed this in stable at some point in 2016.
  • 🔴 Chrome's most recent ticket 505262 is currently WONTFIX and they might reconsider at some point in 2017. (see updated below)
  • 🔴 Edge hasn't had an update in a year now 6818768. (see updated below)

In a quick check of development environments I've used in the last 24 hours (transparent support via the language's HTTP interfaces are untested):

  • ✅ Node.js has been using non-transitional IDN since has least 0.10.41 (the oldest I have installed).

  • 🔶 PHP supports since 5.4, though IDN2003 is still the (deprecated) default. One probably wants:

    idn_to_ascii('gießen.xx', IDNA_NONTRANSITIONAL_TO_ASCII | IDNA_CHECK_BIDI | IDNA_CHECK_CONTEXTJ, INTL_IDNA_VARIANT_UTS46);
  • ✅ In Go, golang.org/net/x/idna does non-transitional processing. Interestingly, the github.com/miekg/dns/idn package also does non-transitional processing for the inputs I've tested, despite the documentation saying it implements IDN2003.

@annevk
Copy link
Member

annevk commented Mar 8, 2017

@terinjokes
Copy link

@annevk Thanks! I missed those while Googling.

@annevk
Copy link
Member

annevk commented Mar 8, 2017

See also #263 for issues with Python that may well exist in other implementations. Interoperability issues for everyone.

@annevk
Copy link
Member

annevk commented May 18, 2017

FWIW, I'm pretty sure CONTEXTJ must be false, otherwise 👩‍⚕️ cannot be represented whereas that works fine in user agents. (As seen in http://www.unicode.org/reports/tr46/tr46-18.html#Validity_Criteria.)

@annevk
Copy link
Member

annevk commented May 18, 2017

Hmm, maybe that's wrong. Safari definitely doesn't seem to do the same thing as Firefox for CONTEXTJ though per https://trac.webkit.org/changeset/208902/webkit it should? @achristensen07 any insights? I kinda thing we should allow CONTEXTJ if we allow emojis for subdomains. Banning a subset of emojis seems a little weird.

@annevk
Copy link
Member

annevk commented May 19, 2017

http://www.unicode.org/reports/tr46/proposed.html#Processing has these now as input flags so they're no longer should requirements. My limited testing shows CheckBidi should be true. For CheckJoiners the results are unclear. Input welcome.

@annevk
Copy link
Member

annevk commented May 19, 2017

And in case it wasn't clear, Nontransitional_Processing started to be used in the URL Standard since #239.

@annevk annevk changed the title IDNA / UTS #46 "should" requirements IDNA / UTS #46 "should" requirements (Bidi and Joiners) May 19, 2017
annevk added a commit that referenced this issue May 24, 2017
Fixes #53 and fixes #267 by no longer breaking on on hyphens in the
3rd and 4th position of a domain label. This is known to break
YouTube: r3---sn-2gb7ln7k.googlevideo.com. This is done by setting
the proposed CheckHyphens flag to false.

Fixes #110 by clarifying that BIDI and CONTEXTJ checks are to be done
by setting the proposed CheckBidi and CheckJoiners flags to true.

Follow-up #313 is filed to remove the proposed bits once Unicode is
updated.
annevk added a commit that referenced this issue May 24, 2017
Tests: web-platform-tests/wpt#5976.

Fixes #53 and fixes #267 by no longer breaking on on hyphens in the
3rd and 4th position of a domain label. This is known to break
YouTube: r3---sn-2gb7ln7k.googlevideo.com. This is done by setting
the proposed CheckHyphens flag to false.

Fixes #110 by clarifying that BIDI and CONTEXTJ checks are to be done
by setting the proposed CheckBidi and CheckJoiners flags to true.

Follow-up #313 is filed to remove the proposed bits once Unicode is
updated.
@annevk
Copy link
Member

annevk commented May 24, 2017

It seems that most user agents enforce CheckJoiners if I don't check the more problematic emoji case. So I'll go with that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

4 participants