Skip to content

Commit

Permalink
bug #280 Don't force labels containing URL delimiters to stay in thei…
Browse files Browse the repository at this point in the history
…r Unicode form when using idn_to_ascii (TRowbotham)

This PR was squashed before being merged into the 1.18-dev branch.

Discussion
----------

Don't force labels containing URL delimiters to stay in their Unicode form when using idn_to_ascii

Fixes #279.

The problem is this line: https://github.com/symfony/polyfill/blob/master/src/Intl/Idn/Idn.php#L162 and only affects `idn_to_ascii`. Strictly speaking, this isn't part of the spec. The fix is to remove this conditional statement.

The reason why it is here is an assumption that I made regarding a discrepancy between what the spec produces and what the test cases expect. The problem that I was trying to solve is that the tests expect that a label that contains a "?" and also has an error should remain in it's Unicode form, however, the spec says to always convert the label to it's ASCII form. There are roughly 200 test cases where this is the case. As a result of this, I extrapolated this to mean that labels containing URL delimiters and had an error should stay in their Unicode form, as it seemed odd that "?" would be singled out here, however, this was clearly an incorrect assumption as shown by the simple test provided in #279.

Example problematic test case:
```diff
213) Rowbot\Idna\Test\IdnaV2Test::testToAsciiTransitional with data set #6224 ('憡?Ⴔ.XN--1UG73GL146A', '憡?Ⴔ.𐋮‍≠', '[C2, P1, V6]', '憡?Ⴔ.xn--1ug73gl146a', '[C2, P1, V6, A3]', '', '')
Failed asserting that two strings are identical.
--- Expected
+++ Actual
@@ @@
-'憡?Ⴔ.xn--1ug73gl146a'
+'xn--?-c1g3623d.xn--1ug73gl146a'
```

These errors do not appear in the Symfony tests because we opt not to check the transformed domain when it contains errors, which is what the official [ICU test suite](https://github.com/unicode-org/icu/blob/master/icu4j/main/tests/core/src/com/ibm/icu/dev/test/normalizer/UTS46Test.java#L749) does, but this is likely the reason why the discrepancy exists in the first place.

I have notified the Unicode Consortium about the test case discrepancy and have been told that it will be discussed at the next Unicode Technical Committee meeting.

Commits
-------

43fbe88 Don't force labels containing URL delimiters to stay in their Unicode form when using idn_to_ascii
  • Loading branch information
fabpot committed Aug 4, 2020
2 parents 752850a + 43fbe88 commit aa691e9
Show file tree
Hide file tree
Showing 2 changed files with 5 additions and 7 deletions.
8 changes: 1 addition & 7 deletions src/Intl/Idn/Idn.php
Original file line number Diff line number Diff line change
Expand Up @@ -155,14 +155,8 @@ public static function idn_to_ascii($domainName, $options = IDNA_DEFAULT, $varia
$labels = self::process((string) $domainName, $options, $info);

foreach ($labels as $i => $label) {
// Only convert labels to punycode that contain non-ASCII code points and only if that
// label does not contain a character from the gen-delims set specified in
// {@link https://ietf.org/rfc/rfc3987.html#section-2.2}
// Only convert labels to punycode that contain non-ASCII code points
if (1 === preg_match('/[^\x00-\x7F]/', $label)) {
if (false !== strpbrk($label, ':/?#[]@')) {
continue;
}

try {
$label = 'xn--'.self::punycodeEncode($label);
} catch (Exception $e) {
Expand Down
4 changes: 4 additions & 0 deletions tests/Intl/Idn/IdnTest.php
Original file line number Diff line number Diff line change
Expand Up @@ -384,6 +384,10 @@ public function domainNamesProvider()
'άέήίΰαβγδεζηθικλμνξοπρσστυφχ.com',
'xn--hxacdefghijklmnopqrstuvw0caz0a1a2a.com',
),
array(
'test@bücher.de',
'xn--test@bcher-feb.de',
),
);
}

Expand Down

0 comments on commit aa691e9

Please sign in to comment.