Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IDN polyfill for the UTS46 variant is incomplete #159

Closed
stof opened this issue Oct 23, 2018 · 10 comments · Fixed by #267
Closed

IDN polyfill for the UTS46 variant is incomplete #159

stof opened this issue Oct 23, 2018 · 10 comments · Fixed by #267

Comments

@stof
Copy link
Member

stof commented Oct 23, 2018

See #149 (comment) and following comments

@GrahamCampbell
Copy link
Contributor

GrahamCampbell commented Jan 12, 2020

It also seems to not throw an error when there's a leading dash that is not allowed. See guzzle/guzzle#2550 (comment).

@nicolas-grekas
Copy link
Member

Anyone willing to improve this?

@TRowbotham
Copy link
Contributor

TRowbotham commented Jun 3, 2020

Anyone willing to improve this?

At the risk of shooting myself in the foot, I'd be willing to donate the library I just finished to improve the situation here.

Overview

I wrote this for my own use after having tried other polyfills and being disappointed with the results. You can find the repository here for testing, though I have not added it to Packagist yet contrary to the install instructions in the readme. The library is fully UTS#46 compliant and passes the test suites for Unicode versions 12.1.0 and 13.0.0 (more details below). Code coverage is roughly 95%.

I understand that swapping the underlying library here isn't really the most palatable solution to the problem, but still I wanted to offer. IDNA is a rather complex topic and doesn't really have a simple solution. I'll be happy to answer any questions you may have and I'll try to provide you with enough information to make an informed decision. Right now, the Normalizer implementation used is the wildcard, and as far as I can tell there are only 2 existing polyfills available; symfony/polyfill-intl-normalizer and wikimedia/utfnormal.

Tests

The table below contains preliminary test results. This was run against the IDNA test suite for Unicode version 13.0.0. There are a total of 6,225 test cases. The tests are run 3 times; once for toUnicode, once for toASCII with Transitional Processing disabled, and once for toASCII with Transitional Processing enabled. This results in a total of 18,675 tests. The "Normalizer + Unicode version" column lists the Normalizer implementation used and the Unicode version that the Normalizer implementation is using.

IDNA implementation Normalizer + Unicode version Assertions Errors Failures
symfony/polyfill-intl-idn N/A 27,112 7 17,803
rowbot/idna symfony/polyfill-intl-normalizer + 6.3.0[1] 37,350 0 1,195
rowbot/idna wikimedia/utfnormal + 8.0.0[2] 37,350 0 444
rowbot/idna wikimedia/utfnormal + 13.0.0[3] 37,056 0 746
rowbot/idna ext-intl + 13.0.0 37,350 0 0
rowbot/idna ext-intl + 12.1.0 37,350 0 0
rowbot/idna ext-intl + ??[4] 37,344 0 6

[1] From the source comment I assumed this was version 6.3.0.
[2] The docs don't state what version of Unicode it comes packaged with, but commit comments indicate that the build scripts were last updated for Unicode 8.0.0.
[3] The docs state that you can fetch the latest data files, which I did, but the results were worse. I assume that either the format of the data files changed or the normalization spec changed since the library was last updated.
[4] Based off a travis build. I'm not sure what version of Unicode is being used in the travis environment, but it is less than 12.1.0. Likely somewhere between 10.0.0 and 12.0.0 if I had to guess.

I will file a new issue with the error details from running symfony/polyfill-intl-idn to try to reduce the amount of information in this post.

While the results with a Normalizer implementation that uses an appropriate version of Unicode are very promising, and great for validating conformance, they are essentially irrelevant since the use case for installing this is that the user does not have the intl extension installed.

I understand the desire to have a test suite that 100% passes and that is what is expected from a quick look at the docs, but unfortunately that can't be guaranteed here with the available Normalizer polyfills. So, I don't know how you feel about having a test suite that may or may not pass depending on the Normalizer's Unicode version.

Performance

Could be better, could be worse. Sorry, the details in this department are a bit light and unscientific.

OPCache and PCRE JIT were enabled, but the autoloader was not optimized.

The fastest run (rowbot/idna with ext-intl Normalizer) takes ~2 seconds to run the test suite on my computer. The worst case was symfony/polyfill-intl-idn clocking in at ~3.5 seconds, likely due to the sheer number of errors (is phpunit supposed to be counting time/memory it uses for diffing/building error messages? phpunit also reported memory usage of ~470Mb in this case, which seems odd if its not supposed to include the memory it's using for error strings). I know relying on the test suite run times for performance metrics is bad, but I don't have any other metrics to provide at this time.

The Unicode data is an array with ~8,700 entries that represents the entire repertoire of Unicode code points. A binary search is done on this array to lookup the information for each code point in the input, but nothing is cached. So, could possibly be improved with caching lookup results, but determining the best heuristic for cache eviction and all that is challenging.

Preloading the Unicode data in PHP 7.4+ could be a potential win.

I haven't actually checked performance in PHP 8 yet, but the JIT may help with all the binary searches.

Implications

My library ships with Unicode data for a specific version of Unicode (in this case it is using 13.0.0). While this is great for helping to ensure the stability of the resulting output, it does come with some drawbacks:

  • I have not thought ahead enough about how versioning might work. Do you increase the major version every time you bump the included Unicode version or is incrementing the minor version sufficient? Its possible that nothing of relevance changes between Unicode versions.
  • If the format of the data files change, the build scripts will need to be adjusted accordingly.
  • If the format of the test data changes, the tests will need to be updated accordingly.
  • I've done my best to not rely on features that may not exist, but my library does depend on PCRE being compiled with Unicode support. While this could be avoided relatively easily, it would come at the cost of performance.
  • Although we require PCRE to have Unicode support, we can't rely on things like /\p{M}/u to give a consistent anwser as it changes based on the version of PCRE, so we are forced to use a bunch of really large, messy, script generated regular expressions to work around it and its not the easiest thing debug.

Additionally,

  • It would raise the minimum version of PHP for this polyfill from 5.3.3 to 7.1.
  • My library uses Idna instead of Idn and as a result anyone depending on the internal class name would have issues unless you rename it.

Differences

I've tried very hard for parity between my library and the native ICU implementation, however, there are some differences.

  • ICU follows the recommendation of marking invalid labels with U+FFFD code points, but my library does not do this currently as it is not required and the tests would need to be made to do wildcard matching.
  • The UTS#46 specification only requires implementations to record that there was an error and does not require the reproduction of exact error codes. My library does a relatively good job of matching the ICU implementation in terms of exact error codes, but does not offer 100% parity in this regard.

Conclusion

First, let me apologize for the headache you now have. This is probably a lot to take in and is far from a perfect solution. In a perfect world, everyone would have access to the intl extension that wasn't built with a horribly outdated version of ICU. It would be awesome if symfony/polyfill-intl-normalizer was updated to a more recent version of Unicode. I am open to feedback. Feel free to ask questions. Thank you for your time. /end brain dump

EDIT: Fixed the number of assertions for the last row in the tests table.

@nicolas-grekas
Copy link
Member

Wow, nice :)

Would you be up to sending your code in a PR on this repository, replacing the current one?

Do you increase the major version every time you bump the included Unicode version or is incrementing the minor version sufficient?

We considered this as bug fixes in the past. For sure a major version bump would not be desired.

I've done my best to not rely on features that may not exist, but my library does depend on PCRE being compiled with Unicode support.

that's fine, we already do this hypothesis in other polyfills and it proved being a non-issue.

Although we require PCRE to have Unicode support, we can't rely on things like /\p{M}/u

Can't we really? Did you spot any difference? If yes, at which version of PCRE/PHP? I'd be tempted to consider this as an issue we don't need to work around. Let's use /\p{M}/u, unless we have some data that tells us that a significant population would be impacted.

It would raise the minimum version of PHP for this polyfill from 5.3.3 to 7.1.

Can this be relaxed? Which 7.1 features are required?

My library uses Idna instead of Idn and as a result anyone depending on the internal class name would have issues unless you rename it.

Don't you want to rename your class to Idn? Anyway, that's fine: the class is internal.

@nicolas-grekas
Copy link
Member

nicolas-grekas commented Jun 4, 2020

#266 should fix a least some (maybe all?) failures with the polyfill-intl-normalizer, can you please give it a try?

@TRowbotham
Copy link
Contributor

TRowbotham commented Jun 4, 2020

Would you be up to sending your code in a PR on this repository, replacing the current one?

Yes, I can do that.

Can't we really? Did you spot any difference? If yes, at which version of PCRE/PHP? I'd be tempted to consider this as an issue we don't need to work around. Let's use /\p{M}/u, unless we have some data that tells us that a significant population would be impacted.

Unfortunately, there is a difference. Using /\p{M}/u results in 50 test failures on PHP 7.2 and 0 test failures on PHP 7.3. This is because PCRE is using an ancient version of Unicode in PHP 7.2. In PHP 7.3, the Unicode versions still don't match, but it just happens that there are no relevant changes in regards to the General Mark Category between versions 12.1.0 and 13.0.0. The PHP docs also state that an external PCRE library can be used instead of the bundled one, so there is no guarantee that the PCRE version is the same across PHP installs of the same version.

So, this is an issue even on recent versions of PHP and gets progressively worse the older the PHP version is.

Looking at PCRE changelog, the highest supported version of Unicode is 7.0.0. So, we can't expect any version of PHP using PCRE1 to have better Unicode support in regular expressions. My PHP 7.1/7.2 install has a PCRE version of 8.43.

My PHP 7.3/7.4 install reports PCRE is using Unicode version 12.1.0 and PCRE version 10.34, but they are using PCRE2. According to the PCRE2 changelog it only received Unicode 13.0.0 in Version 10.35 released on 09-May-2020.

I initially wrote this library using the intl extension to make it easier on myself. When I started replacing the use of the intl extension is when I first noticed the problem of inconsistent versions of Unicode in PHP. Asking IntlChar::charType() and preg_match('/\p{M}/u') if a code point was in the General Mark category gave 2 different answers in the same PHP installation. This was because the Unicode version used by the 2 extensions were different.

Can this be relaxed? Which 7.1 features are required?

Depends on how relaxed you want to get. I don't require anything super fancy. You could go back to at least PHP 5.5 if you rip out the type hints, const and function imports, and require the appropriate symfony/polyfill packages. There are a few other things like Unicode code point string escapes, variadic parameters using ..., generators, which can all be easily replaced with alternatives. I haven't evaluated PHP 5.3/5.4 support.

Don't you want to rename your class to Idn? Anyway, that's fine: the class is internal.

I'm fine with renaming to Idn, I don't feel strongly either way.

#266 should fix a least some (maybe all?) failures with the polyfill-intl-normalizer, can you please give it a try?

Yes, I will absolutely give this a try. Thanks!

@TRowbotham
Copy link
Contributor

UTS#46 defines a series of options that can be toggled by the caller to customize it's behavior. Two of those options, CheckHyphens and VerifyDnsLength, cannot be toggled when using the intl extension, but my library supports toggling them. The intl extension treats those 2 options as being always enabled.

To support configuring those 2 options would require adding 2 new constants even when the polyfill doesn't get used, but they would be no-ops when using the intl extension. I assume that we don't want to do that, but I wanted to mention it as an implementation detail.

@TRowbotham
Copy link
Contributor

I tried #266 and it didn't change the number of errors. Looking more closely, the errors with symfony/polyfill-intl-normalizer seem to all be a result of Normalizer::isNormalized() returning false more often than we'd like. All, 1,195 cases have the same error message and there is only one place in the code where this error gets emitted. #266 may not have been the solution here, the update is still greatly appreciated and I'm sure it helps in other cases that aren't in the tests.

Expected no errors, but found ERROR_INVALID_ACE_LABEL (1024)

Depending on how you interpret the following statement from Section 8.2. Testing Conformance the easiest thing to do here would be to change the tests to not check that there are no errors when it expects no errors.

Implementations may be more strict than the default settings for UTS46

This is more of a false negative than being more strict. In the meantime I'll go check out the normalization spec and think about it a bit.

@TRowbotham
Copy link
Contributor

So, I ended up just rolling my own isNormalized() method using the example code from the normalization spec. I only need NFC support and the required data set is relatively small for it.

I couldn't figure out how to disable the intl extension on travis, so I just pushed a branch using the symfony normalizer directly. All tests passed!

On a side note, the nightly build seems to be complaining about a non-existant constant on the Normalizer class.

Error: Undefined class constant 'Normalizer::NONE'

@GrahamCampbell
Copy link
Contributor

My PHP 7.3/7.4 install reports PCRE is using Unicode version 12.1.0 and PCRE version 10.34, but they are using PCRE2. According to the PCRE2 changelog it only received Unicode 13.0.0 in Version 10.35 released on 09-May-2020.

Actually, PHP has a bundled version of PCRE which is used by default. The versions are:

  • PHP 7.2.x: PREC 8.41
  • PHP 7.3.x: PRCE2 10.32
  • PHP 7.4.0-7.4.5: PRCE2 10.33
  • PHP 7.4.6+: PRCE2 10.34

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants