Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use ruby unicode normalize to avoid libidn C problems and heavy legacy ruby code #492

Merged
merged 1 commit into from Mar 14, 2023

Conversation

jarthod
Copy link
Contributor

@jarthod jarthod commented Feb 13, 2023

This is a fix for #408 and the beginning of the simplification and modernization of the IDNA code. It does:

  • Replaces libidn unicode normalize implementation by ruby's, to avoid the null terminated string problem and support more recent unicode versions.
  • Replaces Pure unicode normalize by ruby's, to avoid hundreds of line of slow and legacy ruby code and support more recent unicode versions.

The result in terms of performance is the following (benchmark code commited, but needs to be run on master to measure the legacy "pure" code):

#              user     system      total        real
# pure     1.325948   0.000000   1.325948 (  1.326054)
# libidn   0.058067   0.000000   0.058067 (  0.058069)
# ruby     0.325062   0.000000   0.325062 (  0.325115)

The native ruby implementation is 5-6 times slower than libidn, but also 5 times faster than the existing Pure one.

Considering this is just a small part and the rest of the ruby code is more than one order of magnitude slower (doing a full normalize in the simple.rb benchmark for example takes 8.6 seconds for the same 100k iterations on my machine), I think this is more than enough, no need to try to shave off microseconds in the unicode_normalize code if the rest is a lot of ruby.

And we benefit from the native ruby function which is going to be maintained and improved.
Specs have been added to cover the problematic \x00 case.

benchmark/unicode_normalize.rb Outdated Show resolved Hide resolved
benchmark/unicode_normalize.rb Outdated Show resolved Hide resolved
benchmark/unicode_normalize.rb Outdated Show resolved Hide resolved
benchmark/unicode_normalize.rb Outdated Show resolved Hide resolved
benchmark/unicode_normalize.rb Outdated Show resolved Hide resolved
benchmark/unicode_normalize.rb Outdated Show resolved Hide resolved
benchmark/unicode_normalize.rb Outdated Show resolved Hide resolved
benchmark/unicode_normalize.rb Show resolved Hide resolved
spec/addressable/idna_spec.rb Outdated Show resolved Hide resolved
spec/addressable/idna_spec.rb Show resolved Hide resolved
@jarthod jarthod changed the title Use ruby unicode normalize to avoid libidn C problems and heavy legacy ruby code WIP: Use ruby unicode normalize to avoid libidn C problems and heavy legacy ruby code Feb 13, 2023
benchmark/unicode_normalize.rb Show resolved Hide resolved
benchmark/unicode_normalize.rb Show resolved Hide resolved
lib/addressable/template.rb Outdated Show resolved Hide resolved
@jarthod jarthod changed the title WIP: Use ruby unicode normalize to avoid libidn C problems and heavy legacy ruby code Use ruby unicode normalize to avoid libidn C problems and heavy legacy ruby code Feb 14, 2023
@jarthod
Copy link
Contributor Author

jarthod commented Feb 14, 2023

Ok I've fixed some encoding issues with template on ruby 2.7- (passed with some ascii strings it was failing) and most hound warnings. I've left comments to explain the changes. All tests are passing. Let me know if you have any comment.

As another validation step, I'm gonna run this version in my product (https://updown.io) which converts 150k+ URl regularly to see if I spot anything.

@jarthod
Copy link
Contributor Author

jarthod commented Feb 14, 2023

I've just ran some test on 137k updown.io URLs, comparing normalization with previous and new code:

Using this branch of addressable:

> urls = Check.only(:type, :url).map { Addressable::URI.parse("#{_1.type}://#{_1.url}") }; urls.size
137546
> File.open("urls_after.txt", "w") { |f| f.puts urls.map(&:normalize) }

Same thing with main

> File.open("urls_before.txt", "w") { |f| f.puts urls.map(&:normalize) }

Then comparing the result (obfuscating the URLs):

> diff urls_before.txt urls_after.txt 
70777,70778c70777,70778
< https://xxx.com/?q=(+v%20%CC%84%20+)
< https://xxx.com/?q=(+v%20%CC%84%20+)
---
> https://xxx.com/?q=(+v%C2%AF%C2%A0+)
> https://xxx.com/?q=(+v%C2%AF%C2%A0+)

Only two URL impacted, the difference is the same, I looked closer it's one of the difference between NFKC and NFC.
The new version %C2%AF%C2%A0 is correct because it preserve the unicode character, the older version %20%CC%84%20 is an incorrect URL normalization because it applied NFKC instead of NFC.

So it looks all good here, I added two more specs to make sure the path is NFC normalized but NOT NFKC.

One of the tests failed in my latest push but it's unrelated (it's a ruby download issue on the windows instance), I can't restart it but if you can it'll surely pass :)

@jarthod jarthod requested a review from dentarg February 14, 2023 15:11
Copy link
Collaborator

@dentarg dentarg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, though I'm no expert at all in this area! I trust our tests and your testing @jarthod :-) Thank you for this very clear pull request.

@dentarg
Copy link
Collaborator

dentarg commented Feb 15, 2023

I was thinking about a major version bump, but I guess this very much a bug fix? Unless there are some very strange edge cases we don't know about and people rely on.

@jarthod
Copy link
Contributor Author

jarthod commented Feb 15, 2023

Looks good to me, though I'm no expert at all in this area! I trust our tests and your testing @jarthod :-) Thank you for this very clear pull request.

Thanks! @brasic maybe you also have an opinion or feedback on this PR?

I was thinking about a major version bump, but I guess this very much a bug fix? Unless there are some very strange edge cases we don't know about and people rely on.

I don't think this deserves a major bump, it's more of a bugfix indeed.
#491 and #247 definitely will though :)

@dentarg
Copy link
Collaborator

dentarg commented Feb 15, 2023

#491 and #247 definitely will though :)

Hehe, yes, was thinking that too :)

Copy link
Owner

@sporkmonger sporkmonger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@sporkmonger
Copy link
Owner

Sorry about being slow to respond! I was out on vacation last week and the laptop stayed home.

@sporkmonger sporkmonger merged commit 5c22f25 into sporkmonger:main Mar 14, 2023
@jarthod
Copy link
Contributor Author

jarthod commented Mar 14, 2023

@sporkmonger no problem, thank you!

dentarg pushed a commit that referenced this pull request Apr 9, 2023
As discussed in #492 (comment), this change restores `unicode_normalize_kc` as a deprecated method (in case some people where using it). Example of the produced warning:
```
NOTE: Addressable::IDNA.unicode_normalize_kc is deprecated; use String#unicode_normalize(:nfkc) instead. It will be removed on or after 2023-04.
Addressable::IDNA.unicode_normalize_kc called from benchmark/unicode_normalize.rb:17.
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants