Use ruby unicode normalize to avoid libidn C problems and heavy legacy ruby code #492

jarthod · 2023-02-13T23:22:22Z

This is a fix for #408 and the beginning of the simplification and modernization of the IDNA code. It does the following:

- Replaces the libidn unicode normalize implementation with ruby's, to avoid the null-terminated string problem and support more recent unicode versions.
- Replaces the Pure unicode normalize implementation with ruby's, to avoid hundreds of lines of slow, legacy ruby code and support more recent unicode versions.
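As a small illustration of the first point (this is not code from the PR): a Ruby string carries its own length, so an embedded NUL byte passes through `String#unicode_normalize` untouched, whereas a C API that takes NUL-terminated strings would stop at it. The sample input is made up for the example.

```ruby
# Sample input (an assumption for illustration): "a", a NUL byte, then the
# LATIN SMALL LIGATURE FI (U+FB01), which NFKC expands to the letters "fi".
input = "a\u0000\uFB01"

normalized = input.unicode_normalize(:nfkc)

# The NUL byte is kept and the text after it is still normalized.
puts normalized == "a\u0000fi"   # prints true
puts normalized.length           # prints 4
```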

The result in terms of performance is the following (benchmark code committed, but it needs to be run on master to measure the legacy "pure" code):

#              user     system      total        real
# pure     1.325948   0.000000   1.325948 (  1.326054)
# libidn   0.058067   0.000000   0.058067 (  0.058069)
# ruby     0.325062   0.000000   0.325062 (  0.325115)

The native ruby implementation is 5-6 times slower than libidn, but also about 4 times faster than the existing Pure one.

This is only a small part of the work, and the rest of the ruby code is more than an order of magnitude slower (for example, doing a full normalize in the simple.rb benchmark takes 8.6 seconds for the same 100k iterations on my machine). So I think this is more than enough: there is no need to shave microseconds off the unicode_normalize code when most of the time is spent in the rest of the ruby code.
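For readers who want to reproduce a number comparable to the "ruby" row, here is a minimal timing sketch. It is not the committed benchmark file: the sample string, the `time_normalize` helper and the use of a monotonic clock are assumptions made for the example, and absolute timings will differ per machine.

```ruby
# Hypothetical sample: a host name containing a decomposed accent (assumption).
SAMPLE = "cafe\u0301.example.com"
ITERATIONS = 100_000

# Returns the wall-clock seconds spent normalizing `sample` `iterations` times.
def time_normalize(form, sample, iterations)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  iterations.times { sample.unicode_normalize(form) }
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

elapsed = time_normalize(:nfkc, SAMPLE, ITERATIONS)
puts format("ruby nfkc: %.6f s for %d iterations", elapsed, ITERATIONS)
```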

And we benefit from the native ruby function which is going to be maintained and improved.
Specs have been added to cover the problematic \x00 case.


jarthod · 2023-02-14T10:55:11Z

OK, I've fixed some encoding issues with template on ruby 2.7 and older (it was failing when passed some ASCII strings) and most of the hound warnings. I've left comments to explain the changes. All tests are passing. Let me know if you have any comments.
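The kind of encoding issue mentioned above can be sketched as follows. This is an illustration rather than the change made in this PR: `normalize_nfc` is a hypothetical helper, and it assumes the bytes of a string not tagged as UTF-8 are nevertheless valid UTF-8. `String#unicode_normalize` is documented to raise for strings that are not in a Unicode encoding, so the sketch re-tags a copy as UTF-8 before normalizing.

```ruby
# Hypothetical helper (not part of Addressable's API).
def normalize_nfc(value)
  str = value.to_s
  # Re-tag a copy as UTF-8 when the string is not already tagged as UTF-8.
  # Assumption: the underlying bytes are valid UTF-8.
  str = str.dup.force_encoding(Encoding::UTF_8) unless str.encoding == Encoding::UTF_8
  str.unicode_normalize(:nfc)
end

ascii  = "abc".dup.force_encoding(Encoding::US_ASCII)
binary = "e\u0301".b   # "e" + combining acute accent, tagged as binary

puts normalize_nfc(ascii)  == "abc"      # prints true
puts normalize_nfc(binary) == "\u00E9"   # prints true
```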

As another validation step, I'm going to run this version in my product (https://updown.io), which regularly converts 150k+ URLs, to see if I spot anything.


jarthod · 2023-02-14T13:06:00Z

I've just run some tests on 137k updown.io URLs, comparing normalization between the previous and new code:

Using this branch of addressable:

> urls = Check.only(:type, :url).map { Addressable::URI.parse("#{_1.type}://#{_1.url}") }; urls.size
137546
> File.open("urls_after.txt", "w") { |f| f.puts urls.map(&:normalize) }

Same thing with main:

> File.open("urls_before.txt", "w") { |f| f.puts urls.map(&:normalize) }

Then comparing the result (obfuscating the URLs):

> diff urls_before.txt urls_after.txt 
70777,70778c70777,70778
< https://xxx.com/?q=(+v%20%CC%84%20+)
< https://xxx.com/?q=(+v%20%CC%84%20+)
---
> https://xxx.com/?q=(+v%C2%AF%C2%A0+)
> https://xxx.com/?q=(+v%C2%AF%C2%A0+)

Only two URLs are impacted, and the difference is the same in both. Looking closer, it's one of the differences between NFKC and NFC.
The new version, %C2%AF%C2%A0, is correct because it preserves the unicode characters; the older version, %20%CC%84%20, is an incorrect URL normalization because it applied NFKC instead of NFC.
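The difference can be reproduced in isolation. The sketch below is an illustration, not code from the PR: it takes the two characters behind `%C2%AF%C2%A0` (MACRON, U+00AF, and NO-BREAK SPACE, U+00A0), applies both normalization forms, and percent-encodes every resulting byte with a hypothetical helper, `percent_encode_bytes`.

```ruby
# Percent-encode every byte of a string (illustration helper, hypothetical name).
def percent_encode_bytes(str)
  str.bytes.map { |b| format("%%%02X", b) }.join
end

original = "\u00AF\u00A0"   # MACRON followed by NO-BREAK SPACE

nfc  = original.unicode_normalize(:nfc)    # canonical: characters are preserved
nfkc = original.unicode_normalize(:nfkc)   # compatibility: both are rewritten

puts percent_encode_bytes(nfc)    # prints %C2%AF%C2%A0
puts percent_encode_bytes(nfkc)   # prints %20%CC%84%20
```

NFKC replaces the macron with a space plus a combining macron (U+0304) and the no-break space with a plain space, which is exactly the `%20%CC%84%20` seen in the old output.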

So it looks all good here, I added two more specs to make sure the path is NFC normalized but NOT NFKC.

One of the tests failed in my latest push but it's unrelated (it's a ruby download issue on the windows instance). I can't restart it, but if you can, it will surely pass :)

dentarg

Looks good to me, though I'm no expert at all in this area! I trust our tests and your testing @jarthod :-) Thank you for this very clear pull request.

dentarg · 2023-02-15T16:17:29Z

I was thinking about a major version bump, but I guess this is very much a bug fix? Unless there are some very strange edge cases we don't know about and people rely on.

jarthod · 2023-02-15T16:28:19Z

> Looks good to me, though I'm no expert at all in this area! I trust our tests and your testing @jarthod :-) Thank you for this very clear pull request.

Thanks! @brasic maybe you also have an opinion or feedback on this PR?

> I was thinking about a major version bump, but I guess this is very much a bug fix? Unless there are some very strange edge cases we don't know about and people rely on.

I don't think this deserves a major bump, it's more of a bugfix indeed.
#491 and #247 definitely will though :)

dentarg · 2023-02-15T16:30:03Z

> #491 and #247 definitely will though :)

Hehe, yes, was thinking that too :)


sporkmonger

Looks good to me!

sporkmonger · 2023-03-14T21:54:59Z

Sorry about being slow to respond! I was out on vacation last week and the laptop stayed home.

jarthod · 2023-03-14T22:04:11Z

@sporkmonger no problem, thank you!


As discussed in #492 (comment), this change restores `unicode_normalize_kc` as a deprecated method (in case some people were using it). Example of the produced warning:

```
NOTE: Addressable::IDNA.unicode_normalize_kc is deprecated; use String#unicode_normalize(:nfkc) instead. It will be removed on or after 2023-04.
Addressable::IDNA.unicode_normalize_kc called from benchmark/unicode_normalize.rb:17.
```
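The wording of that warning matches what RubyGems' `Gem::Deprecate` helper prints. Assuming that mechanism (the diff is not shown here, so this is a guess at the approach rather than the merged code), a deprecated module-level method can be sketched as below. `LegacyNormalizer` is a hypothetical module name used only for the example.

```ruby
require "rubygems"

# Hypothetical module standing in for the real one (assumption).
module LegacyNormalizer
  class << self
    # Kept for backwards compatibility; delegates to Ruby's own normalizer.
    def unicode_normalize_kc(value)
      value.to_s.unicode_normalize(:nfkc)
    end

    # Gem::Deprecate wraps the method so each call emits a warning on stderr
    # naming the replacement and the earliest removal date (year, month).
    extend Gem::Deprecate
    deprecate :unicode_normalize_kc, "String#unicode_normalize(:nfkc)", 2023, 4
  end
end

# Still returns the normalized string, and warns on stderr when called.
result = LegacyNormalizer.unicode_normalize_kc("\uFB01")
puts result == "fi"   # prints true
```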

houndci-bot reviewed Feb 13, 2023

jarthod commented Feb 13, 2023

lib/addressable/idna/pure.rb (resolved)

jarthod commented Feb 13, 2023

lib/addressable/uri.rb (resolved)

jarthod commented Feb 13, 2023

spec/addressable/idna_spec.rb (resolved)

jarthod commented Feb 13, 2023

spec/addressable/idna_spec.rb (outdated, resolved)

jarthod changed the title ~~Use ruby unicode normalize to avoid libidn C problems and heavy legacy ruby code~~ WIP: Use ruby unicode normalize to avoid libidn C problems and heavy legacy ruby code Feb 13, 2023

stehorne6 mentioned this pull request Feb 14, 2023

ReDOS when using a variable path and query during extract #364

Open

jarthod force-pushed the fix-null-normalization-408 branch from 2fff371 to 30fdbf1 Compare February 14, 2023 09:55

houndci-bot reviewed Feb 14, 2023

benchmark/unicode_normalize.rb (resolved)

benchmark/unicode_normalize.rb (resolved)

lib/addressable/template.rb (outdated, resolved)

jarthod commented Feb 14, 2023

lib/addressable/template.rb (resolved)

jarthod commented Feb 14, 2023

lib/addressable/uri.rb (resolved)

jarthod commented Feb 14, 2023

lib/addressable/uri.rb (resolved)

jarthod commented Feb 14, 2023

spec/addressable/idna_spec.rb (resolved)

jarthod changed the title ~~WIP: Use ruby unicode normalize to avoid libidn C problems and heavy legacy ruby code~~ Use ruby unicode normalize to avoid libidn C problems and heavy legacy ruby code Feb 14, 2023

jarthod force-pushed the fix-null-normalization-408 branch from 30fdbf1 to 3f4611d Compare February 14, 2023 12:57

houndci-bot reviewed Feb 14, 2023

spec/addressable/uri_spec.rb (outdated, resolved)

jarthod force-pushed the fix-null-normalization-408 branch from 3f4611d to 9499a18 Compare February 14, 2023 12:58

jarthod requested a review from dentarg February 14, 2023 15:11

dentarg approved these changes Feb 15, 2023

dentarg requested review from sporkmonger and therabidbanana February 15, 2023 16:30

sporkmonger reviewed Feb 16, 2023

lib/addressable/idna/pure.rb (resolved)

jarthod mentioned this pull request Feb 17, 2023

Improve pure ruby IDNA implementation to match browsers behavior (IDNA2008 and UTS#46) #491

Open

jarthod force-pushed the fix-null-normalization-408 branch 2 times, most recently from 49bfa83 to c772114 Compare February 18, 2023 20:13

Use ruby unicode normalize to avoid libidn C problems and heavy legacy ruby code

1998e06

jarthod force-pushed the fix-null-normalization-408 branch from c772114 to 1998e06 Compare February 20, 2023 10:35

joseportillam approved these changes Mar 4, 2023

sporkmonger approved these changes Mar 14, 2023

sporkmonger merged commit 5c22f25 into sporkmonger:main Mar 14, 2023

jarthod deleted the fix-null-normalization-408 branch March 14, 2023 22:07

jarthod mentioned this pull request Mar 14, 2023

Normalization differences between IDNA::Native and IDNA::Pure #408

Closed

dentarg mentioned this pull request Apr 3, 2023

undefined method `to_str' for :id:Symbol (NoMethodError) in 2.8.2 #498

Closed

dentarg reviewed Apr 4, 2023

lib/addressable/idna/native.rb (resolved)

jarthod mentioned this pull request Apr 6, 2023

restore unicode_normalize_kc as a deprecated method #504

Merged
