Improve pure ruby IDNA implementation to match browsers behavior (IDNA2008 and UTS#46) #491

Open
jarthod opened this issue Feb 7, 2023 · 3 comments

@jarthod
Contributor

jarthod commented Feb 7, 2023

Forked from #408, this is the separate ticket to deal with the improvement of the default (pure ruby) IDNA implementation.

  • IDNA2003 is the older standard (which was designed for Unicode 3.2); it's pretty permissive.
  • IDNA2008 is the newer one, which supports all Unicode versions up to now. It supports more characters, but also makes a lot of strings invalid that the permissive IDNA2003 accepted, to reduce confusion and phishing risk.
  • UTS#46 is a standard character mapping used in conjunction with IDNA2008, in order to avoid breaking compatibility with older domains and make the transition smoother.
  • There are unfortunately 4 "Deviation" characters between IDNA2003 and IDNA2008 which can't be smoothly transitioned: they are simply converted differently in the two versions. So clients using IDNA2008+UTS#46 have to choose between the old (called "Transitional") and new (called "Non-Transitional") behavior.
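To make the Transitional / Non-Transitional choice concrete, here is a minimal Ruby sketch. The four deviation characters and their Transitional mappings are the ones listed in UTS#46; the `DEVIATIONS` constant and the `map_deviations` helper are hypothetical names used only for illustration, not part of Addressable, and the sketch covers nothing else in the UTS#46 mapping table.

```ruby
# Hypothetical helper, for illustration only (not Addressable's API).
# Transitional processing maps the four deviation characters the way
# IDNA2003 did; Non-Transitional processing leaves them untouched.
DEVIATIONS = {
  "\u00DF" => "ss",      # ß LATIN SMALL LETTER SHARP S -> "ss"
  "\u03C2" => "\u03C3",  # ς GREEK SMALL LETTER FINAL SIGMA -> σ
  "\u200C" => "",        # ZERO WIDTH NON-JOINER is removed
  "\u200D" => ""         # ZERO WIDTH JOINER is removed
}.freeze

def map_deviations(host, transitional:)
  return host unless transitional

  # gsub with a Hash replaces each match with the value stored under it.
  host.gsub(/[\u00DF\u03C2\u200C\u200D]/, DEVIATIONS)
end

map_deviations("faß.de", transitional: true)   # => "fass.de"
map_deviations("faß.de", transitional: false)  # => "faß.de"
```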

AFAICS the expectation is that while all registrars are upgrading to IDNA2008 and only allowing valid hostnames (=approximately forever), browsers and web clients in general are encouraged to use IDNA2008+UTS#46 to widen support. So basically, unless you're running a registrar, IDNA2008+UTS#46 is the target.

That's why libidn2 implements IDNA2008 + UTS#46 and defaults to the Non-Transitional (=new) mode. It is also used by curl, for example, and probably by many other web clients. Firefox and Safari also seem to do IDNA2008+UTS#46 Non-Transitional. Chrome was lagging a bit, as it was still using Transitional mode until very recently; apparently they just changed this to Non-Transitional in Chrome 110. I can't verify this yet as I only have Chrome 109 on Linux ^^

Edit (February 13th 2023): I just received Chrome 110 and confirmed the new behavior, http://faß.de now resolves to http://xn--fa-hia.de (and stays displayed as http://faß.de). Whereas in Chrome 109 it was transformed into http://fass.de (IDNA2003).

libidn (the current "native" option) implements the IDNA2003 standard (the "older" one). IMO we should upgrade to libidn2; this will be discussed in #247.

The "pure" implementation is IDNA2008-ish, but not compliant, as we can see in this example with an emoji modifier:

irb(main):004:0> s1 = "https://l♥️h.ws"
=> "https://l♥️h.ws"
irb(main):006:0> Addressable::URI.parse(s1).normalize
=> #<Addressable::URI:0x243d8 URI:https://xn--lh-t0xz926h.ws/>
irb(main):008:0> s1.codepoints
=> [104, 116, 116, 112, 115, 58, 47, 47, 108, 9829, 65039, 104, 46, 119, 115]

If we compare that to the official Unicode test website, https://xn--lh-t0xz926h.ws (returned by the current "pure" implementation) is not even an option: no matter which standard we use, the result is either xn--lh-t0x.ws or invalid (IDNA2008).
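To see where the two ASCII forms come from, here is a minimal sketch of the Punycode encoder from RFC 3492, written in plain Ruby. `PunycodeSketch` is a hypothetical module name, not Addressable's code, and it omits overflow checks and the `xn--` prefix handling. As far as I understand the UTS#46 mapping table, U+FE0F (VARIATION SELECTOR-16, code point 65039 in the list above) is "ignored", i.e. mapped to nothing, so the label that should be encoded is `l♥h` rather than `l♥️h`.

```ruby
# Minimal RFC 3492 encoder sketch (hypothetical module, no overflow checks).
module PunycodeSketch
  BASE = 36
  TMIN = 1
  TMAX = 26
  SKEW = 38
  DAMP = 700
  INITIAL_BIAS = 72
  INITIAL_N = 128

  module_function

  # Bias adaptation function from RFC 3492, section 6.1.
  def adapt(delta, num_points, first_time)
    delta = first_time ? delta / DAMP : delta / 2
    delta += delta / num_points
    k = 0
    while delta > ((BASE - TMIN) * TMAX) / 2
      delta /= BASE - TMIN
      k += BASE
    end
    k + (BASE - TMIN + 1) * delta / (delta + SKEW)
  end

  # Digits 0..25 map to "a".."z", digits 26..35 map to "0".."9".
  def encode_digit(d)
    (d < 26 ? d + 97 : d - 26 + 48).chr
  end

  # Encodes a single label (without the "xn--" prefix).
  def encode(label)
    input = label.codepoints
    output = input.select { |c| c < 128 }.pack("U*")
    basic_count = handled = output.length
    output << "-" if basic_count > 0
    n = INITIAL_N
    delta = 0
    bias = INITIAL_BIAS
    while handled < input.length
      m = input.select { |c| c >= n }.min   # next code point to insert
      delta += (m - n) * (handled + 1)
      n = m
      input.each do |c|
        delta += 1 if c < n
        next unless c == n
        q = delta
        k = BASE
        loop do                             # emit delta as a variable-length integer
          t = if k <= bias then TMIN
              elsif k >= bias + TMAX then TMAX
              else k - bias
              end
          break if q < t
          output << encode_digit(t + (q - t) % (BASE - t))
          q = (q - t) / (BASE - t)
          k += BASE
        end
        output << encode_digit(q)
        bias = adapt(delta, handled + 1, handled == basic_count)
        delta = 0
        handled += 1
      end
      delta += 1
      n += 1
    end
    output
  end
end

PunycodeSketch.encode("l\u2665\uFE0Fh")                   # => "lh-t0xz926h"
PunycodeSketch.encode("l\u2665\uFE0Fh".delete("\uFE0F"))  # => "lh-t0x"
```

With the variation selector left in, the encoder reproduces the string returned by the current "pure" implementation; with it dropped, it yields the `lh-t0x` form quoted above.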

In order to bring the pure implementation up to the state of the art, we'll have to rewrite some of it (or bring in a dependency).
As I was looking at options for dependencies, I found:

Good news: the Unicode team provides some awesome conformance test files with thousands of input strings and the desired output for IDNA2008+UTS#46, for every version of Unicode, for example: https://www.unicode.org/Public/idna/15.0.0/IdnaTestV2.txt
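As a rough idea of how such a file could feed a test suite, here is a sketch of a line parser. The field order (source; toUnicode; toUnicodeStatus; toAsciiN; toAsciiNStatus; toAsciiT; toAsciiTStatus) and the "blank means same as the previous value" rule are my reading of the header comments in IdnaTestV2.txt and should be double-checked against the file. `IdnaCase` and `parse_idna_test_line` are hypothetical names, the sample line is written in the file's format for illustration, and the sketch ignores the status columns and the escape sequences the file may use.

```ruby
# Hypothetical parser sketch for one line of IdnaTestV2.txt.
# Assumed columns: source; toUnicode; toUnicodeStatus; toAsciiN;
# toAsciiNStatus; toAsciiT; toAsciiTStatus. Status columns are ignored here.
IdnaCase = Struct.new(:source, :to_unicode, :to_ascii_n, :to_ascii_t)

def parse_idna_test_line(line)
  data = line.sub(/#.*/, "").strip      # drop trailing comments
  return nil if data.empty?             # comment-only or blank line

  fields = data.split(";", -1).map(&:strip)
  source = fields[0]
  # A blank column is assumed to mean "same as the previous value".
  to_unicode = fields[1].to_s.empty? ? source : fields[1]
  to_ascii_n = fields[3].to_s.empty? ? to_unicode : fields[3]
  to_ascii_t = fields[5].to_s.empty? ? to_ascii_n : fields[5]
  IdnaCase.new(source, to_unicode, to_ascii_n, to_ascii_t)
end

sample = "faß.de; ; ; xn--fa-hia.de; ; fass.de;  # faß.de"
kase = parse_idna_test_line(sample)
kase.to_ascii_n  # => "xn--fa-hia.de" (Non-Transitional)
kase.to_ascii_t  # => "fass.de"       (Transitional)
```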

My suggestion here would be to go with an incremental rewrite in order to:

  • Remove all the custom Unicode normalization functions by relying on Ruby's instead.
  • Simplify and improve performance by rubyfying the punycode function which is still in C (similar to simpleidn implementation)
  • Slightly update the code to follow the stricter IDNA2008 rules (rejecting invalid chars, etc.)
  • Use the official UTS#46 mapping tables to implement the UTS#46 compatibility layer.
  • Use the extensive conformance test files provided by the Unicode team to robustly test this implementation
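For the first bullet, Ruby has shipped `String#unicode_normalize` since 2.2, so the built-in could stand in for hand-written normalization tables. A minimal sketch follows; `nfc_label` and `nfkc_label` are hypothetical helper names, and which form a rewrite would actually need is an assumption on my side (UTS#46 processing normalizes to NFC after mapping, while IDNA2003's nameprep used NFKC).

```ruby
# Hypothetical helpers wrapping Ruby's built-in normalization.

# NFC recomposes a base letter plus combining mark into a single code point.
def nfc_label(label)
  label.unicode_normalize(:nfc)
end

# NFKC additionally folds compatibility characters such as fullwidth letters.
def nfkc_label(label)
  label.unicode_normalize(:nfkc)
end

nfc_label("e\u0301")  # => "é" (a single code point, U+00E9)
nfkc_label("\uFF45\uFF58\uFF41\uFF4D\uFF50\uFF4C\uFF45")  # => "example"
```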

@sporkmonger @dentarg what do you think?

@dentarg
Collaborator

dentarg commented Feb 7, 2023

I like your suggestions, it looks like a good plan to me

@sporkmonger
Owner

I'm in favor, but also unable to prioritize this work myself. Happy to review a PR. I'm particularly glad to see there are great conformance testing options.

@jarthod
Contributor Author

jarthod commented Feb 17, 2023

@sporkmonger thanks for the feedback, I'm gonna write the PR for this after #492.

@jarthod jarthod changed the title Improve pure ruby IDNA implementation to match brother behaviors (IDNA2008 and UTS#46) Improve pure ruby IDNA implementation to match browsers behavior (IDNA2008 and UTS#46) Feb 17, 2023
3 participants