Alternative handling of illegal IDNs (such as domains with emojis) #18

Closed
pjsg opened this issue Sep 23, 2015 · 19 comments

Comments

@pjsg

pjsg commented Sep 23, 2015

The decode method can throw an exception when it finds characters not acceptable in IDNA2008. I think that the characters are acceptable in UTS46.

idna.decode("xn--co8ha.tk")

There isn't a way of signalling to decode that it should apply uts46 rules. UTS46 (in section 4.3) says:

Like [RFC3490], this will always produce a converted Unicode string. Unlike ToASCII of [RFC3490], this always signals whether or not there was an error.

The decode method currently indicates whether there was an error, but it does not always produce a converted unicode string.

The domain name above is a valid domain name and can be accessed: http://🐔🐔.tk/

Trying to encode this domain name also fails, even with uts46=True and transitional=True.

The python call

"xn--co8ha.tk".decode("idna")

does produce the right answer.

I would stick with the Python IDNA 2003 implementation, except that I need improved handling of the German ß character.

@jribbens
Collaborator

This domain is invalid (see for example http://unicode.org/cldr/utility/idna.jsp?a=%5CU0001f414%5CU0001f414.tk). Why do you need to convert it? You can call idna.uts46_remap directly if you want.

@pjsg
Author

pjsg commented Sep 23, 2015

For me, it shows OK in uts46 (but not in IDNA2008 or IDNA2003). I'm
looking for a library that handles a wide range of domains. The domain
name works and is handled by browsers -- so being unable to handle it
with the idna library is a problem.

On 23/09/2015 08:41, Jon Ribbens wrote:

This domain is invalid (see for example
http://unicode.org/cldr/utility/idna.jsp?a=%5CU0001f414%5CU0001f414.tk),
why do you need to convert it? You can call idna.uts46_remap directly
if you want.



@jribbens
Collaborator

This is an IDNA library, and the domain is invalid under IDNA ;-)
Seriously, if what you want is to decode IDNA without any error checking (which strikes me as a bad idea, but whatever) and then run the result through UTS46, then something like this might do it:

    decoded = idna.uts46_remap(".".join(
        label[4:].decode("punycode") if label.startswith(b"xn--") else
        label.decode("ascii") for label in domain.split(b".")))

Otherwise if you want to provide a patch to add support for ignoring errors then that seems not entirely unreasonable to me, albeit it's not my decision whether it goes into the project!
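
For Python 3 str input, the unvalidated punycode step of the snippet above can be sketched with the standard library alone. The helper name decode_labels_unchecked is hypothetical and not part of the idna package. It performs no IDNA 2008 validation and no UTS 46 remapping, so its output should be treated as untrusted.

```python
def decode_labels_unchecked(domain: str) -> str:
    """Decode each xn-- label with the built-in punycode codec, skipping all validation."""
    labels = []
    for label in domain.split("."):
        if label.lower().startswith("xn--"):
            # Strip the ACE prefix and decode the remainder as raw punycode.
            labels.append(label[4:].encode("ascii").decode("punycode"))
        else:
            # Plain ASCII labels are passed through unchanged.
            labels.append(label)
    return ".".join(labels)


print(decode_labels_unchecked("xn--co8ha.tk"))
```

The result could then be handed to idna.uts46_remap, as in the snippet above, if UTS 46 mapping is wanted.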

@kjd
Owner

kjd commented Sep 24, 2015

I'm OK with having a more lenient conversion function when you pass an appropriate optional argument to the decode function, but the default should be standards compliance. At the end of the day this is an IDNA 2008 compliant library, and domains with emoji in them are illegal in IDNA 2008. These deprecated domains will ultimately stop working as domain registries and software implementers upgrade.

@pjsg
Author

pjsg commented Sep 24, 2015

Ok. So if I add a uts46=True option to decode (with the default as
False), then you would consider a PR.

Thanks

Philip

On 24/09/2015 11:47, Kim Davies wrote:

I'm OK with having a more lenient conversion function when you pass an
appropriate optional argument to the decode function, but the default
should be standards compliance. At the end of the day this is an IDNA
2008 compliant library, and domains with emoji in them are illegal in
IDNA 2008. These deprecated domains will ultimately stop working as
domain registries and software implementers upgrade.



@jribbens
Collaborator

decode() already has exactly such an option. I think you mean an errors='ignore' option?

@pjsg
Author

pjsg commented Sep 24, 2015

Duh. You are exactly right.

On 24/09/2015 14:28, Jon Ribbens wrote:

decode() already has exactly such an option. I think you mean an
errors='ignore' option?



@AlexNigl

AlexNigl commented Sep 6, 2016

I have the same issue with another domain:
xn--unicode-0g94f.ws

If you correctly check this and the aforementioned domain, both are valid:
http://unicode.org/cldr/utility/idna.jsp?a=xn--unicode-0g94f.ws%0D%0Axn--co8ha.tk

The problem comes from the hardcoded code points (in this case codepoint_classes['PVALID']) in idna/idnadata.py, which are most likely not up to date. You can get a current table from here: http://www.unicode.org/Public/idna/latest/

The solution might be either to change idna/idnadata.py every time a new Unicode version comes out, or to hope Python's unicodedata library is always up to date and derive the code points with the help of the rules in RFC5892.

I'm not sure what is the preferred way but I'm willing to take a stab at it, one way or another.

@kjd
Owner

kjd commented Sep 7, 2016

@AlexNigl I am not clear on specifically what you are reporting. Neither xn--unicode-0g94f.ws nor xn--co8ha.tk can be converted to Unicode, as neither is a legal IDNA domain. The Unicode tool result confirms this and gives the same result as this library. Both contain emoji, which are invalid in domain names.

As to the version of Unicode, the IETF have temporarily fixed IDNA to Unicode 6.3.0 due to unintended issues with later versions (see issue #8), but that has no bearing on this specific issue. Unicode 9.0 would produce the exact same result based on RFC 5892 Section 2.1:

General_Category(cp) is in {Ll, Lu, Lo, Nd, Lm, Mn, Mc}

These rules identify characters commonly used in mnemonics and often
informally described as "language characters". In general, only code
points assigned to this category are suitable for use in IDN.

We can see the general category for, say, the CHICKEN (U+1F414) is "So" which is not on the permitted list:

$ python3
Python 3.4.3 (default, Feb 25 2015, 16:10:55)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.56)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> unicodedata.category('🐔')
'So'
>>>
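
The same check can be applied to every code point of a label. The following is a minimal sketch that tests only the Section 2.1 category rule quoted above. The function name is hypothetical, and the other rules in RFC 5892 that add or remove code points are deliberately left out, so this is not a full PVALID derivation.

```python
import unicodedata

# General categories listed in RFC 5892 Section 2.1 (LetterDigits).
LETTER_DIGIT_CATEGORIES = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}


def outside_letter_digits(label: str) -> list:
    """Return (code point, category) pairs that fall outside the Section 2.1 categories."""
    return [
        ("U+{:04X}".format(ord(ch)), unicodedata.category(ch))
        for ch in label
        if unicodedata.category(ch) not in LETTER_DIGIT_CATEGORIES
    ]


print(outside_letter_digits("\U0001F414\U0001F414"))  # the two CHICKEN code points
print(outside_letter_digits("example"))
```

The categories come from whatever Unicode version the running Python's unicodedata module ships with, which may differ from the version the library is pinned to.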

@AlexNigl

AlexNigl commented Sep 9, 2016

@kjd I seem to have confused PVALID (in your code and RFC5892) with the "valid" code points in the IDNA Mapping Table from UTS 46. So please ignore my comments about out-of-date code points.

However, it seems that the uts46 flag doesn't trigger the use of the IDNA Mapping Table (according to UTS-46) in "check_label"; the PVALID table according to RFC5892 is used instead.

| Input | Output | Expected Output |
| --- | --- | --- |
| idna.decode('xn--7n8h', uts46=True) | Error: idna.core.InvalidCodepoint | "\U0001F410" (🐐) |

The reason is that despite "\U0001F410" being invalid in IDNA2008 it is valid according to UTS 46.
From the IDNA Mapping Table:

1F400..1F43E ; valid ; ; NV8 # 6.0 RAT..PAW PRINTS
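
To make the fields of that table line concrete, here is a small parsing sketch. It assumes the semicolon-separated layout of the quoted line (code point range, status, mapping, an IDNA2008 flag, then a # comment). The MappingEntry type and its field names are my own labels, not names taken from UTS 46 or from the idna package.

```python
from typing import NamedTuple


class MappingEntry(NamedTuple):
    first: int            # first code point of the range
    last: int             # last code point of the range (equal to first for single entries)
    status: str           # e.g. "valid"
    mapping: str          # replacement code points, empty when there are none
    idna2008_status: str  # e.g. "NV8", empty when absent


def parse_mapping_line(line: str) -> MappingEntry:
    """Parse one data line in the layout quoted above (hypothetical helper)."""
    data = line.split("#", 1)[0]                 # drop the trailing comment
    fields = [f.strip() for f in data.split(";")]
    fields += [""] * (4 - len(fields))           # pad optional trailing fields
    first, _, last = fields[0].partition("..")
    return MappingEntry(int(first, 16), int(last or first, 16),
                        fields[1], fields[2], fields[3])


entry = parse_mapping_line("1F400..1F43E ; valid ; ; NV8 # 6.0 RAT..PAW PRINTS")
print(entry.first <= 0x1F410 <= entry.last, entry.status, entry.idna2008_status)
```

U+1F410 falls inside the range and carries status "valid", which is why a UTS 46 lookup accepts it while the PVALID check rejects it.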

@threatlead

The IDNA library threw an exception while handling emoji domains from https://xn--qeiaa.ws/ (GoDaddy). Is this a related issue? Thanks!

>>> import idna
>>> print(idna.decode('xn--qeiaa.ws'))

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/venv/lib/python3.4/site-packages/idna/core.py", line 384, in decode
    result.append(ulabel(label))
  File "/venv/lib/python3.4/site-packages/idna/core.py", line 303, in ulabel
    check_label(label)
  File "/venv/lib/python3.4/site-packages/idna/core.py", line 253, in check_label
    raise InvalidCodepoint('Codepoint {0} at position {1} of {2} not allowed'.format(_unot(cp_value), pos+1, repr(label)))
idna.core.InvalidCodepoint: Codepoint U+2764 at position 1 of '❤❤❤' not allowed

@kjd
Owner

kjd commented Feb 13, 2017

I am trying to think of the best generic solution to this and the similar issues found in issue #27 and issue #32. What they all have in common is that they are not legal IDNs, but they are found in the wild due to other non-standards-compliant software. It is a common pattern to simply treat all potential hostnames, IDNA or not, as input to this library, so there is an argument for providing some mechanism of doing conversions around them.

Current use cases:

| Trait | Current Behaviour | Alternative Behaviour |
| --- | --- | --- |
| Hyphens in 3rd and 4th position but not an IDN | Anything with hyphens in those positions is meant to be a valid IDN per the specification | Ability to pass these through without conversion |
| Using emojis | Anything that is a symbol (including emojis) is expressly not permitted in IDN labels. This was a deliberate design choice by the protocol designers. | Allow non PVALID characters to be converted with explicit argument. |

Both could be some twist on using an "errors" argument like Python's native encode/decode functions. Currently the library is analogous to "strict" behavior, but these alternatives would not be analogous to "replace" and "ignore" behaviors.

I'm wondering if adding an errors argument that has a number of potential, combinable values would make sense here:

| Value | Behaviour |
| --- | --- |
| strict | Follow all IDNA 2008 rules (default, can't be combined) |
| skip-pvalid or allow-invalid or ? | Do not test if code points are PVALID (will allow emojis and other illegal characters) |
| passthrough or allow-corrupt or ? | Pass through hostnames that can't be decoded (any malformed punycode string is treated as opaque) |

Not sure if there could be others. (I was thinking you could limit skip-pvalid to certain ranges or character classes, for example, but there is a potential never ending road of complexity going down that hole.)

In practice it would look something like this:

>>> idna.decode('r2---sn-huoa-cvhl.googlevideo.com')
idna.core.IDNAError: Label has disallowed hyphens in 3rd and 4th position
>>> idna.decode('r2---sn-huoa-cvhl.googlevideo.com', errors='passthrough')
'r2---sn-huoa-cvhl.googlevideo.com'
>>>

>>> idna.decode('xn--co8ha.tk')
idna.core.InvalidCodepoint: Codepoint U+1F414 at position 1 of '🐔🐔' not allowed
>>> idna.decode('xn--co8ha.tk', errors='skip-pvalid')
'🐔🐔.tk'

The two exception categories could be combined something along the lines of errors=['passthrough', 'skip-pvalid'] or errors='passthrough,skip-pvalid'.
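
To illustrate how such combinable values might be parsed and applied, here is a standalone sketch that uses only the standard library. Everything in it is hypothetical: decode_lenient, parse_errors and decode_label are not part of the idna package, and the "strict" check is a simplified stand-in (hyphen placement plus the RFC 5892 Section 2.1 categories), not the full IDNA 2008 rule set.

```python
import unicodedata

# Simplified stand-in for the PVALID test: RFC 5892 Section 2.1 categories only.
ALLOWED_CATEGORIES = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}
KNOWN_FLAGS = {"strict", "passthrough", "skip-pvalid"}


def parse_errors(errors):
    """Accept 'a,b' or ['a', 'b'] and return a validated set of flags."""
    flags = set(errors.split(",")) if isinstance(errors, str) else set(errors)
    unknown = flags - KNOWN_FLAGS
    if unknown:
        raise ValueError("unknown errors value(s): {}".format(sorted(unknown)))
    if "strict" in flags and len(flags) > 1:
        raise ValueError("'strict' cannot be combined with other values")
    return flags


def decode_label(label, flags):
    if not label.lower().startswith("xn--"):
        if label[2:4] == "--":
            # Hyphens in 3rd and 4th position without the xn-- prefix.
            if "passthrough" in flags:
                return label
            raise ValueError("Label has disallowed hyphens in 3rd and 4th position")
        return label
    try:
        decoded = label[4:].encode("ascii").decode("punycode")
    except UnicodeError:
        # Malformed punycode is treated as opaque only when asked to.
        if "passthrough" in flags:
            return label
        raise ValueError("Malformed punycode in {!r}".format(label))
    if "skip-pvalid" not in flags:
        for pos, ch in enumerate(decoded, start=1):
            if ch != "-" and unicodedata.category(ch) not in ALLOWED_CATEGORIES:
                raise ValueError("Codepoint U+{:04X} at position {} of {!r} not allowed"
                                 .format(ord(ch), pos, decoded))
    return decoded


def decode_lenient(domain, errors="strict"):
    flags = parse_errors(errors)
    return ".".join(decode_label(label, flags) for label in domain.split("."))


print(decode_lenient("xn--co8ha.tk", errors="passthrough,skip-pvalid"))
```

Keeping the flags as a set makes the list form and the comma-separated form equivalent, and rejecting unknown values early avoids silently falling back to lenient behaviour.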

The biggest concern is that skip-pvalid will circumvent the very protections that drove the creation of IDNA 2008 in the first place. IDNA 2003 was more lenient in this regard (it did allow symbols), but as a result it allowed combinations of characters that were abused for phishing attacks etc. Therefore, it is dangerous to turn this off without the end-user explicitly knowing this is happening: it is not just nonconformant with the protocol, but breaks the very security principles inherent in its design. I feel there needs to be something very explicit to an implementer that doing this is a really bad idea and will cause problems, and should never be used in production code without compensating controls. Use cases are perhaps in a closed system or in something that is doing debugging; production systems taking general user input that need to interoperate should not be doing this.

Does anyone have any thoughts or ideas on this approach or alternatives?

@kjd kjd added this to the v2.3 milestone Feb 13, 2017
@sharno

sharno commented Feb 24, 2017

@kjd Maybe you could emit a warning whenever this illegal conversion happens; a warning in the documentation would help too. A function that reports whether a name is valid under IDNA 2008 would also be great when this library is used for processing domain names.

@kjd
Owner

kjd commented Feb 24, 2017

idna.encode() is a de-facto function to test IDNA 2008 validity of a domain. It will return the encoded domain if successful (and thus valid), and throw an IDNAError exception if not.

@kjd kjd modified the milestones: v2.3, v2.4 Feb 28, 2017
@kjd kjd modified the milestones: v2.4, v2.5 Mar 1, 2017
@nlevitt

nlevitt commented Mar 8, 2017

See also #40 and http://unicode.org/cldr/utility/idna.jsp?a=%E2%98%83.net
IMHO this library needs to match browser behavior (at least optionally) or its usefulness is severely limited. That also means it should reject URLs that browsers reject, for instance:

new URL('http://日本⒈co.jp');
VM142:1 Uncaught TypeError: Failed to construct 'URL': Invalid URL
    at <anonymous>:1:1
(anonymous) @ VM142:1

@jribbens
Collaborator

jribbens commented Mar 8, 2017

IMHO it's better to report the acceptance of emoji (etc) domains to the browser vendors as security bugs in the browsers...

@kjd kjd changed the title Handling of UTS 46 in decode Alternative handling of illegal IDNs (such as domains with emojis) Feb 25, 2020
@j12i

j12i commented Feb 17, 2021

Hi.
I'm also here because

>>> idna.decode('xn--238h.to')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.9/site-packages/idna/core.py", line 389, in decode
    s = ulabel(label)
  File "/usr/lib64/python3.9/site-packages/idna/core.py", line 308, in ulabel
    check_label(label)
  File "/usr/lib64/python3.9/site-packages/idna/core.py", line 257, in check_label
    raise InvalidCodepoint('Codepoint {} at position {} of {} not allowed'.format(_unot(cp_value), pos+1, repr(label)))
idna.core.InvalidCodepoint: Codepoint U+1F63B at position 1 of '😻' not allowed

and when I checked it in the browser, the domain worked.
I'm fine with the current behaviour. I'm glad I learned the domain is invalid even though it works in the browser.
I think you should try and provide a path from the exception to an explanation akin to this one.

@pmevzek-godaddy

pmevzek-godaddy commented May 3, 2021

@j12i xn--238h.to is not "invalid". It is not IDNA2008 compliant, that is true, so of course a library implementing IDNA2008, as the idna Python module does, will consider the name invalid per those rules. However, it exists and works. Why? Because some registries, the one for .to among them, decided not to follow the IDNA2008 standard. Everyone can have their own opinion on whether that was right, but DNS-wise each zone has the administrative power to decide the rules applied in it. ICANN requires IDNA2008 across all gTLDs because they are under contract with it, whereas ccTLDs have more freedom, and hence such cases exist. And may well exist "forever".

Note that per Wikipedia (on "Emoji domains"), right now:

As of April 2021, there are ten top-level domains for which registration is possible: .cf, .ga, .gq, .ml, .tk, .st, .fm, .to, .kz and .ws.

You also have the case, in any TLD, of names created before IDNA2008 started to be enforced. Most of the time, the registry will keep them (until the registrant deletes them). For example, one year ago ♫.com (with the U+266B character), aka xn--m6h.com, still existed (but seems to have been deleted since then). There are probably plenty of others. This creates a problem for any software, because IDNA2008 compatibility is then not just per TLD but might be per name :-(

kjd added a commit that referenced this issue Oct 12, 2021
Moved some items around and added text about version compatibility
and emoji domains
kjd added a commit that referenced this issue Oct 12, 2021
@kjd
Owner

kjd commented Sep 13, 2022

Closing this issue. Mitigations for this are currently referenced in the project's documentation, which links to this issue for anyone that wants to read the discussion.
