Alternative handling of illegal IDNs (such as domains with emojis) #18

pjsg · 2015-09-23T09:38:59Z

The decode method can throw an exception when it finds characters not acceptable in IDNA2008. I think that the characters are acceptable in UTS46.

idna.decode("xn--co8ha.tk")

There isn't a way of signalling to decode that it should apply uts46 rules. UTS46 (in section 4.3) says:

Like [RFC3490], this will always produce a converted Unicode string. Unlike ToASCII of [RFC3490], this always signals whether or not there was an error.

The decode method currently indicates whether there was an error, but it does not always produce a converted unicode string.

The domain name above is a valid domain name and can be accessed: http://🐔🐔.tk/

Trying to encode this domain name also fails, even with uts46=True and transitional=True.

The python call

"xn--co8ha.tk".decode("idna")

does produce the right answer.

I would stick with Python's IDNA 2003 implementation, except that I need improved handling of the German ß character.

jribbens · 2015-09-23T12:41:57Z

This domain is invalid (see for example http://unicode.org/cldr/utility/idna.jsp?a=%5CU0001f414%5CU0001f414.tk), why do you need to convert it? You can call idna.uts46_remap directly if you want.

pjsg · 2015-09-23T12:51:29Z

For me, it shows OK in uts46 (but not in IDNA2008 or IDNA2003). I'm
looking for a library that handles a wide range of domains. The domain
name works and is handled by browsers -- so being unable to handle it
with the idna library is a problem.


jribbens · 2015-09-23T13:36:32Z

This is an IDNA library, and the domain is invalid under IDNA ;-)
Seriously if what you want is to decode IDNA without any error checking (which strikes me as a bad idea, but whatever) and then run it through UTS46 then something like this might do it:

    decoded = idna.uts46_remap(".".join(
        label[4:].decode("punycode") if label.startswith(b"xn--") else
        label.decode("ascii") for label in domain.split(b".")))

Otherwise if you want to provide a patch to add support for ignoring errors then that seems not entirely unreasonable to me, albeit it's not my decision whether it goes into the project!
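
A self-contained variant of the decoding half of that snippet can be sketched with the standard library alone. This is a sketch under stated assumptions, not part of the idna package: `lenient_decode` is a hypothetical helper name, it performs no IDNA 2008 or UTS 46 validation at all, and it leaves out the `idna.uts46_remap` step used above.

```python
def lenient_decode(domain: str) -> str:
    """Decode A-labels to Unicode without any IDNA validity checks.

    Hypothetical helper for illustration; not part of the idna package.
    """
    labels = []
    for label in domain.split("."):
        if label.lower().startswith("xn--"):
            # Strip the ACE prefix and decode the remainder with the stdlib
            # punycode codec. No PVALID, bidi or contextual checks are done.
            labels.append(label[4:].encode("ascii").decode("punycode"))
        else:
            # Plain ASCII labels are passed through unchanged.
            labels.append(label)
    return ".".join(labels)


print(lenient_decode("xn--co8ha.tk"))
```

Because nothing is validated, the result should be treated as untrusted display text rather than as a confirmed IDN.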

kjd · 2015-09-24T15:47:49Z

I'm OK with having a more lenient conversion function when you pass an appropriate optional argument to the decode function, but the default should be standards compliance. At the end of the day this is an IDNA 2008 compliant library, and domains with emoji in them are illegal in IDNA 2008. These deprecated domains will ultimately stop working as domain registries and software implementers upgrade.

pjsg · 2015-09-24T18:01:03Z

Ok. So if I add a uts46=True option to decode (with the default as
False), then you would consider a PR.

Thanks

Philip


jribbens · 2015-09-24T18:28:11Z

decode() already has exactly such an option. I think you mean an errors='ignore' option?

pjsg · 2015-09-24T18:58:48Z

Duh. You are exactly right.


AlexNigl · 2016-09-06T21:10:14Z

I have the same issue with another domain:
xn--unicode-0g94f.ws

If you correctly check this and the aforementioned domain, both are valid:
http://unicode.org/cldr/utility/idna.jsp?a=xn--unicode-0g94f.ws%0D%0Axn--co8ha.tk

The problem comes from the hardcoded code points (in this case codepoint_classes['PVALID']) in idna/idnadata.py, which are most likely not up to date. You can get a current table from here: http://www.unicode.org/Public/idna/latest/

The solution might be either to change idna/idnadata.py every time a new Unicode version comes out, or to rely on Python's unicodedata library being up to date and derive the code points from it using the rules in RFC5892.

I'm not sure what is the preferred way but I'm willing to take a stab at it, one way or another.

kjd · 2016-09-07T16:10:38Z

@AlexNigl I am not clear on what specifically you are reporting. Neither xn--unicode-0g94f.ws nor xn--co8ha.tk can be converted to Unicode, as neither is a legal IDNA domain. The Unicode tool result confirms this and gives the same result as this library. Both contain emoji, which are invalid in domain names.

As to the version of Unicode, the IETF have temporarily fixed IDNA to Unicode 6.3.0 due to unintended issues with later versions (see issue #8), but that has no bearing on this specific issue. Unicode 9.0 would produce the exact same result based on RFC 5892 Section 2.1:

General_Category(cp) is in {Ll, Lu, Lo, Nd, Lm, Mn, Mc}

These rules identify characters commonly used in mnemonics and often
informally described as "language characters". In general, only code
points assigned to this category are suitable for use in IDN.

We can see the general category for, say, the CHICKEN (U+1F414) is "So" which is not on the permitted list:

$ python3
Python 3.4.3 (default, Feb 25 2015, 16:10:55)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.56)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> unicodedata.category('🐔')
'So'
>>>
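
The category test quoted from RFC 5892 Section 2.1 can be sketched as a small function. This covers only that single rule: the full PVALID derivation in RFC 5892 combines it with further rules and exceptions, so passing this check does not by itself make a code point PVALID (uppercase letters, for example, are in category Lu yet are not PVALID). The names `LETTER_DIGIT_CATEGORIES` and `in_letter_digits` are made up for this illustration.

```python
import unicodedata

# General categories listed in RFC 5892 Section 2.1.
LETTER_DIGIT_CATEGORIES = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}


def in_letter_digits(char: str) -> bool:
    """Return True if a single character is in one of the listed categories."""
    return unicodedata.category(char) in LETTER_DIGIT_CATEGORIES


# Print code point, general category and the result of the check.
for char in ("a", "ß", "🐔", "❤"):
    print(f"U+{ord(char):04X}", unicodedata.category(char), in_letter_digits(char))
```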

AlexNigl · 2016-09-09T15:21:11Z

@kjd I seem to have misread how PVALID (in your code and in RFC5892) relates to the "valid" code points in the IDNA Mapping Table from UTS 46. So please ignore my comments about the code points not being up to date.

However it seems that the uts46 flag doesn't trigger the use of the IDNA Mapping Table (according to UTS-46) in "check_label" and instead uses the PVALID table according to RFC5892.

| Input | Output | Expected Output |
| --- | --- | --- |
| idna.decode('xn--7n8h', uts46=True) | Error: idna.core.InvalidCodepoint | "\U0001F410" (🐐) |

The reason is that despite "\U0001F410" being invalid in IDNA2008 it is valid according to UTS 46.
From the IDNA Mapping Table:

1F400..1F43E ; valid ; ; NV8 # 6.0 RAT..PAW PRINTS
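
To make the table lookup concrete, here is a minimal sketch that parses a data line in the format quoted above and checks whether U+1F410 falls in its range. `MappingEntry` and `parse_mapping_line` are hypothetical names for this illustration; the idna package ships its own precomputed tables and does not necessarily work this way.

```python
from typing import NamedTuple


class MappingEntry(NamedTuple):
    first: int   # first code point of the range
    last: int    # last code point of the range (same as first for a single one)
    status: str  # e.g. "valid", "mapped", "disallowed"


def parse_mapping_line(line: str) -> MappingEntry:
    """Parse one data line of the UTS 46 IDNA Mapping Table."""
    data = line.split("#", 1)[0]  # drop the trailing comment
    fields = [field.strip() for field in data.split(";")]
    first, _, last = fields[0].partition("..")
    return MappingEntry(int(first, 16), int(last or first, 16), fields[1])


entry = parse_mapping_line("1F400..1F43E ; valid ; ; NV8 # 6.0 RAT..PAW PRINTS")
goat = ord("\U0001F410")
print(entry.status, entry.first <= goat <= entry.last)
```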

threatlead · 2016-12-22T04:57:38Z

The idna library threw an exception while handling an emoji domain, https://xn--qeiaa.ws/ (GoDaddy). Is this a related issue? Thanks!

>>> import idna
>>> print(idna.decode('xn--qeiaa.ws'))

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/venv/lib/python3.4/site-packages/idna/core.py", line 384, in decode
    result.append(ulabel(label))
  File "/venv/lib/python3.4/site-packages/idna/core.py", line 303, in ulabel
    check_label(label)
  File "/venv/lib/python3.4/site-packages/idna/core.py", line 253, in check_label
    raise InvalidCodepoint('Codepoint {0} at position {1} of {2} not allowed'.format(_unot(cp_value), pos+1, repr(label)))
idna.core.InvalidCodepoint: Codepoint U+2764 at position 1 of '❤❤❤' not allowed

kjd · 2017-02-13T19:52:53Z

I am trying to think of the best generic solution to this and to the similar problems reported in issue #27 and issue #32. What they all have in common is that the names are not legal IDNs, but they are found in the wild because of other software that is not standards-compliant. Since it is a common pattern to treat all potential hostnames, IDNA or not, as input to this library, there is an argument for providing some mechanism for converting them.

Current use cases:

| Trait | Current Behaviour | Alternative Behaviour |
| --- | --- | --- |
| Hyphens in 3rd and 4th position but not an IDN | Anything with hyphens in those positions is meant to be a valid IDN per the specification | Ability to pass these through without conversion |
| Using emojis | Anything that is a symbol (including emojis) is expressly not permitted in IDN labels. This was a deliberate design choice by the protocol designers. | Allow non PVALID characters to be converted with explicit argument. |

Both could be some twist on using an "errors" argument like Python's native encode/decode functions. Currently the library is analogous to "strict" behavior, but these alternatives would not be analogous to "replace" and "ignore" behaviors.

I'm wondering if adding an errors argument that has a number of potential, combinable values would make sense here:

| Value | Behaviour |
| --- | --- |
| `strict` | Follow all IDNA 2008 rules (default, can't be combined) |
| `skip-pvalid` or `allow-invalid` or ? | Do not test if code points are PVALID (will allow emojis and other illegal characters) |
| `passthrough` or `allow-corrupt` or ? | Pass through hostnames that can't be decoded (any malformed punycode string is treated as opaque) |

Not sure if there could be others. (I was thinking you could limit skip-pvalid to certain ranges or character classes, for example, but there is a potential never ending road of complexity going down that hole.)

In practice it would look something like this:

>>> idna.decode('r2---sn-huoa-cvhl.googlevideo.com')
idna.core.IDNAError: Label has disallowed hyphens in 3rd and 4th position
>>> idna.decode('r2---sn-huoa-cvhl.googlevideo.com', errors='passthrough')
'r2---sn-huoa-cvhl.googlevideo.com'
>>>

>>> idna.decode('xn--co8ha.tk')
idna.core.InvalidCodepoint: Codepoint U+1F414 at position 1 of '🐔🐔' not allowed
>>> idna.decode('xn--co8ha.tk', errors='skip-pvalid')
'🐔🐔.tk'

The two exception categories could be combined something along the lines of errors=['passthrough', 'skip-pvalid'] or errors='passthrough,skip-pvalid'.
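
As a rough sketch of how such a combinable argument might be normalised, assuming the placeholder names `strict`, `skip-pvalid` and `passthrough` from the table above: none of this exists in the library, and `parse_error_modes` is a hypothetical helper that accepts either the list form or the comma-separated form.

```python
# Placeholder mode names taken from the proposal; not an existing idna API.
VALID_MODES = {"strict", "skip-pvalid", "passthrough"}


def parse_error_modes(errors="strict"):
    """Normalise a hypothetical combinable `errors` argument to a frozenset."""
    if isinstance(errors, str):
        # Accept the comma-separated spelling, e.g. "passthrough,skip-pvalid".
        errors = errors.split(",")
    modes = frozenset(mode.strip() for mode in errors)
    unknown = modes - VALID_MODES
    if unknown:
        raise ValueError(f"unknown error mode(s): {sorted(unknown)}")
    if "strict" in modes and len(modes) > 1:
        # The proposal says the default mode can't be combined with others.
        raise ValueError("'strict' cannot be combined with other modes")
    return modes


print(sorted(parse_error_modes("passthrough,skip-pvalid")))
print(sorted(parse_error_modes(["skip-pvalid"])))
```

Normalising to a set up front means the decoding code would only need membership tests such as `"passthrough" in modes`, whichever spelling the caller used.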

The biggest concern is that skip-pvalid will circumvent the very protections that drove the creation of IDNA 2008 in the first place. IDNA 2003 was more lenient in this regard (it did allow symbols), but as a result it allowed combinations of characters that were abused for phishing attacks and the like. It is therefore dangerous to turn this off without the end user explicitly knowing that this is happening: it is not just nonconformant with the protocol, it also breaks the security principles inherent in its design. I feel there needs to be something that makes it very explicit to an implementer that doing this is a really bad idea, will cause problems, and should never be used in production code without compensating controls. Legitimate use cases are perhaps a closed system or a debugging tool; production systems that take general user input and need to interoperate should not be doing this.

Does anyone have any thoughts or ideas on this approach or alternatives?

sharno · 2017-02-24T09:20:43Z

@kjd Maybe you could emit a warning whenever such an illegal conversion happens; a warning in the documentation would help too. A function that reports whether a name is valid under IDNA 2008 would also be useful when this library is used to process domain names.

kjd · 2017-02-24T19:24:16Z

idna.encode() is a de-facto function to test IDNA 2008 validity of a domain. It will return the encoded domain if successful (and thus valid), and throw an IDNAError exception if not.
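
Wrapped as a predicate, that approach might look like the sketch below. `is_valid_idna2008` is a hypothetical helper, not something the package provides. The `encode` parameter is an assumption added here so that callers can supply their own options, for example `functools.partial(idna.encode, uts46=True)`; by default the helper imports the third-party idna package and uses `idna.encode`.

```python
def is_valid_idna2008(domain, encode=None):
    """Return True if `encode` accepts the domain, False if it raises.

    Hypothetical helper; `encode` defaults to idna.encode.
    """
    if encode is None:
        import idna  # third-party package discussed in this thread
        encode = idna.encode
    try:
        encode(domain)
    except UnicodeError:
        # idna.IDNAError is a subclass of UnicodeError, so this also
        # covers InvalidCodepoint and the other idna exceptions.
        return False
    return True
```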

nlevitt · 2017-03-08T19:26:07Z

See also #40 and http://unicode.org/cldr/utility/idna.jsp?a=%E2%98%83.net
IMHO this library needs to match browser behavior (at least optionally) or its usefulness is severely limited. That also means it should reject URLs that browsers reject, for instance:

new URL('http://日本⒈co.jp');
VM142:1 Uncaught TypeError: Failed to construct 'URL': Invalid URL
    at <anonymous>:1:1
(anonymous) @ VM142:1

jribbens · 2017-03-08T19:37:02Z

IMHO it's better to report the acceptance of emoji (etc) domains to the browser vendors as security bugs in the browsers...

j12i · 2021-02-17T16:47:04Z

Hi.
I'm also here because

>>> idna.decode('xn--238h.to')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.9/site-packages/idna/core.py", line 389, in decode
    s = ulabel(label)
  File "/usr/lib64/python3.9/site-packages/idna/core.py", line 308, in ulabel
    check_label(label)
  File "/usr/lib64/python3.9/site-packages/idna/core.py", line 257, in check_label
    raise InvalidCodepoint('Codepoint {} at position {} of {} not allowed'.format(_unot(cp_value), pos+1, repr(label)))
idna.core.InvalidCodepoint: Codepoint U+1F63B at position 1 of '😻' not allowed

and when I checked it in the browser, the domain worked.
I'm fine with the current behaviour. I'm glad I learned the domain is invalid even though it works in the browser.
I think you should try and provide a path from the exception to an explanation akin to this one.

pmevzek-godaddy · 2021-05-03T00:29:45Z

@j12i xn--238h.to is not "invalid". It is not IDNA2008 compliant, that is true, so of course a library implementing IDNA2008, as the idna Python module does, will consider the name invalid under those rules. However, it exists and it works. Why? Because some registries, the one for .to among them, decided not to follow the IDNA2008 standard. Everyone can have an opinion on whether they were right to do so, but DNS-wise each zone has the administrative power to decide the rules applied within it. ICANN requires IDNA2008 for all gTLDs because they are under contract with it, but ccTLDs have more freedom, and hence such cases exist. And they may well exist "forever".

Note that per Wikipedia (on "Emoji domains"), right now:

As of April 2021, there are ten top-level domains for which registration is possible: .cf, .ga, .gq, .ml, .tk, .st, .fm, .to, .kz and .ws.

You also have the case, in any TLD, of names created before IDNA2008 started to be enforced. Most of the time, the registry will keep them (until the registrant deletes them). For example, one year ago ♫.com (with the U+266B character), aka xn--m6h.com, still existed (but it seems to have been deleted since then). There are probably plenty of others. This creates a problem for any software, because it means IDNA2008 compatibility is not just per TLD but might be per name :-(


kjd · 2022-09-13T01:18:46Z

Closing this issue. Mitigations for this are currently referenced in the project's documentation, which links to this issue for anyone that wants to read the discussion.

kjd added this to the v2.3 milestone Feb 13, 2017

kjd modified the milestones: v2.3, v2.4 Feb 28, 2017

kjd added the enhancement label Feb 28, 2017

kjd modified the milestones: v2.4, v2.5 Mar 1, 2017

kjd mentioned this issue Mar 10, 2017

idna-2.2: idna.encode('☃') does not return 'xn--n3h' #40

Closed

kjd mentioned this issue Apr 2, 2018

Getting exception for certain URLs that work with curl and other tools #60

Closed

kjd mentioned this issue Apr 10, 2018

Failing domain names #61

Closed

kjd removed this from the v2.x milestone Feb 25, 2020

kjd changed the title ~~Handling of UTS 46 in decode~~ Alternative handling of illegal IDNs (such as domains with emojis) Feb 25, 2020

Sorunome mentioned this issue Dec 26, 2020

not all punycode-encoded domains work matrix-org/synapse#8991

Open

glyph mentioned this issue Jan 21, 2021

DecodedURL.to_uri is inconsistent with DecodedURL.normalize, .child, etc python-hyper/hyperlink#144

Open

snarfed mentioned this issue May 9, 2021

because of idna2008 enforcement some real urls that work in the browser are now broken psf/requests#3687

Closed

snarfed mentioned this issue Jun 22, 2021

Relax IDNA2008 requirement? psf/requests#5845

Closed

kjd added a commit that referenced this issue Oct 12, 2021

Adjust documentation (issue #18)

e7c7563

Moved some items around and added text about version compatibility and emoji domains

kjd added a commit that referenced this issue Oct 12, 2021

Merge pull request #115 from kjd/deprecation-policy

784edd5

Adjust documentation (issue #18)

kjd closed this as completed Sep 13, 2022

kjd mentioned this issue Nov 15, 2022

Codepoint U+2603 not allowed #136

Closed

john-parton mentioned this issue Nov 17, 2022

Add uts46 and transitional to idna.encode/decode to support legacy emoji and idna domains twisted/twisted#11760

Open

twisted-trac mentioned this issue Nov 17, 2022

Unable to do http calls to punycode-encoded emoji domains twisted/twisted#10078

Open

matrixbot mentioned this issue Dec 21, 2023

not all punycode-encoded domains work element-hq/synapse#8991

Open
