
Failing domain names #61

Closed
aggiebill opened this issue Apr 10, 2018 · 8 comments


@aggiebill

I've been looking at IDNA domain registrations and using your library in conjunction with Python's built-in tools.

>>> domain = "xn--53hy7af013i.ws".encode("utf-8")
>>> domain.decode("idna")
'☕🦊✈.ws'

The IDNA package is a HUGE life saver for me. I monitored over 111,000 IDNA domain registrations, and a small percentage of them failed. I've attached the output, which I thought you might find useful.

As you can see above, there is an uptick in registrations of emoji domains now. Although emoji are not part of the specification, it would be very helpful if support for them were incorporated into this package.

failed_output.txt

@kjd
Owner

kjd commented Apr 10, 2018

I'll start by saying, this is working as intended. Those domains are illegal IDNs.

The few zones that contain emoji encoded in Punycode are typically run by registries that perform no registry-side validation whatsoever, such as Samoa (.ws), or hold legacy registrations from before the standards were finalized. Clients that support emoji do so because of transitional strategies for moving off the old version of IDNA, but as software matures those domains will be supported less and less.

See https://www.icann.org/en/system/files/files/sac-095-en.pdf for recent discussion of the issue. The ICANN board also resolved late last year to direct policy making to ensure adherence to the latest version of IDNA (i.e. not allowing emoji): https://features.icann.org/ssac-advisory-use-emoji-domain-names

There may be an argument, for debugging or other reasons, for skipping the validation component of IDNA (see #18 (comment) for related discussion). But it should certainly not be part of the default logic, as it opens clients to all sorts of security issues that IDNA is designed to prevent, such as the homograph attacks that were demonstrated with the earlier version.
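For debugging, one way to inspect what a label contains without any IDNA validation is to strip the `xn--` prefix and run the remainder through Python's built-in `punycode` codec. This is only a sketch: `raw_labels` is a hypothetical helper name, not part of this package, and its output should not be trusted for anything security-sensitive.

```python
ACE_PREFIX = "xn--"


def raw_labels(domain):
    """Decode each A-label with the stdlib punycode codec, skipping all IDNA validation.

    Hypothetical debugging helper: it performs no normalization and no safety checks.
    """
    decoded = []
    for label in domain.split("."):
        if label.lower().startswith(ACE_PREFIX):
            # The Punycode payload is everything after the ACE prefix.
            payload = label[len(ACE_PREFIX):]
            decoded.append(payload.encode("ascii").decode("punycode"))
        else:
            # Plain ASCII labels are passed through unchanged.
            decoded.append(label)
    return ".".join(decoded)


# Prints the raw Unicode form of the domain from the original report.
print(raw_labels("xn--53hy7af013i.ws"))
```

Because nothing is validated here, the result may contain exactly the kinds of code points that the default logic is meant to reject.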

Based on your feedback, the domains you identified, after removing duplicates, appear only in a small number of zones. Here are the zones with 5 or more domains in the list:

$ cat failed_output.txt | grep "idna failed" | cut -f4 -d" " | sort | uniq | sed -E 's/.*xn--[^\.]+\.(.*)$/\1/g' | sort | uniq -c | sort -rn | grep -vE " +[1-4] "
 129 ws
  48 ml
  33 gq
  30 tk
  30 cf
  22 ga
  11 uz
  10 to
   6 xit.uz
   5 preveil.com

Without knowing the source or comprehensiveness of the list, this doesn't strike me as widespread use.

@hynek

hynek commented Dec 11, 2018

Hi everyone,

I've run into a domain that is registered but fails to decode: xn--irland-jc1c.com

InvalidCodepoint: Codepoint U+20AC at position 3 of 'ir€land' not allowed

You can check whois or click the link. :) So this domain totally exists – is this a registry fail? What should I do if I have to deal with this?

Since this is not about emoji, I wasn't sure whether to recycle this issue or open a new one?

@sethmlarson
Collaborator

sethmlarson commented Dec 11, 2018

@hynek IDNA2008 disallows symbols and punctuation, and € is in the Unicode general category "Sc" (Symbol, Currency). If you have to deal with it, you could use IDNA 2003 via Python's built-in codec:

>>> b"xn--irland-jc1c.com".decode("idna")
'ir€land.com'

It seems like a registry not following standards. :)
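To see which characters in a label fall into a symbol or punctuation category, the standard library's `unicodedata` module can report each character's general category. The sketch below is only a rough approximation: the actual IDNA2008 rules are derived per code point (RFC 5892) and are not a plain category filter. For instance, the hyphen is punctuation yet permitted in labels, so it would be flagged here. `flag_symbols` is a hypothetical helper name.

```python
import unicodedata


def flag_symbols(label):
    """Return (char, codepoint, category) for characters in the S* or P* categories.

    Rough heuristic only; it does not reproduce the IDNA2008 derived property tables.
    """
    flagged = []
    for ch in label:
        category = unicodedata.category(ch)
        if category[0] in ("S", "P"):
            flagged.append((ch, "U+{:04X}".format(ord(ch)), category))
    return flagged


print(flag_symbols("ir€land"))  # [('€', 'U+20AC', 'Sc')]
```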

@john-parton

Is this related? I really expected this to work. I'm not trying to decode random, possibly-invalid punycode, but rather encode a domain that works with nearly all browsers.

(ecom) john@john-work:~/Code/projects/ecom$ python3
Python 3.6.8 (default, Oct  7 2019, 12:59:55) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import idna
>>> idna.encode('i❤.ws')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/john/Code/venv/ecom/lib/python3.6/site-packages/idna/core.py", line 360, in encode
    s = alabel(label)
  File "/home/john/Code/venv/ecom/lib/python3.6/site-packages/idna/core.py", line 281, in alabel
    check_label(label)
  File "/home/john/Code/venv/ecom/lib/python3.6/site-packages/idna/core.py", line 261, in check_label
    raise InvalidCodepoint('Codepoint {0} at position {1} of {2} not allowed'.format(_unot(cp_value), pos+1, repr(label)))
idna.core.InvalidCodepoint: Codepoint U+2764 at position 2 of 'i❤' not allowed
>>> idna.__version__
'2.9'

I initially came across this while using Scrapy. scrapy/scrapy#4330 (comment)

@john-parton

The idna codec works, however.

Python 3.6.8 (default, Oct  7 2019, 12:59:55) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 'i❤.ws'.encode('idna')
b'xn--i-7iq.ws'

@jrlevine

IDNA2008 has different, mostly stricter, rules from IDNA2003 about what characters are allowed in domain names. Emoji aren't allowed, so the code is doing what it is supposed to.

@kjd
Owner

kjd commented Feb 20, 2020

Confirming that i❤.ws failing is expected behavior, because it is an illegal domain. The IETF made a deliberate design decision to prohibit symbols and emoji when it updated the IDNA standard. The name works with Python's built-in idna codec because that codec only supports the older, deprecated standard from 2003.
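If such names still have to be handled, one possible approach is to try the strict IDNA2008 decode first and fall back to the built-in codec only when it fails, keeping track of which path succeeded. This is a sketch under assumptions: `decode_with_fallback` is a hypothetical helper, the `idna` import is made optional so the snippet also runs without the package installed, and falling back reintroduces the risks that IDNA2008 validation is meant to prevent.

```python
try:
    import idna  # third-party IDNA2008 package discussed in this issue
except ImportError:
    idna = None  # assumption: degrade to the built-in codec if it is missing


def decode_with_fallback(domain):
    """Return (unicode_name, idna2008_valid) for an ASCII-encoded domain string."""
    if idna is not None:
        try:
            return idna.decode(domain), True
        except idna.IDNAError:
            pass  # not a valid IDNA2008 name; fall through to the legacy codec
    # Python's built-in codec implements the older IDNA 2003 rules.
    return domain.encode("ascii").decode("idna"), False


print(decode_with_fallback("xn--irland-jc1c.com"))  # ('ir€land.com', False)
```

Callers can use the second element of the result to decide whether to display, log, or reject a name that only decodes under the older rules.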

@kjd
Owner

kjd commented Feb 25, 2020

Closing this issue. I'll keep issue #18 open to track potential changes relating to this. Please add any additional commentary there.
