idna-2.2: idna.encode('☃') does not return 'xn--n3h' #40

jakeogh · 2017-02-20T10:54:46Z

>>> import idna
>>> idna.encode('ドメイン.テスト')
b'xn--eckwd4c7c.xn--zckzah'
>>> idna.encode('☃')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.4/site-packages/idna/core.py", line 355, in encode
    result.append(alabel(label))
  File "/usr/lib64/python3.4/site-packages/idna/core.py", line 276, in alabel
    check_label(label)
  File "/usr/lib64/python3.4/site-packages/idna/core.py", line 253, in check_label
    raise InvalidCodepoint('Codepoint {0} at position {1} of {2} not allowed'.format(_unot(cp_value), pos+1, repr(label)))
idna.core.InvalidCodepoint: Codepoint U+2603 at position 1 of '☃' not allowed

jakeogh · 2017-02-20T10:59:31Z

Oops. U+2603 is not valid under IDNA2008: http://unicode.org/cldr/utility/character.jsp?a=2603. Closing.

annevk · 2017-03-08T19:00:05Z

It's valid uts46 though which is what browsers use. You might want to reconsider this.

nlevitt · 2017-03-08T19:10:09Z

Nice utility http://unicode.org/cldr/utility/idna.jsp?a=%E2%98%83.net

kjd · 2017-03-08T20:44:50Z

It's not valid UTS46 for IDNA 2008; it is only valid for IDNA 2003. Look for the "NV8" flag in the UTS46 table data. There may be an argument for adding fall-through IDNA 2003 processing, but as of today this library only supports IDNA 2008.
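
To illustrate the IDNA 2003 side of that distinction, here is a minimal sketch that uses only Python's standard-library `encodings.idna` module, which implements IDNA 2003 (RFC 3490). It does not involve this library at all; the variable names are just for illustration.

```python
import encodings.idna as idna2003  # standard-library IDNA 2003 (RFC 3490) codec

snowman = "\u2603"  # U+2603 SNOWMAN

# IDNA 2003 accepts this code point and Punycode-encodes the label.
snowman_2003 = idna2003.ToASCII(snowman)
print(snowman_2003)  # b'xn--n3h'
```

Under IDNA 2008 the same code point is disallowed, which is why `idna.encode('☃')` raises `InvalidCodepoint` in the traceback at the top of this issue.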

jakeogh · 2017-03-09T00:13:31Z

+1 for optional 2003 fall-through, I run into real web corner cases that are IDNA2003.

annevk · 2017-03-09T08:06:38Z

@kjd UTS46 (the actual document) doesn't have a processing mode where this code point is somehow rejected and this is the first implementation I have seen that does such a thing.

jribbens · 2017-03-09T11:12:11Z

That's because this isn't an implementation of UTS46, it's an implementation of IDNA2008. If for some reason you want only UTS46 and not IDNA2008 then presumably you can call idna.uts46_remap directly.

nlevitt · 2017-03-09T20:53:54Z

idna.uts46_remap doesn't encode anything though...

kjd · 2017-03-09T21:58:50Z

import idna
import encodings.idna as idna2003

def questionable_encode(s):
    # Try IDNA 2008 first (with UTS46 pre-processing).
    try:
        return idna.encode(s, uts46=True)
    except idna.IDNAError:
        # Fall back to the standard library's IDNA 2003 ToASCII.
        try:
            return idna2003.ToASCII(idna.uts46_remap(s))
        except Exception:
            raise idna.IDNAError("Input string is supported by no flavour of IDNA")

>>> questionable_encode("\u2603")
b'xn--n3h'

nlevitt · 2017-03-09T23:11:08Z

Thanks! Stick that function in the library ;)

So is this a bona fide implementation of uts46? (Sorry for still not being totally clear on what the spec entails)

nlevitt · 2017-03-10T00:13:26Z

Doh.

>>> questionable_encode("\u2603.net")
b'xn--.net-4g3b'

kjd · 2017-03-10T01:00:16Z

This was just a quick function I rattled off the top of my head, not tested. You probably need a few more lines to break the input string down into individual labels before using the idna2003 portion. If we added support to do something like this in this library (see issue #18) then it will have a proper test suite etc.
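
A small standard-library-only sketch of why the labels need to be split first: `encodings.idna.ToASCII` operates on a single label, so passing a whole domain name makes it treat the dot as part of that label. Splitting on "." here is a simplifying assumption; it does not handle other label separators or empty labels.

```python
import encodings.idna as idna2003  # standard-library IDNA 2003 codec

name = "\u2603.net"

# ToASCII works on one label, so the dot is Punycode-encoded as label content.
whole = idna2003.ToASCII(name)

# Splitting on "." first encodes each label on its own.
per_label = b".".join(idna2003.ToASCII(label) for label in name.split("."))

print(whole)      # b'xn--.net-4g3b'
print(per_label)  # b'xn--n3h.net'
```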

nlevitt · 2017-03-10T01:53:33Z

Ok, thanks. This version works for at least these two test inputs:

import idna
import encodings.idna as idna2003

def questionable_encode(s):
    # Try IDNA 2008 first (with UTS46 pre-processing).
    try:
        return idna.encode(s, uts46=True)
    except idna.IDNAError:
        # Fall back to IDNA 2003, encoding each label separately.
        try:
            labels = idna.uts46_remap(s).split(".")
            punycode_labels = [idna2003.ToASCII(label) for label in labels]
            return b".".join(punycode_labels)
        except Exception:
            raise idna.IDNAError("Input string is supported by no flavour of IDNA")

>>> questionable_encode("\u2603.net")
b'xn--n3h.net'
>>> questionable_encode("\u2603")
b'xn--n3h'

nlevitt · 2017-03-10T01:53:52Z

"If we added support to do something like this in this library (see issue #18) then it will have a proper test suite etc."

Yes please!

nlevitt · 2017-03-15T01:45:45Z

FYI, the function above mishandles faß.de. The correct result is fass.de:

>>> questionable_encode('faß.de')
b'xn--fa-hia.de'

In fact a number of the examples from http://unicode.org/cldr/utility/idna.jsp don't work.

annevk · 2017-03-15T06:57:59Z

No, fass.de is only the correct result for transitional mode, which is not what we want to align on.
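
The two outcomes for the "faß" label can be reproduced with the standard library alone. This is only an illustration for this one input, not an implementation of either UTS46 mode: IDNA 2003 nameprep happens to match the transitional result here, and Punycode-encoding the unmapped label happens to match the nontransitional one. The variable names are made up for the example.

```python
import encodings.idna as idna2003  # standard-library IDNA 2003 codec

label = "fa\u00df"  # "faß"

# IDNA 2003 nameprep maps ß to "ss", which matches the transitional result here.
transitional_like = idna2003.ToASCII(label)

# Punycode-encoding the label unmapped keeps ß, matching the nontransitional result here.
nontransitional_like = b"xn--" + label.encode("punycode")

print(transitional_like)     # b'fass'
print(nontransitional_like)  # b'xn--fa-hia'
```

In `questionable_encode`, `idna.encode(s, uts46=True)` succeeds for faß.de because ß is valid under IDNA 2008, so the IDNA 2003 fallback is never reached and the result is xn--fa-hia.de.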

nlevitt · 2017-03-15T07:21:47Z

Oh. Well, Chromium does fass.de, Firefox does xn--fa-hia.de.

annevk · 2017-03-15T07:24:30Z

Yeah I know, bugs have been filed.

jakeogh closed this as completed Feb 20, 2017

nlevitt mentioned this issue Mar 8, 2017

IDNA2008/UTS46 whatwg/url#263

Closed

nlevitt mentioned this issue Mar 8, 2017

Alternative handling of illegal IDNs (such as domains with emojis) #18

Closed
