idna-2.2: idna.encode('☃') does not return 'xn--n3h' #40

Closed
jakeogh opened this issue Feb 20, 2017 · 18 comments

Comments

@jakeogh

jakeogh commented Feb 20, 2017

>>> import idna
>>> idna.encode('ドメイン.テスト')
b'xn--eckwd4c7c.xn--zckzah'
>>> idna.encode('☃')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.4/site-packages/idna/core.py", line 355, in encode
    result.append(alabel(label))
  File "/usr/lib64/python3.4/site-packages/idna/core.py", line 276, in alabel
    check_label(label)
  File "/usr/lib64/python3.4/site-packages/idna/core.py", line 253, in check_label
    raise InvalidCodepoint('Codepoint {0} at position {1} of {2} not allowed'.format(_unot(cp_value), pos+1, repr(label)))
idna.core.InvalidCodepoint: Codepoint U+2603 at position 1 of '☃' not allowed
@jakeogh
Author

jakeogh commented Feb 20, 2017

Oops. U+2603 is not valid under IDNA2008: http://unicode.org/cldr/utility/character.jsp?a=2603. Closing.
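For context on why U+2603 is rejected: IDNA2008 (RFC 5892) builds its PVALID set mainly from letters, digits, and combining marks, and symbols such as the snowman fall outside those categories. A rough sketch of that idea using only the standard library follows. `looks_like_letter_digit` is a made-up helper, and it is a heuristic only, not the full RFC 5892 derivation (which adds exceptions and stability rules):

```python
import unicodedata

# General categories that RFC 5892 groups under "LetterDigits".
LETTER_DIGIT_CATEGORIES = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}

def looks_like_letter_digit(ch):
    """Heuristic only: True if ch falls in a LetterDigits general category."""
    return unicodedata.category(ch) in LETTER_DIGIT_CATEGORIES

# Compare a katakana letter from the working example with the snowman.
for ch in "\u30c9\u2603":
    print(f"U+{ord(ch):04X}", unicodedata.name(ch),
          unicodedata.category(ch), looks_like_letter_digit(ch))
```

The snowman has general category So (Symbol, other), so it fails this check, while the katakana letter is Lo and passes.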

@annevk

annevk commented Mar 8, 2017

It's valid UTS46, though, which is what browsers use. You might want to reconsider this.

@nlevitt

nlevitt commented Mar 8, 2017

@kjd
Owner

kjd commented Mar 8, 2017

It's not valid UTS46 for IDNA 2008, it is only valid for IDNA 2003. Look for the "NV8" in the UTS46 table data. Now there may be an argument to add fall-through IDNA 2003 processing, but as of today this library only supports IDNA 2008.
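To illustrate the IDNA 2003 side of this: Python's built-in "idna" codec implements RFC 3490 (IDNA 2003), so it accepts the snowman even though this library does not. A minimal sketch, standard library only:

```python
# The stdlib "idna" codec follows RFC 3490 (IDNA 2003), not IDNA 2008.
snowman = "\u2603"

encoded = snowman.encode("idna")   # IDNA 2003 ToASCII
print(encoded)                     # b'xn--n3h'

decoded = encoded.decode("idna")   # IDNA 2003 ToUnicode
print(decoded == snowman)          # True
```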

@jakeogh
Author

jakeogh commented Mar 9, 2017

+1 for optional 2003 fall-through. I run into real-world web corner cases that are IDNA2003.

@annevk

annevk commented Mar 9, 2017

@kjd UTS46 (the actual document) doesn't have a processing mode where this code point is somehow rejected and this is the first implementation I have seen that does such a thing.

@jribbens
Collaborator

jribbens commented Mar 9, 2017

That's because this isn't an implementation of UTS46, it's an implementation of IDNA2008. If for some reason you want only UTS46 and not IDNA2008 then presumably you can call idna.uts46_remap directly.

@nlevitt

nlevitt commented Mar 9, 2017

idna.uts46_remap doesn't encode anything though...

@kjd
Owner

kjd commented Mar 9, 2017

import idna
import encodings.idna as idna2003

def questionable_encode(s):
    # Try strict IDNA 2008 first, after UTS46 mapping.
    try:
        return idna.encode(s, uts46=True)
    except idna.IDNAError:
        # Fall back to the stdlib IDNA 2003 ToASCII on the remapped string.
        try:
            return idna2003.ToASCII(idna.uts46_remap(s))
        except Exception:
            raise idna.IDNAError("Input string is supported by no flavour of IDNA")

>>> questionable_encode("\u2603")
b'xn--n3h'

@nlevitt

nlevitt commented Mar 9, 2017

Thanks! Stick that function in the library ;)

So is this a bona fide implementation of UTS46? (Sorry for still not being totally clear on what the spec entails.)

@nlevitt

nlevitt commented Mar 10, 2017

Doh.

>>> questionable_encode("\u2603.net")
b'xn--.net-4g3b'

@kjd
Owner

kjd commented Mar 10, 2017

This was just a quick function I rattled off the top of my head, not tested. You probably need a few more lines to break the input string into individual labels before handing them to the idna2003 portion, since ToASCII works on one label at a time. If we added support to do something like this in this library (see issue #18) then it will have a proper test suite etc.
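A small standard-library sketch of why the label split matters. `naive_ace` and `per_label_ace` are hypothetical helpers for illustration only; they use raw Punycode and skip all nameprep/UTS46 mapping, so this is not a full ToASCII:

```python
ACE_PREFIX = b"xn--"

def naive_ace(text):
    # Treats the whole string as one label, so the dot and "net"
    # end up inside the Punycode output.
    return ACE_PREFIX + text.encode("punycode")

def per_label_ace(domain):
    # Encode each dot-separated label on its own; ASCII labels pass through.
    out = []
    for label in domain.split("."):
        if label.isascii():
            out.append(label.encode("ascii"))
        else:
            out.append(ACE_PREFIX + label.encode("punycode"))
    return b".".join(out)

print(naive_ace("\u2603.net"))      # b'xn--.net-4g3b'
print(per_label_ace("\u2603.net"))  # b'xn--n3h.net'
```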

@nlevitt

nlevitt commented Mar 10, 2017

Ok, thanks. This version works for at least these two test inputs:

import idna
import encodings.idna as idna2003

def questionable_encode(s):
    try:
        return idna.encode(s, uts46=True)
    except idna.IDNAError:
        try:
            # ToASCII works on a single label: remap, split on dots, encode each label.
            labels = idna.uts46_remap(s).split(".")
            punycode_labels = [idna2003.ToASCII(label) for label in labels]
            return b".".join(punycode_labels)
        except Exception:
            raise idna.IDNAError("Input string is supported by no flavour of IDNA")

>>> questionable_encode("\u2603.net")
b'xn--n3h.net'
>>> questionable_encode("\u2603")
b'xn--n3h'

@nlevitt

nlevitt commented Mar 10, 2017

> If we added support to do something like this in this library (see issue #18) then it will have a proper test suite etc.

Yes please!

@nlevitt

nlevitt commented Mar 15, 2017

FYI, the function above mishandles faß.de. The correct result is fass.de:

>>> questionable_encode('faß.de')
b'xn--fa-hia.de'

In fact a number of the examples from http://unicode.org/cldr/utility/idna.jsp don't work.

@annevk

annevk commented Mar 15, 2017

No, fass.de is only the correct result for transitional mode, which is not what we want to align on.
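The two outcomes can be reproduced with the standard library alone, as a rough illustration. This assumes the stdlib IDNA 2003 codec as a stand-in for transitional processing (both map ß to "ss") and raw Punycode as a stand-in for non-transitional processing; neither is a full UTS46 implementation:

```python
label = "fa\u00df"  # "faß"

# IDNA 2003 (stdlib codec): nameprep maps ß to "ss", which matches
# what UTS46 transitional processing produces for this input.
transitional_like = (label + ".de").encode("idna")

# Keeping ß and Punycode-encoding the label gives the
# non-transitional (IDNA 2008 style) form.
nontransitional_like = b"xn--" + label.encode("punycode") + b".de"

print(transitional_like)     # b'fass.de'
print(nontransitional_like)  # b'xn--fa-hia.de'
```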

@nlevitt

nlevitt commented Mar 15, 2017

Oh. Well, Chromium does fass.de, Firefox does xn--fa-hia.de.

@annevk

annevk commented Mar 15, 2017

Yeah I know, bugs have been filed.
