because of idna2008 enforcement some real urls that work in the browser are now broken #3687

nlevitt · 2016-11-15T19:46:09Z

Because of idna2008 enforcement in 2.12.0, some real urls that work in the browser are now broken.

For example:
http://☃.net/
http://xn--n3h.net/

My suggestion would be to try idna2008 first, catch exception, then try idna2003.

>>> requests.get('http://xn--n3h.net/')
Traceback (most recent call last):
  File "/Users/nlevitt/workspace/warcprox/warcprox-ve35/lib/python3.5/site-packages/requests/models.py", line 370, in prepare_url
    host = idna.encode(host, uts46=True).decode('utf-8')
  File "/Users/nlevitt/workspace/warcprox/warcprox-ve35/lib/python3.5/site-packages/requests/packages/idna/core.py", line 355, in encode
    result.append(alabel(label))
  File "/Users/nlevitt/workspace/warcprox/warcprox-ve35/lib/python3.5/site-packages/requests/packages/idna/core.py", line 276, in alabel
    check_label(label)
  File "/Users/nlevitt/workspace/warcprox/warcprox-ve35/lib/python3.5/site-packages/requests/packages/idna/core.py", line 253, in check_label
    raise InvalidCodepoint('Codepoint {0} at position {1} of {2} not allowed'.format(_unot(cp_value), pos+1, repr(label)))
requests.packages.idna.core.InvalidCodepoint: Codepoint U+0027 at position 2 of "b'xn--n3h'" not allowed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/nlevitt/workspace/warcprox/warcprox-ve35/lib/python3.5/site-packages/requests/api.py", line 70, in get
    return request('get', url, params=params, **kwargs)
  File "/Users/nlevitt/workspace/warcprox/warcprox-ve35/lib/python3.5/site-packages/requests/api.py", line 56, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Users/nlevitt/workspace/warcprox/warcprox-ve35/lib/python3.5/site-packages/requests/sessions.py", line 474, in request
    prep = self.prepare_request(req)
  File "/Users/nlevitt/workspace/warcprox/warcprox-ve35/lib/python3.5/site-packages/requests/sessions.py", line 407, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "/Users/nlevitt/workspace/warcprox/warcprox-ve35/lib/python3.5/site-packages/requests/models.py", line 302, in prepare
    self.prepare_url(url, params)
  File "/Users/nlevitt/workspace/warcprox/warcprox-ve35/lib/python3.5/site-packages/requests/models.py", line 372, in prepare_url
    raise InvalidURL('URL has an invalid label.')
requests.exceptions.InvalidURL: URL has an invalid label.

Lukasa · 2016-11-15T20:28:50Z

Thanks for this report!

Ultimately, I'm not sure I agree. Fundamentally, those URIs will stop working at some point as browsers move over to IDNA 2008. They will have to move over time in at least some cases, because the .de domain mandates it, so there's no alternative there. So I have no objection to having "http://☃.net/" fail.

Right now I think the real issue is that we attempt to IDNA-encode everything, and probably shouldn't. When a domain that is already IDNA-encoded is passed to us we should probably just leave it alone. The idna project is considering doing the same (see kjd/idna#27), but we can get there ahead of them by saying that for certain URIs we simply short-circuit the encoding.

The logic would have to be:

If the host portion of the URL begins with xn--
AND either
- the URL is a bytestring that can be decoded using ASCII OR
- the URL is a unicode string that can be encoded using ASCII
THEN we skip IDNA encoding
ELSE we do the IDNA encoding

That would, I think, cover this problem. How does that sound?
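
As a rough illustration, the check described above could be sketched as follows. The helper names `is_ascii` and `should_skip_idna` are hypothetical and are not taken from the requests codebase; the sketch covers only the decision, not the encoding itself.

```python
def is_ascii(value):
    """Return True if a bytes or str value contains only ASCII."""
    try:
        if isinstance(value, bytes):
            value.decode("ascii")
        else:
            value.encode("ascii")
    except (UnicodeDecodeError, UnicodeEncodeError):
        return False
    return True


def should_skip_idna(host):
    """Return True if the host looks already IDNA-encoded (hypothetical helper)."""
    # Use a prefix of the same type as the host so bytes and str both work.
    prefix = b"xn--" if isinstance(host, bytes) else "xn--"
    return host.lower().startswith(prefix) and is_ascii(host)


print(should_skip_idna("xn--n3h.net"))   # already encoded: skip
print(should_skip_idna("☃.net"))         # needs encoding: do not skip
```

Note that this only looks at the start of the host, as the logic above states, so a host such as `www.xn--n3h.net` would still go through IDNA encoding in this sketch.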

nlevitt · 2016-11-15T20:48:08Z

That sounds ok.

nateprewitt · 2016-11-16T00:16:33Z

I believe we may not need to worry about the case where the URL is a bytestring. prepare_url transforms it into unicode at the beginning of the method and re-encodes it on the way out.

@nlevitt were you interested in providing a patch for this? If not, I believe I have a fix ready to go but will gladly defer to you.

nlevitt · 2016-11-16T06:28:10Z

Go for it @nateprewitt

rockstar · 2016-11-18T17:27:49Z

I've experienced this issue with pylxd using the unix socket interface and requests-unixsocket. In our case, we have to do parse('/path/to/unix.socket', safe='') as a raw unix path clearly doesn't substitute for a host. This breaks in requests 2.12+, as the urlencoded version %2Fpath%2Fto%2Funix.socket doesn't IDNA encode properly and raises an exception. I only comment here as it might make an interesting test case to prevent further regressions.
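
For reference, a small sketch of that encoding step, assuming the call meant here is `urllib.parse.quote` from the standard library and that the socket path is placed in the host position of an `http+unix://` URL, as requests-unixsocket does. The `/1.0` path is only a placeholder.

```python
from urllib.parse import quote, urlsplit

socket_path = "/path/to/unix.socket"

# With safe="" even "/" is percent-encoded, so the whole socket path can
# sit in the host position of the URL.
encoded = quote(socket_path, safe="")
url = "http+unix://" + encoded + "/1.0"  # "/1.0" is a placeholder path

print(encoded)               # %2Fpath%2Fto%2Funix.socket
print(urlsplit(url).netloc)  # %2Fpath%2Fto%2Funix.socket
```

The percent-encoded host is the value that reaches the IDNA step, which is why a URL of this shape could serve as a regression test.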

jnozsc · 2016-11-21T23:04:40Z

In my case, one of my test URLs contains an uppercase letter. It worked with idna2003 but not with idna2008, so a fallback option would be great.

Q: How does IDNA2008 differ from IDNA2003?
A: It disallows about eight thousand characters that used to be valid, including all uppercase characters, full/half-width variants, symbols, and punctuation. It also interprets four characters differently.

http://unicode.org/faq/idn.html

Lukasa · 2016-11-22T08:42:40Z

Uppercase letters should be fine, we've enabled a mapping mode that should make it safe. Can you provide the URL that fails?

jnozsc · 2016-11-22T19:10:30Z

After investigating, I noticed the subdomain I am testing looks like

http://subdomain_1.example.com

and the _ breaks in idna2008,

which is already covered in this issue: https://github.com/kennethreitz/requests/issues/3683

nateprewitt · 2016-11-22T19:13:00Z

@jnozsc Thanks for the example. We've discussed underscores in a separate thread, which is now locked, but your issue will be resolved with #3695 soon.

quantenschaum · 2016-11-23T09:31:18Z

I have the same problem, which I reported in kjd/idna#32, but it seems to be more of an issue in requests than in idna.

@Lukasa's logic sounds right to me.

Lukasa · 2016-11-30T15:07:36Z

Should be fixed in Requests v2.12.2.

snarfed · 2021-05-08T14:40:08Z

FWIW, I use requests in project(s) where I want to handle web sites like these, eg https://todayinmarch2020.🦈🖥.ws/ , with domains that are valid IDNA2003 but invalid IDNA2008. To do that, I had to add in the domain2idna package and write code like this:

try:
  resp = requests.get(url, ...)
except requests.exceptions.InvalidURL:
  punycode = domain2idna(url)
  if punycode != url:
    # the domain is valid idna2003 but not idna2008. encode and try again.
    resp = requests.get(punycode, ...)

I get that these domains may break at some point in the future, but that's a big unknown, and they're registered and serving fine now. I don't have a specific proposal or stronger argument, I just wish I had a less awkward workaround. I have a wrapper around requests.*(), so fortunately I only had to do this in one place, but if I made direct requests calls everywhere and had to wrap every one, I'd be pretty unhappy.

Thanks in advance for listening!
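
A stdlib-only variant of the workaround above might look like the sketch below. It swaps in Python's built-in `idna` codec, which implements IDNA2003 (RFC 3490), in place of the domain2idna package. `to_idna2003_url` is a hypothetical helper name, any userinfo in the URL is dropped for brevity, and the sketch has not been checked against the emoji domains mentioned in this thread.

```python
from urllib.parse import urlsplit, urlunsplit


def to_idna2003_url(url):
    """Return url with its host encoded by the stdlib IDNA2003 codec."""
    parts = urlsplit(url)
    host = parts.hostname
    if not host:
        return url
    # The "idna" codec applies nameprep and Punycode label by label.
    netloc = host.encode("idna").decode("ascii")
    if parts.port is not None:
        netloc = "{}:{}".format(netloc, parts.port)
    return urlunsplit(parts._replace(netloc=netloc))


print(to_idna2003_url("http://☃.net/"))  # http://xn--n3h.net/
```

The rewritten URL could then be retried inside the `except requests.exceptions.InvalidURL` branch, on the assumption (per this thread) that already-encoded `xn--` hosts are passed through from v2.12.2 onward.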

snarfed · 2021-05-09T05:16:33Z

I ended up doing more research here, and I'm curious about a design decision. Was it a deliberate choice to build in just IDNA2008 and not full Punycode? Or was idna the only mature package you found, and it only supported IDNA2008? Or something else?

IDNA2008 evidently doesn't apply to all TLDs. Notably, unlike gTLDs, ccTLDs generally get to choose their own domain policies - background from Wikipedia, ICANN, a GoDaddy representative - and a handful of them have stuck with IDNA2003, UTS#46, or related variants. (Not to mention older proprietary schemes like ThaiURL 😁.)

Similarly, afaik domain owners can do whatever they want with their own subdomains. So thanks to Punycode, third level (and beyond) hostnames like https://🌏➡➡❤🔒.ayeshious.com and https://🔒🔒🔒.scotthelme.co.uk are not at risk of breaking due to gTLD registries enforcing IDNA2008 on pay-level domain registrations.

I know you all thought this through back in 2016, eg here and in #3683 (comment), and settled on automatically encoding IDNA2008 and passing through already-encoded hostnames. That seems a bit surprising, since IDNA2008 is only a subset of the currently legal encodings. Mind elaborating on why you didn't either push all encoding onto users, or build in the other legal standard encodings too, notably IDNA2003?

(Thanks again for listening!)

nlevitt referenced this issue in internetarchive/warcprox Nov 15, 2016

change tested idns to valid idna2008 now that requests 2.12.0 enforces that (for better or worse, see https://github.com/kennethreitz/requests/issues/3687)

3b16745

nateprewitt mentioned this issue Nov 16, 2016

idna bypass #3695

Merged

uSpike mentioned this issue Nov 16, 2016

requests 2.12+ breaks client api url canonical/pylxd#199

Closed

Lukasa closed this as completed Nov 30, 2016

snarfed mentioned this issue Jun 22, 2021

Relax IDNA2008 requirement? #5845

Closed

github-actions bot locked as resolved and limited conversation to collaborators Sep 25, 2021