Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

because of idna2008 enforcement some real urls that work in the browser are now broken #3687

Closed
nlevitt opened this issue Nov 15, 2016 · 13 comments

Comments

@nlevitt
Copy link

nlevitt commented Nov 15, 2016

Because of idna2008 enforcement 2.12.0 some real urls that work in the browser are now broken.

For example:
http://☃.net/
http://xn--n3h.net/

My suggestion would be to try idna2008 first, catch exception, then try idna2003.

>>> requests.get('http://xn--n3h.net/')
Traceback (most recent call last):
  File "/Users/nlevitt/workspace/warcprox/warcprox-ve35/lib/python3.5/site-packages/requests/models.py", line 370, in prepare_url
    host = idna.encode(host, uts46=True).decode('utf-8')
  File "/Users/nlevitt/workspace/warcprox/warcprox-ve35/lib/python3.5/site-packages/requests/packages/idna/core.py", line 355, in encode
    result.append(alabel(label))
  File "/Users/nlevitt/workspace/warcprox/warcprox-ve35/lib/python3.5/site-packages/requests/packages/idna/core.py", line 276, in alabel
    check_label(label)
  File "/Users/nlevitt/workspace/warcprox/warcprox-ve35/lib/python3.5/site-packages/requests/packages/idna/core.py", line 253, in check_label
    raise InvalidCodepoint('Codepoint {0} at position {1} of {2} not allowed'.format(_unot(cp_value), pos+1, repr(label)))
requests.packages.idna.core.InvalidCodepoint: Codepoint U+0027 at position 2 of "b'xn--n3h'" not allowed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/nlevitt/workspace/warcprox/warcprox-ve35/lib/python3.5/site-packages/requests/api.py", line 70, in get
    return request('get', url, params=params, **kwargs)
  File "/Users/nlevitt/workspace/warcprox/warcprox-ve35/lib/python3.5/site-packages/requests/api.py", line 56, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Users/nlevitt/workspace/warcprox/warcprox-ve35/lib/python3.5/site-packages/requests/sessions.py", line 474, in request
    prep = self.prepare_request(req)
  File "/Users/nlevitt/workspace/warcprox/warcprox-ve35/lib/python3.5/site-packages/requests/sessions.py", line 407, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "/Users/nlevitt/workspace/warcprox/warcprox-ve35/lib/python3.5/site-packages/requests/models.py", line 302, in prepare
    self.prepare_url(url, params)
  File "/Users/nlevitt/workspace/warcprox/warcprox-ve35/lib/python3.5/site-packages/requests/models.py", line 372, in prepare_url
    raise InvalidURL('URL has an invalid label.')
requests.exceptions.InvalidURL: URL has an invalid label.
@Lukasa
Copy link
Member

Lukasa commented Nov 15, 2016

Thanks for this report!

Ultimately, I'm not sure I agree. Fundamentally, those URIs will stop working at some point as browsers move over to IDNA 2008. They will have to move over time in at least some cases, because the .de domain mandates it, so there's no alternative there. So I have no objection to having "http://☃.net/".

Right now I think the real issue is that we attempt to IDNA-encode everything, and probably shouldn't. When a domain that is already IDNA-encoded is passed to us we should probably just leave it alone. The idna project is considering doing the same (see kjd/idna#27), but we can get there ahead of them by saying that for certain URIs we simply short-circuit the encoding.

The logic would have to be:

  • If the host portion of the URL begins with xn--
  • AND either
    • the URL is a bytestring that can be decoded using ASCII OR
    • the URL is a unicode string that can be encoded using ASCII
  • THEN we skip IDNA encoding
  • ELSE we do the IDNA encoding

That would, I think, cover this problem. How does that sound?

@nlevitt
Copy link
Author

nlevitt commented Nov 15, 2016

That sounds ok.

@nateprewitt
Copy link
Member

I believe we may not need to worry about the case where the URL is a bytestring. prepare_url transforms it into unicode at the beginning of the method and re-encodes it on the way out.

@nlevitt were you interested in providing a patch for this? If not, I believe I have a fix ready to go but will gladly defer to you.

@nlevitt
Copy link
Author

nlevitt commented Nov 16, 2016

Go for it @nateprewitt

@rockstar
Copy link

I've experienced this issue with pylxd using the unix socket interface and requests-unixsocket. In our case, we have to do parse('/path/to/unix.socket', safe='') as a raw unix path clearly doesn't substitute for a host. This breaks in requests 2.12+, as the urlencoded version %2Fpath%2Fto%2Funix.socket doesn't IDNA encode properly and raises an exception. I only comment here as it might make an interesting test case to prevent further regressions.

@jnozsc
Copy link

jnozsc commented Nov 21, 2016

In my case, one of my test url contain upper case letter, it worked with idna2003 but not with idna2008, if we have a fallback option that will be great.

Q: How does IDNA2008 differ from IDNA2003?
A: It disallows about eight thousand characters that used to be valid, including all uppercase characters, full/half-width variants, symbols, and punctuation. It also interprets four characters differently.

http://unicode.org/faq/idn.html

@Lukasa
Copy link
Member

Lukasa commented Nov 22, 2016

Uppercase letters should be fine, we've enabled a mapping mode that should make it safe. Can you provide the URL that fails?

@jnozsc
Copy link

jnozsc commented Nov 22, 2016

after investigate, I notice the subdomain I am testing is like

http://subdomain_1.example.com

and the _ breaks in idna2008

which already be covered in this issue https://github.com/kennethreitz/requests/issues/3683

@nateprewitt
Copy link
Member

@jnozsc Thanks for the example, we've had discussion regarding underscores in a separate thread which is now locked but your issue will be resolved with #3695 soon.

@quantenschaum
Copy link

I have the same problem, which I reported in kjd/idna#32, but it seems more to be an issue in requests than in idna.

@Lukasa's logic sounds right to me.

@Lukasa
Copy link
Member

Lukasa commented Nov 30, 2016

Should be fixed in Requests v2.12.2.

@snarfed
Copy link

snarfed commented May 8, 2021

FWIW, I use requests in project(s) where I want to handle web sites like these, eg https://todayinmarch2020.🦈🖥.ws/ , with domains that are valid IDNA2003 but invalid IDNA2008. To do that, I had to add in the domain2idna package and write code like this:

try:
  resp = requests.get(url, ...)
except requests.exceptions.InvalidURL:
  punycode = domain2idna(url)
  if punycode != url:
    # the domain is valid idna2003 but not idna2008. encode and try again.
    resp = requests.get(punycode, ...)

I get that these domains may break at some point in the future, but that's a big unknown, and they're registered and serving fine now. I don't have a specific proposal or stronger argument, I just wish I had a less awkward workaround. I have a wrapper around requests.*(), so fortunately I only had to do this in one place, but if i made direct requests calls everywhere and had to wrap every one, I'd be pretty unhappy.

Thanks in advance for listening!

@snarfed
Copy link

snarfed commented May 9, 2021

I ended up doing more research here, and I'm curious about a design decision. Was it a deliberate choice to build in just IDNA2008 and not full Punycode? Or was idna the only mature package you found, and it only supported IDNA2008? Or something else?

IDNA2008 evidently doesn't apply to all TLDs. Notably, unlike gTLDs, ccTLDs generally get to choose their own domain policies - background from Wikipedia, ICANN, a GoDaddy representative - and a handful of them have stuck with IDNA2003, UTS#46, or related variants. (Not to mention older proprietary schemes like ThaiURL 😁.)

Similarly, afaik domain owners can do whatever they want with their own subdomains. So thanks to Punycode, third level (and beyond) hostnames like https://🌏➡➡❤🔒.ayeshious.com and https://🔒🔒🔒.scotthelme.co.uk are not at risk of breaking due to gTLD regstries enforcing IDNA2008 on pay-level domain registrations.

I know you all thought this through back in 2016, eg here and in #3683 (comment), and settled on automatically encoding IDNA2008 and passing through already-encoded hostnames. That seems a bit surprising, since IDNA2008 is only a subset of the currently legal encodings. Mind elaborating on why you didn't either push all encoding onto users, or build in the other legal standard encodings too, notably IDNA2003?

(Thanks again for listening!)

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 25, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants