Error when requesting URL which contains emojis or certain characters #6453

emilio-cea · 2023-05-10T09:50:51Z

When performing a GET request to a URL that contains emojis, the server responds with a redirect whose Location header also contains emojis. Judging from the stack trace, I believe requests fails while handling the redirect when the URL contains emojis or certain other characters, though I have not investigated further.

This is the URL in question: https://www.nulled.to/topic/512174-income-ocean-�-hf-leak-�☀️/

It can be found on a forum page, where the source HTML contains these emojis and characters:
https://www.nulled.to/forum/9-tutorials-guides-ebooks-etc/page-779?prune_day=100&sort_by=Z-A&sort_key=start_date&topicfilter=all

Note that the forum is protected by Cloudflare, so the request sometimes returns a 403, in which case the error described below does not occur. That leads me to believe the error happens only when a redirect occurs: the Location header that requests tries to follow also contains emojis, and decoding it fails.

Expected Result

Making the request to the site successfully and returning HTML source code.

Actual Result

An error was raised:
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 48-50: invalid continuation byte

This is the stacktrace:

File "workdir/env/lib/python3.7/site-packages/requests/api.py", line 76, in get
  return request('get', url, params=params, **kwargs)
File "workdir/env/lib/python3.7/site-packages/requests/api.py", line 61, in request
  return session.request(method=method, url=url, **kwargs)
File "workdir/env/lib/python3.7/site-packages/requests/sessions.py", line 542, in request
  resp = self.send(prep, **send_kwargs)
File "workdir/env/lib/python3.7/site-packages/requests/sessions.py", line 677, in send
  history = [resp for resp in gen]
File "workdir/env/lib/python3.7/site-packages/requests/sessions.py", line 677, in <listcomp>
  history = [resp for resp in gen]
File "workdir/env/lib/python3.7/site-packages/requests/sessions.py", line 150, in resolve_redirects
  url = self.get_redirect_target(resp)
File "workdir/env/lib/python3.7/site-packages/requests/sessions.py", line 116, in get_redirect_target
  return to_native_string(location, 'utf8')
File "workdir/env/lib/python3.7/site-packages/requests/_internal_utils.py", line 25, in to_native_string
  out = string.decode(encoding)

Reproduction Steps

import requests
url = "https://www.nulled.to/topic/512174-income-ocean-�-hf-leak-�☀️/"
r = requests.get(url)
print(r.content)

System Information

$ python -m requests.help

{
  "chardet": {
    "version": "4.0.0"
  },
  "cryptography": {
    "version": ""
  },
  "idna": {
    "version": "2.10"
  },
  "implementation": {
    "name": "CPython",
    "version": "3.7.3"
  },
  "platform": {
    "release": "4.19.0-22-amd64",
    "system": "Linux"
  },
  "pyOpenSSL": {
    "openssl_version": "",
    "version": null
  },
  "requests": {
    "version": "2.25.1"
  },
  "system_ssl": {
    "version": "101010ef"
  },
  "urllib3": {
    "version": "1.26.3"
  },
  "using_pyopenssl": false
}

sigmavirus24 · 2023-05-10T11:10:49Z

This is related to #3969. We're trying to use utf8 to handle the redirect URL but the translation from bytes to utf8 string is what's failing.

I suspect there's something other than emoji in that URL.
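
To illustrate the failure mode described above, here is a minimal offline sketch. It assumes the path visible in the traceback: the header value arrives as a latin-1 decoded string and is then decoded as UTF-8. `decode_location` and the sample bytes are hypothetical stand-ins, not the library's code and not the bytes the forum actually sent.

```python
def decode_location(raw_header: bytes) -> str:
    """Simplified, hypothetical stand-in for the decoding path in the traceback."""
    # Header values reach Python code as str decoded from latin-1.
    header_as_str = raw_header.decode("latin1")
    # Going back to bytes via latin-1 is lossless; the UTF-8 decode is what can fail.
    return header_as_str.encode("latin1").decode("utf8")


# A Location value that really is UTF-8 round-trips fine, emoji included.
print(decode_location("/topic/leak-☀️/".encode("utf8")))

# Illustrative bytes only: 0xE9 is "é" in latin-1 but an incomplete sequence in UTF-8.
try:
    decode_location(b"/topic/caf\xe9-leak/")
except UnicodeDecodeError as exc:
    print("decode failed:", exc.reason)
```

If this reading is right, emoji alone would not trigger the error; a Location header containing bytes that are not valid UTF-8 would.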

emilio-cea · 2023-05-10T11:12:33Z

It seems to be the replacement character: U+FFFD REPLACEMENT CHARACTER

emilio-cea · 2023-05-10T11:26:15Z

I've also seen that there are a number of issues related to this. The best solution would be to find out which encoding the browser uses and replicate it. In Firefox, for instance, the URL is encoded with something other than UTF-8 and no redirect happens, but I have not been able to find out which encoding that is.
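
One way to sidestep guessing the encoding, sketched here purely as an assumption and not as what Firefox or requests does, is to percent-encode the raw header bytes instead of decoding them. `percent_encode_location` and `SAFE_CHARS` are hypothetical names; the input is assumed to be a header string as delivered by the HTTP layer, that is, decoded from latin-1.

```python
from urllib.parse import quote

# RFC 3986 reserved characters plus "%", so URL structure and existing escapes are kept.
SAFE_CHARS = "/:?#[]@!$&'()*+,;=%"


def percent_encode_location(location: str) -> str:
    """Escape every non-ASCII byte of a latin-1 decoded header value."""
    # Recover the original bytes; this assumes no character above U+00FF.
    raw = location.encode("latin1")
    # quote() accepts bytes and escapes anything outside the safe set as %XX.
    return quote(raw, safe=SAFE_CHARS)


# Illustrative value only, not the header the forum actually returned.
print(percent_encode_location("/topic/caf\xe9-leak/"))
```

The result is plain ASCII, so it can be followed without choosing a text encoding; whether the server accepts the escaped form is a separate question.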

harris-ahmad · 2023-06-04T00:26:07Z

We could fix this by maintaining a list of common encodings. Wrap the piece of code responsible for the encoding in a try/except block, loop through every encoding in the list, and try each one on the given URL. The first one that works breaks the loop, and that should pretty much take care of the bug.

sigmavirus24 · 2023-06-04T01:10:31Z

Well, which encodings are most common depends on what part of the world you're in. So we'd be looping for a while, which would drastically hurt performance.
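
The fallback loop proposed above can be sketched roughly as follows. This is an illustration only: `decode_with_fallback` and the candidate list are hypothetical, and nothing like this exists in requests.

```python
from typing import Iterable, Tuple

# Hypothetical candidate list. latin-1 accepts any byte sequence, so it must come last.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252", "latin-1")


def decode_with_fallback(
    raw: bytes, encodings: Iterable[str] = CANDIDATE_ENCODINGS
) -> Tuple[str, str]:
    """Return (text, encoding) for the first candidate that decodes without error."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("no candidate encoding could decode the header")


# Illustrative bytes only: invalid as UTF-8, but valid as cp1252.
print(decode_with_fallback(b"/topic/caf\xe9-leak/"))
```

Besides the cost of a long candidate list, a decode that succeeds is not necessarily the right one: many byte sequences decode cleanly under several single-byte encodings, so the order of the list decides the answer.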

MozarM · 2023-06-22T09:28:35Z

Try reproducing the error with both the same URL and a different URL that contains emojis or certain characters; there seems to be an issue with the given URL. I am able to get the content from a different URL containing emojis by decoding it with a specific encoding.

import requests
url = "https://www.example.com/🌟emoji-example🌟"
r = requests.get(url)
content = r.content.decode('ISO-8859-1')
print(content)
