Error when requesting URL which contains emojis or certain characters #6453

Open
emilio-cea opened this issue May 10, 2023 · 6 comments

@emilio-cea

When performing a GET request to a URL that contains emojis, the server responds with a redirect whose Location header also contains emojis. Judging from the stack trace, I believe requests fails while handling the redirect when the URL contains emojis or certain other characters, though further investigation may narrow this down.

This is the URL in question: https://www.nulled.to/topic/512174-income-ocean-�-hf-leak-�☀️/

It can be found on a forum page, where the source HTML contains these emojis and characters:
https://www.nulled.to/forum/9-tutorials-guides-ebooks-etc/page-779?prune_day=100&sort_by=Z-A&sort_key=start_date&topicfilter=all

Note that the forum is protected by Cloudflare, so the request can come back with a 403 error, in which case the error described below does not occur. That leads me to believe the error happens only on a redirect: the Location header that requests tries to follow also contains emojis, and that is where the decoding error is raised.

Expected Result

The request succeeds and returns the HTML source of the page.

Actual Result

An error was raised:
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 48-50: invalid continuation byte

This is the stacktrace:

File "workdir/env/lib/python3.7/site-packages/requests/api.py", line 76, in get
  return request('get', url, params=params, **kwargs)
File "workdir/env/lib/python3.7/site-packages/requests/api.py", line 61, in request
  return session.request(method=method, url=url, **kwargs)
File "workdir/env/lib/python3.7/site-packages/requests/sessions.py", line 542, in request
  resp = self.send(prep, **send_kwargs)
File "workdir/env/lib/python3.7/site-packages/requests/sessions.py", line 677, in send
  history = [resp for resp in gen]
File "workdir/env/lib/python3.7/site-packages/requests/sessions.py", line 677, in <listcomp>
  history = [resp for resp in gen]
File "workdir/env/lib/python3.7/site-packages/requests/sessions.py", line 150, in resolve_redirects
  url = self.get_redirect_target(resp)
File "workdir/env/lib/python3.7/site-packages/requests/sessions.py", line 116, in get_redirect_target
  return to_native_string(location, 'utf8')
File "workdir/env/lib/python3.7/site-packages/requests/_internal_utils.py", line 25, in to_native_string
  out = string.decode(encoding)
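
The last two frames can be mimicked without any network access. The sketch below assumes that requests re-encodes the Location header as latin-1 before the UTF-8 decode shown in the traceback (the traceback itself only shows the decode step); `decode_location_like_requests` and both header values are hypothetical names and data made up for illustration.

```python
def decode_location_like_requests(header_value: str) -> str:
    """Mimic the redirect handling: latin-1 round-trip, then a strict UTF-8 decode."""
    # Assumption: the header text was produced by decoding the wire bytes as latin-1,
    # so encoding it back recovers the original bytes.
    raw = header_value.encode("latin1")
    # This is the step that corresponds to `string.decode(encoding)` in the traceback.
    return raw.decode("utf8")


# A Location value whose bytes on the wire were valid UTF-8.
utf8_location = "/topic/☀/".encode("utf8").decode("latin1")
# A Location value whose bytes on the wire were NOT valid UTF-8 (0xE9 is "é" in latin-1).
legacy_location = b"/topic/caf\xe9/".decode("latin1")

print(decode_location_like_requests(utf8_location))

try:
    decode_location_like_requests(legacy_location)
except UnicodeDecodeError as exc:
    print("decode failed:", exc.reason)
```

If this reading is right, the error does not depend on emojis as such: any Location header whose raw bytes are not valid UTF-8 would trigger it.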

Reproduction Steps

import requests
url = "https://www.nulled.to/topic/512174-income-ocean-�-hf-leak-�☀️/"
r = requests.get(url)
print(r.content)

System Information

$ python -m requests.help
{
  "chardet": {
    "version": "4.0.0"
  },
  "cryptography": {
    "version": ""
  },
  "idna": {
    "version": "2.10"
  },
  "implementation": {
    "name": "CPython",
    "version": "3.7.3"
  },
  "platform": {
    "release": "4.19.0-22-amd64",
    "system": "Linux"
  },
  "pyOpenSSL": {
    "openssl_version": "",
    "version": null
  },
  "requests": {
    "version": "2.25.1"
  },
  "system_ssl": {
    "version": "101010ef"
  },
  "urllib3": {
    "version": "1.26.3"
  },
  "using_pyopenssl": false
}
@sigmavirus24
Contributor

This is related to #3969. We try to handle the redirect URL as UTF-8, but decoding the header bytes into a UTF-8 string is what's failing.

I suspect there's something other than emoji in that URL.

@emilio-cea
Author

It seems to be the replacement character: U+FFFD REPLACEMENT CHARACTER.
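
For context, a small sketch of how U+FFFD typically ends up in text. A genuine U+FFFD is valid UTF-8 and would not cause a decode error by itself; it usually appears because some earlier step decoded invalid bytes leniently. The byte string below is a made-up example, not the bytes of the real URL.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER

# A real U+FFFD is ordinary UTF-8 and survives a round-trip.
assert REPLACEMENT.encode("utf-8") == b"\xef\xbf\xbd"
assert b"\xef\xbf\xbd".decode("utf-8") == REPLACEMENT

# Lenient decoding substitutes U+FFFD for bytes that are not valid UTF-8.
raw = b"income-ocean-\xe9-hf-leak"  # hypothetical bytes with one non-UTF-8 byte
lenient = raw.decode("utf-8", errors="replace")
print(lenient)

# Strict decoding of the same bytes raises instead of substituting.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("strict decode failed:", exc.reason)
```

So a U+FFFD visible in the page source may mean the original bytes were already in some non-UTF-8 encoding, which would fit the error above.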

@emilio-cea
Author

I've also seen that there are a bunch of issues related to this. The best solution would be to find out which encoding the browser uses and replicate it: in Firefox, for instance, the URL is encoded with something other than UTF-8 and no redirect happens. Alas, I have not been able to find out which encoding that is.
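
One way to narrow this down might be to percent-encode a path under a few candidate encodings and compare the results with the request line shown in the browser's network tab. The sketch below uses `urllib.parse.quote` from the standard library; the path is a hypothetical stand-in, and the choice of latin-1 as the second candidate is only an assumption.

```python
from urllib.parse import quote

path = "/topic/café/"  # hypothetical path, not the one from this report

# quote() percent-encodes using UTF-8 unless told otherwise.
as_utf8 = quote(path)
# The same path under a legacy single-byte encoding gives different escapes.
as_latin1 = quote(path, encoding="latin-1")

print(as_utf8)
print(as_latin1)
```

Whichever form matches what the browser actually sends would hint at the encoding in use.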

@harris-ahmad

We could fix this by maintaining a list of common encodings. Wrap the piece of code responsible for decoding the URL in a try/except block, loop through the encodings in the list, and try each one on the given URL. The first one that works breaks the loop, and the code will be pretty much bug free.
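
A minimal sketch of that idea, operating on raw bytes. The function name `decode_with_fallbacks` and the candidate list are assumptions made for illustration; this is not how requests behaves today.

```python
from typing import Iterable, Tuple


def decode_with_fallbacks(
    raw: bytes,
    encodings: Iterable[str] = ("utf-8", "cp1252", "latin-1"),
) -> Tuple[str, str]:
    """Return (text, encoding_name) for the first candidate that decodes `raw`."""
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings could decode the value")


print(decode_with_fallbacks(b"/caf\xc3\xa9/"))  # valid UTF-8, first candidate wins
print(decode_with_fallbacks(b"/caf\xe9/"))      # not UTF-8, falls through to cp1252
```

Note that latin-1 accepts every byte sequence, so a list ending in it never fails, and "the first encoding that does not raise" is not necessarily the one the server intended.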

@sigmavirus24
Contributor

Well, the most common encodings you might encounter depend on what part of the world you're in. So we'd be looping for a while, which would drastically hurt performance.

@MozarM

MozarM commented Jun 22, 2023

I tried to reproduce the error with the same URL and with a different URL that contains emojis or certain characters, and it seems the issue is specific to the given URL. I was able to get the content from a different URL containing emojis by decoding the response with a specific encoding.

import requests
url = "https://www.example.com/🌟emoji-example🌟"
r = requests.get(url)
content = r.content.decode('ISO-8859-1')
print(content)
