Making charset auto-detection strictly opt-in. #2152

Closed
wants to merge 14 commits into from

tomchristie
Member

@tomchristie tomchristie commented Mar 29, 2022

From discussion #2083

Remove our auto-charset-guessing behaviour for cases where response.text is accessed, but the response has no explicit charset. The behaviour is too fuzzy and inconsistent to support as a default.

Instead, we allow the default encoding to be set explicitly. This pull request covers:

  • Drop response.apparent_encoding in favour of an explicit response.default_encoding.
  • Type of response.encoding becomes str, not Optional[str].
  • Support Response(default_encoding=...).
  • Support Client(default_encoding=...).
  • Remove charset_normalizer dependency.
  • Support default_encoding="charset_normalizer" and default_encoding="chardet" to allow for auto-detection behaviours.
  • Documentation on text encodings.
  • Fix up test cases.

Optional follow-up that'd also fit neatly into this...

  • Support Response(encoding_errors=...).
  • Support Client(encoding_errors=...).

This would allow, for example, switching the default encoding errors from "replace" to "strict".
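
The difference between the two error modes can be seen with Python's built-in bytes.decode, independently of httpx. A minimal sketch, using a made-up byte string that is valid Latin-1 but not valid UTF-8:

```python
# Bytes that spell "café" in Latin-1 but are not valid UTF-8.
raw = b"caf\xe9"

# "replace" substitutes U+FFFD for the undecodable byte.
lenient = raw.decode("utf-8", errors="replace")

# "strict" raises instead of silently substituting.
try:
    raw.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

With "replace" the caller always gets a string back, at the cost of possibly losing data. With "strict" a decoding problem surfaces as an exception.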


Documentation

The httpx package includes two optionally installable codecs, which provide support for character-set autodetection.

This can be useful when you need the textual content of responses rather than the raw bytes, the Content-Type header does not include a charset value, and the character set of the responses is unknown.

There are two commonly used packages for this in the Python ecosystem: chardet and charset_normalizer.


Using the default encoding.

To understand this better let's start by looking at the default behaviour without character-set auto-detection...

import httpx

# Instantiate a client with the default configuration.
client = httpx.Client()

# Using the client...
response = client.get(...)
print(response.encoding)  # This will print either the charset given in
                          # the Content-Type header, or else "utf-8".
print(response.text)  # The text will be decoded either with the Content-Type
                      # charset, or using "utf-8".

This is normally absolutely fine. Most servers will respond with a properly formatted Content-Type header, including a charset encoding. And in most cases where no charset encoding is included, UTF-8 is very likely to be used, since it is now so widely adopted.
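
The fallback rule described above can be sketched with the standard library alone. This is an illustration of the rule, not httpx's internal implementation; the helper name resolve_encoding is hypothetical.

```python
from email.message import Message


def resolve_encoding(content_type, default="utf-8"):
    """Return the charset named in a Content-Type value, or `default`.

    Hypothetical helper for illustration only.
    """
    if content_type is None:
        return default
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lowercased charset, or None if absent.
    return msg.get_content_charset() or default


with_charset = resolve_encoding("text/html; charset=ISO-8859-1")
without_charset = resolve_encoding("text/html")
no_header = resolve_encoding(None)
```

Only responses that carry no charset at all fall through to the default; an explicit charset always wins.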

Using an explicit encoding.

In some cases we might be making requests to a site where the server does not set any character set information explicitly, but we know what the encoding is. In this case it's best to set the default encoding explicitly on the client.

import httpx

# Instantiate a client with a Japanese character set as the default encoding.
client = httpx.Client(default_encoding="shift-jis")

# Using the client...
response = client.get(...)
print(response.encoding)  # This will print either the charset given in
                          # the Content-Type header, or else "shift-jis".
print(response.text)  # The text will be decoded either with the Content-Type
                      # charset, or using "shift-jis".
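
To see why the explicit default matters, here is a small standalone sketch that decodes the same Shift JIS bytes both ways. It uses only Python's built-in codecs, and the sample text is made up for illustration.

```python
# Sample text encoded the way a Shift JIS server would send it.
text = "こんにちは"
raw = text.encode("shift_jis")

# Decoding with the utf-8 fallback yields replacement characters.
as_utf8 = raw.decode("utf-8", errors="replace")

# Decoding with the known encoding recovers the original text.
as_shift_jis = raw.decode("shift_jis")
```

The bytes are identical in both cases; only the choice of fallback encoding decides whether the text is readable.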

Using character set auto-detection.

In cases where the server is not reliably including character set information, and where we don't know what encoding is being used, we can enable auto-detection to make a best-guess attempt when decoding from bytes to text.

import codecs
import httpx


# Register the custom charset autodetect codecs.
# These codecs are then available as "chardet" and "charset_normalizer".
codecs.register(httpx.charset_autodetect)

# Instantiate a client using "chardet" character set autodetection.
# When no explicit charset information is present on the response,
# the chardet package will be used to make a best-guess attempt.
client = httpx.Client(default_encoding="chardet")

# Using the client with character-set autodetection enabled.
response = client.get(...)
print(response.encoding)  # This will print either the charset given in
                          # the Content-Type header, or else "chardet".
print(response.text)  # The text will be decoded either with the Content-Type
                      # charset, or using "chardet" autodetection.
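
For readers unfamiliar with the codec registry, the mechanism that codecs.register relies on can be sketched as follows. This is not the httpx.charset_autodetect implementation proposed in this pull request. The codec name "utf8_or_latin1" and its trivial "try UTF-8, else Latin-1" guess are assumptions made purely for illustration.

```python
import codecs


def _decode(data, errors="strict"):
    # bytes.decode() may hand a memoryview to custom codecs, so copy to bytes.
    raw = bytes(data)
    try:
        return raw.decode("utf-8"), len(raw)
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this never fails.
        return raw.decode("latin-1"), len(raw)


def _encode(text, errors="strict"):
    return text.encode("utf-8", errors), len(text)


def search(name):
    # The registry calls every registered search function with the
    # normalized codec name, and expects a CodecInfo or None.
    if name == "utf8_or_latin1":
        return codecs.CodecInfo(encode=_encode, decode=_decode, name=name)
    return None


codecs.register(search)

from_utf8 = b"caf\xc3\xa9".decode("utf8_or_latin1")
from_latin1 = b"caf\xe9".decode("utf8_or_latin1")
```

Once registered, the name can be used anywhere Python accepts an encoding name, which is what would let a string such as "chardet" be passed as default_encoding.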

Changelog

  • The default encoding when no character set information is present on a response is now utf-8.
  • The charset_normalizer package is no longer a required dependency.
  • The response.apparent_encoding property is no longer supported.
  • The response.encoding property now returns a string, rather than an optional string.
  • Added Client(default_encoding=...)
  • Added Client(default_encoding="chardet") for enabling character-set autodetection with the chardet package.
  • Added Client(default_encoding="charset_normalizer") for enabling character-set autodetection with the charset_normalizer package.
  • Installation of character-set autodetection is via Python's standard codec registry, using codecs.register(httpx.charset_autodetect)

@tomchristie tomchristie marked this pull request as ready for review March 30, 2022 13:36
@tomchristie tomchristie requested a review from a team March 30, 2022 13:37
Member Author

@tomchristie tomchristie left a comment

(Self review)

I think the core "make charset auto-detection strictly opt-in" is a good plan.

I also think that adding support for default_encoding=... (and encoding_errors=...) makes sense.

However, I'm not convinced by having installable codecs for "charset_normalizer" and "chardet". It's a bit magic. Here's a couple of different alternatives, that seem simpler to me...

  1. Add support for default_encoding=..., and allow it to either be a string, or a callable.
import charset_normalizer
import httpx

# A client with utf-8 as the fallback encoding.
client = httpx.Client(default_encoding="utf-8")

# A client with autodetect as the fallback encoding.
def autodetect(content: bytes) -> str:
    # detect() can report the encoding as None, so fall back to utf-8.
    return charset_normalizer.detect(content).get("encoding") or "utf-8"

client = httpx.Client(default_encoding=autodetect)
  2. Add support for default_encoding=..., and use the string "autodetect" if you want to switch to auto detection.
# A client with utf-8 as the fallback encoding.
client = httpx.Client(default_encoding="utf-8")

# A client with autodetect as the fallback encoding.
client = httpx.Client(default_encoding="autodetect")

My preference out of these is probably (2). I prefer that it doesn't overload the default_encoding type into two different cases, and it's obvious to me from reading the code what the intent is.
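
For comparison, the "string or callable" overloading that option (1) implies could be resolved with something like the sketch below. The names pick_encoding and ascii_or_latin1 are hypothetical, and the detector is a deliberately trivial stand-in for a real autodetection package.

```python
from typing import Callable, Union

# Option (1): the default is either an encoding name or a detector callable.
DefaultEncoding = Union[str, Callable[[bytes], str]]


def pick_encoding(default_encoding: DefaultEncoding, content: bytes) -> str:
    """Resolve the fallback encoding for a response body (hypothetical)."""
    if callable(default_encoding):
        return default_encoding(content)
    return default_encoding


def ascii_or_latin1(content: bytes) -> str:
    # Stand-in detector: a real one would call chardet or charset_normalizer.
    return "ascii" if content.isascii() else "latin-1"


fixed = pick_encoding("utf-8", b"caf\xe9")
detected = pick_encoding(ascii_or_latin1, b"caf\xe9")
plain = pick_encoding(ascii_or_latin1, b"cafe")
```

The callable branch is the two-case overloading of the default_encoding type that the comment above is wary of.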

@tomchristie
Member Author

Closing this off in favour of #2165

@tomchristie tomchristie closed this Apr 5, 2022
@tomchristie tomchristie deleted the diable-charset-autoguess branch April 5, 2022 14:33