Making charset auto-detection strictly opt-in. #2152

tomchristie · 2022-03-29T10:26:00Z

From discussion #2083

Remove our auto-charset-guessing behaviour for cases where response.text is accessed, but the response has no explicit charset. The behaviour is too fuzzy and inconsistent to support as a default.

Instead, we allow the default encoding to be set explicitly.

Drop response.apparent_encoding in favour of an explicit response.default_encoding.
Type of response.encoding becomes str, not Optional[str].
Support Response(default_encoding=...).
Support Client(default_encoding=...).
Remove charset_normalizer dependency.
Support default_encoding="charset_normalizer" and default_encoding="chardet" to allow for auto-detection behaviours.
Documentation on text encodings.
Fix up test cases.

Optional follow-up that'd also fit neatly into this...

Support Response(encoding_errors=...).
Support Client(encoding_errors=...).

For example, this would allow switching the default error handling from "replace" to "strict".
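To make the difference between the two error modes concrete, here is a minimal sketch using plain `bytes.decode()`. It does not use httpx at all, and the sample bytes are an assumption chosen for illustration.

```python
# Illustration only: plain bytes.decode(), not the proposed httpx API.
raw = b"caf\xe9"  # "café" encoded as latin-1, which is not valid UTF-8

# errors="replace" substitutes U+FFFD for bytes that cannot be decoded.
lenient = raw.decode("utf-8", errors="replace")

# errors="strict" raises UnicodeDecodeError instead of substituting.
try:
    raw.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

With "replace" the caller always gets some text back, possibly with replacement characters in it. With "strict" a mismatched encoding surfaces as an exception.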

Documentation

The httpx package includes two optionally installable codecs, which provide support for character-set autodetection.

This is useful when you need the textual content of a response rather than its raw bytes, but the Content-Type header does not include a charset value and the character set of the response is unknown.

There are two commonly used packages for this in the Python ecosystem.

chardet: https://chardet.readthedocs.io/
charset_normalizer: https://charset-normalizer.readthedocs.io/

Using the default encoding.

To understand this better let's start by looking at the default behaviour without character-set auto-detection...

import httpx

# Instantiate a client with the default configuration.
client = httpx.Client()

# Using the client...
response = client.get(...)
print(response.encoding)  # Prints the charset given in the Content-Type
                          # header if there is one, or else "utf-8".
print(response.text)  # The text is decoded with the Content-Type charset
                      # if there is one, or else with "utf-8".

This is normally absolutely fine. Most servers will respond with a properly formatted Content-Type header, including a charset encoding. And in most cases where no charset encoding is included, UTF-8 is very likely to be used, since it is now so widely adopted.
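The fallback rule described here can be sketched with the standard library alone. The helper name `resolve_encoding` is hypothetical and this is not httpx's implementation; it only illustrates "use the Content-Type charset if present, otherwise the default".

```python
from email.message import Message
from typing import Optional


def resolve_encoding(
    content_type: Optional[str], default_encoding: str = "utf-8"
) -> str:
    """Hypothetical helper: Content-Type charset if present, else the default."""
    if content_type is None:
        return default_encoding
    # Let the stdlib header parser pull out the charset parameter.
    message = Message()
    message["Content-Type"] = content_type
    charset = message.get_param("charset")
    if isinstance(charset, str) and charset:
        return charset
    return default_encoding
```

For instance, `resolve_encoding("text/html; charset=iso-8859-1")` picks the header's charset, while `resolve_encoding("text/html")` falls back to "utf-8".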

Using an explicit encoding.

In some cases we might be making requests to a site where the server does not set any character set information, but we know what the encoding is. In this case it's best to set the default encoding explicitly on the client.

import httpx

# Instantiate a client with a Japanese character set as the default encoding.
client = httpx.Client(default_encoding="shift-jis")

# Using the client...
response = client.get(...)
print(response.encoding)  # Prints the charset given in the Content-Type
                          # header if there is one, or else "shift-jis".
print(response.text)  # The text is decoded with the Content-Type charset
                      # if there is one, or else with "shift-jis".
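To see why the explicit default matters, the following sketch decodes the same Shift JIS bytes twice using only `bytes.decode()`. The sample text is an assumption for illustration, and no request is made.

```python
# Illustration only: the same body decoded with two different fallbacks.
body = "こんにちは".encode("shift_jis")

# Decoding as UTF-8 with errors="replace" yields replacement characters.
as_utf8 = body.decode("utf-8", errors="replace")

# Decoding with the encoding we know the site uses recovers the text.
as_shift_jis = body.decode("shift-jis")
```

Here `as_shift_jis` round-trips to the original string, while `as_utf8` contains U+FFFD characters in place of the Japanese text.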

Using character set auto-detection.

In cases where the server is not reliably including character set information, and where we don't know what encoding is being used, we can enable auto-detection to make a best-guess attempt when decoding from bytes to text.

import codecs
import httpx


# Register the custom charset autodetect codecs.
# These codecs are then available as "chardet" and "charset_normalizer".
codecs.register(httpx.charset_autodetect)

# Instantiate a client using "chardet" character set autodetection.
# When no explicit charset information is present on the response,
# the chardet package will be used to make a best-guess attempt.
client = httpx.Client(default_encoding="chardet")

# Using the client with character-set autodetection enabled.
response = client.get(...)
print(response.encoding)  # Prints the charset given in the Content-Type
                          # header if there is one, or else "chardet".
print(response.text)  # The text is decoded with the Content-Type charset
                      # if there is one, or else using "chardet" autodetection.
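For readers unfamiliar with Python's codec registry, here is a rough sketch of how a search function passed to `codecs.register()` can expose a "detecting" codec under a custom name. The codec name `sketch_autodetect` and its try-UTF-8-then-latin-1 strategy are assumptions for illustration. This is not the `httpx.charset_autodetect` implementation, and it does not use chardet or charset_normalizer.

```python
import codecs
from typing import Optional, Tuple


def _decode_best_guess(data, errors: str = "strict") -> Tuple[str, int]:
    """Try UTF-8 first, then fall back to latin-1."""
    raw = bytes(data)
    try:
        return raw.decode("utf-8"), len(raw)
    except UnicodeDecodeError:
        # latin-1 maps every byte to a code point, so this cannot fail.
        return raw.decode("latin-1"), len(raw)


def _encode_utf8(text: str, errors: str = "strict") -> Tuple[bytes, int]:
    return text.encode("utf-8", errors), len(text)


def sketch_autodetect(name: str) -> Optional[codecs.CodecInfo]:
    """Codec search function: answer only for our hypothetical codec name."""
    if name == "sketch_autodetect":
        return codecs.CodecInfo(
            encode=_encode_utf8, decode=_decode_best_guess, name=name
        )
    return None


codecs.register(sketch_autodetect)

guessed_utf8 = "café".encode("utf-8").decode("sketch_autodetect")
guessed_latin1 = "café".encode("latin-1").decode("sketch_autodetect")
```

Once registered, the name can be used anywhere an encoding name is accepted, which is what lets a plain string such as "chardet" select detection behaviour.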

Changelog

The default encoding when no character set information is present on a response is now utf-8.
The charset_normalizer package is no longer a required dependency.
The response.apparent_encoding property is no longer supported.
The response.encoding property now returns a string, rather than an optional string.
Added Client(default_encoding=...)
Added Client(default_encoding="chardet") for enabling character-set autodetection with the chardet package.
Added Client(default_encoding="charset_normalizer") for enabling character-set autodetection with the charset_normalizer package.
Installation of character-set autodetection is via Python's standard codec registry, using codecs.register(httpx.charset_autodetect)

tomchristie

(Self review)

I think the core idea, "make charset auto-detection strictly opt-in", is a good plan.

I also think that adding support for default_encoding=... (and encoding_errors=...) makes sense.

However, I'm not convinced by having installable codecs for "charset_normalizer" and "chardet". It's a bit magic. Here's a couple of different alternatives, that seem simpler to me...

(1) Add support for default_encoding=..., and allow it to either be a string, or a callable.

# A client with utf-8 as the fallback encoding.
client = httpx.Client(default_encoding="utf-8")

# A client with autodetect as the fallback encoding.
import charset_normalizer

def autodetect(content: bytes) -> str:
    # detect() reports None as the encoding when it cannot make a guess.
    return charset_normalizer.detect(content)["encoding"] or "utf-8"

client = httpx.Client(default_encoding=autodetect)

(2) Add support for default_encoding=..., and use the string "autodetect" if you want to switch to auto detection.

# A client with utf-8 as the fallback encoding.
client = httpx.Client(default_encoding="utf-8")

# A client with autodetect as the fallback encoding.
client = httpx.Client(default_encoding="autodetect")

My preference out of these is probably (2). I prefer that it doesn't overload the default_encoding type into two different cases, and it's obvious to me from reading the code what the intent is.
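To illustrate what "overloading the default_encoding type" means in practice, here is a minimal sketch of how a string-or-callable argument might be resolved. The names `resolve_default_encoding` and `utf8_or_latin1` are hypothetical and the detector is deliberately trivial; none of this is taken from the httpx codebase.

```python
from typing import Callable, Union

# Either a fixed encoding name, or a function that inspects the body.
DefaultEncoding = Union[str, Callable[[bytes], str]]


def resolve_default_encoding(
    default_encoding: DefaultEncoding, content: bytes
) -> str:
    """Hypothetical resolver: call the detector, or return the fixed name."""
    if callable(default_encoding):
        return default_encoding(content)
    return default_encoding


def utf8_or_latin1(content: bytes) -> str:
    """Toy detector used only for this sketch."""
    try:
        content.decode("utf-8")
    except UnicodeDecodeError:
        return "latin-1"
    return "utf-8"
```

Every consumer of the argument has to branch on its type, which is the cost of the callable form. The string form avoids that branch but fixes the set of available detection strategies.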

tomchristie · 2022-04-05T14:32:58Z

Closing this off in favour of #2165

tomchristie added 13 commits March 29, 2022 11:11

Drop .apparent_encoding, in favour of .default_encoding

8de984f

Add support for httpx.Client(default_encoding='chardet') and httpx.Client(default_encoding='charset_normalizer')

72f6b0a

Docs for characterset autodetection

0beeced

Fix requirements

5f80256

Fix text decoding

5189473

Drop unused import

fb39159

Fix-up charset autodetection tests

22dddd2

Merge branch 'master' into diable-charset-autoguess

131357f

Linting

cc120fc

Add missing import to tests

05435f0

Fix up test cases

f549579

Drop now-incorrect portion of test case

3b83054

Add chardet test case, and add 'nocover' lines

6136c0f

tomchristie marked this pull request as ready for review March 30, 2022 13:36

tomchristie requested a review from a team March 30, 2022 13:37

Merge branch 'master' into diable-charset-autoguess

feee0f7

tomchristie commented Apr 5, 2022

tomchristie mentioned this pull request Apr 5, 2022

Make charset auto-detection optional. #2165

Merged

5 tasks

tomchristie closed this Apr 5, 2022

tomchristie deleted the diable-charset-autoguess branch April 5, 2022 14:33
