
Make charset auto-detection optional. #2165

Merged

Merged 16 commits into master from add-default-encoding-support on May 23, 2022

Conversation

tomchristie (Member) commented Apr 5, 2022

From discussion #2083

(This is a simpler alternative to a previous attempt at this functionality with #2152)

  • Add Response(..., default_encoding=...)
  • Add Client(..., default_encoding=...)
  • Documentation.
  • Switch from autodetect as the default to utf-8 as the default.
  • Remove charset_normalizer as a dependency.

Changelog

  • Add support for Client(..., default_encoding=...)
  • Add support for Response(..., default_encoding=...)
  • The default behaviour is now to use "utf-8" instead of character-set autodetection.
  • charset_normalizer is no longer a dependency.
  • To enable character-set autodetection, install chardet or charset_normalizer and use a callable for the default_encoding, like Client(default_encoding=autodetect). See the docs for an example.
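
To make the changelog entries above concrete, here is a minimal usage sketch of the two new parameters; the content passed to Response is illustrative only.

import httpx

# Fix a plain fallback encoding on the client; it applies whenever the
# response's Content-Type header does not carry a charset.
client = httpx.Client(default_encoding="utf-8")

# The same parameter is available when constructing a Response directly,
# which is handy in tests.
response = httpx.Response(200, content="héllo".encode("latin-1"), default_encoding="latin-1")
print(response.encoding)  # "latin-1", since no charset header is present
print(response.text)      # decoded using that fallback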

tomchristie marked this pull request as ready for review April 5, 2022 14:32
tomchristie requested a review from a team April 5, 2022 14:33
adriangb (Member) commented Apr 5, 2022

Overall I am very much 👍 for not having automagic behavior as the default, especially when there is a sane default like "utf-8".

Commenting on #2152 (review):

Given the two options presented, I would actually prefer (1).

I prefer that it doesn't overload the default_encoding type into two different cases, and it's obvious to me from reading the code what the intent is.

I think this particular sort of overloading is quite common and not hard to understand.
I also feel like mixing in strings that are encodings (e.g. "utf-8") with strings that are doing more complex stuff ("autodetect") and are not valid encodings is still "overloading" on a conceptual level, even if the types are technically the same.

I also think it leads to better error handling. With the current setup of "autodetect", an error caused by a missing dependency may not bubble up until the response is read, which might be too late. A webserver would end up returning a 500 to its client.

On the other hand if this was handled via a callback:

from starlette.charsets import autodetect   # error right here

def some_user_code():
    client = Client(default_encoding=autodetect)

(probably some better naming is needed for parameters and modules, this is just an example)

tomchristie (Member, Author) commented

I would actually prefer (1)

My feeling is that, practically, default_encoding="autodetect" just ends up being a neater developer experience. The alternative is slightly more fiddly to set up, is more verbose, reads less cleanly as a type annotation, and means we need an extra module in httpx, with an additional function exposed.

Another tiny wrinkle is that if we want the import to fail fast, then it'd need to be a public module within httpx, and imported as a module. E.g. from httpx.charset_autodetect import charset_autodetect. That's at odds with all the other API in httpx, which lives in private modules and is always accessed as import httpx ... # Use httpx.<...>.

With the current setup of "autodetect" an error because of a missing dependency may not bubble up until the response is read.

Valid point. It'd probably make sense here to attempt importing charset_normalizer at the point of Client(default_encoding="autodetect"), so that any error there is raised immediately.
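
For illustration, an eager check along those lines might look something like the sketch below. This is hypothetical code with a made-up helper name, not httpx's actual internals.

def _check_autodetect_available(default_encoding: str) -> None:
    # Hypothetical helper: fail fast at Client() construction time rather
    # than when response.text is first accessed.
    if default_encoding == "autodetect":
        try:
            import charset_normalizer  # noqa: F401
        except ImportError as exc:
            raise ImportError(
                "default_encoding='autodetect' requires the "
                "'charset_normalizer' package to be installed."
            ) from exc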

adriangb (Member) commented Apr 6, 2022

The point about the extra module/public function is valid. As for fiddliness/verbosity, I think this is a pretty advanced use case, so some verbosity is probably a good thing. One nice thing about it being a function is that developers can easily trace the logic: you can't right-click a string and be taken to where it changes behavior.

In any case, I think the change to make the error show up when you instantiate the client would be a good move. In a web framework, for example, that would move the error from within the request/response cycle (a 500 for their client) to startup time (likely just a failed deploy).

tomchristie mentioned this pull request May 9, 2022
tomchristie (Member, Author) commented

I've drafted up the documentation for an alternative, based on @adriangb's feedback here...

"Given the two options presented, I would actually prefer (1)."

  1. Add support for default_encoding=..., allowing it to be either a string or a callable.
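
As a rough illustration of option (1), the parameter could be typed as a union of an encoding name and a detection callable. The alias and helper below are a sketch for discussion, not the actual httpx source.

import typing

import httpx

# Hypothetical type alias: either an encoding name such as "utf-8", or a
# callable mapping the raw response bytes to an encoding name (or None).
DefaultEncoding = typing.Union[str, typing.Callable[[bytes], typing.Optional[str]]]

def make_client(default_encoding: DefaultEncoding = "utf-8") -> httpx.Client:
    return httpx.Client(default_encoding=default_encoding)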

I've written up below how I think the documentation should look if we want default_encoding to support callables.

Note that the difference here is in the "Using character set auto-detection" section. The rest remains the same.

If we do want it to support callables, then I'd rather we don't include those implementations directly in httpx. That way we fully remove the dependency, which just feels neater to me.

Any reviews based on the documentation would be welcome. I'll draft up a full pull request based on these docs, but this seemed like a good starting point in the meantime...


Character set encodings and auto-detection

When accessing response.text, we need to decode the response bytes into a unicode text representation.

By default httpx will use "charset" information included in the response Content-Type header to determine how the response bytes should be decoded into text.

In cases where no charset information is included on the response, the default behaviour is to assume "utf-8" encoding, which is by far the most widely used text encoding on the internet.

Using the default encoding

To understand this better let's start by looking at the default behaviour for text decoding...

import httpx
# Instantiate a client with the default configuration.
client = httpx.Client()
# Using the client...
response = client.get(...)
print(response.encoding)  # This will either print the charset given in
                          # the Content-Type charset, or else "utf-8".
print(response.text)  # The text will either be decoded with the Content-Type
                      # charset, or using "utf-8".

This is normally absolutely fine. Most servers will respond with a properly formatted Content-Type header, including a charset encoding. And in most cases where no charset encoding is included, UTF-8 is very likely to be used, since it is so widely adopted.

Using an explicit encoding

In some cases we might be making requests to a site where no character set information is being set explicitly by the server, but we know what the encoding is. In this case it's best to set the default encoding explicitly on the client.

import httpx
# Instantiate a client with a Japanese character set as the default encoding.
client = httpx.Client(default_encoding="shift-jis")
# Using the client...
response = client.get(...)
print(response.encoding)  # This will either print the charset given in
                          # the Content-Type charset, or else "shift-jis".
print(response.text)  # The text will either be decoded with the Content-Type
                      # charset, or using "shift-jis".

Using character set auto-detection

In cases where the server is not reliably including character set information, and where we don't know what encoding is being used, we can enable auto-detection to make a best-guess attempt when decoding from bytes to text.

To use auto-detection you need to set the default_encoding argument to a callable instead of a string. This callable should be a function which takes the input bytes as an argument and returns the character set to use for decoding those bytes to text.

There are two widely used Python packages which both handle this functionality:

  • chardet
  • charset_normalizer

Let's take a look at installing autodetection using one of these packages...

$ pip install httpx
$ pip install chardet

Once chardet is installed, we can configure a client to use character-set autodetection.

import httpx
import chardet

def autodetect(content):
    return chardet.detect(content).get("encoding")

# Using a client with character-set autodetection enabled.
client = httpx.Client(default_encoding=autodetect)
response = client.get(...)
print(response.encoding)  # This will either print the charset given in
                          # the Content-Type charset, or else the auto-detected
                          # character set.
print(response.text)
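
The example above uses chardet; for the other package mentioned, a comparable callable might look like the following sketch, which assumes charset_normalizer's from_bytes(...).best() API.

import charset_normalizer
import httpx

def autodetect(content):
    # charset_normalizer returns ranked candidate matches; .best() gives the
    # most plausible one, or None if nothing could be detected.
    match = charset_normalizer.from_bytes(content).best()
    return match.encoding if match is not None else None

client = httpx.Client(default_encoding=autodetect)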

bratao commented May 9, 2022

Just a +1 from me. Debugging httpx in a high-speed crawler, the decoding step (httpx/_decoders.py:76) is a bottleneck. I would like the option to not decode automatically, to speed up my code.

adriangb (Member) commented May 9, 2022

That looks nice Tom, thank you for writing it up this clearly. I think the example makes it easy to understand. And like you say, completely removing the dependency would be very neat.

Is the content parameter to autodetect the entire content body, in bytes? I suppose it only gets called if the user accesses .text, so it wouldn't read the body into memory unless the user requested the entire body anyway, right?

tomchristie (Member, Author) commented

@bratao Gotcha - that's a different consideration, which I've opened a discussion for... #2220

tomchristie (Member, Author) commented

Is the content parameter to autodetect the entire content body, in bytes?

Yes.

so it wouldn't read the body into memory unless the user requested the entire body anyway, right?

Yes.

The one additional consideration here is what to do when the user calls iter_text. In that case we don't have the entire content available before we need to decide which text encoding to use. Currently we don't do autodetection in this case and always fall back to utf-8. We probably just want to stick with that behaviour, at least for now.
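
To illustrate the distinction, here is a small usage sketch (the URL is a placeholder): the autodetect callable can run when the full body is available via .text, while streamed text uses the utf-8 fallback described above.

import chardet
import httpx

def autodetect(content):
    return chardet.detect(content).get("encoding")

client = httpx.Client(default_encoding=autodetect)

# Accessing .text loads the full body, so the autodetect callable can
# inspect all of the response bytes.
response = client.get("https://www.example.org/")
print(response.encoding)

# When streaming, the full body isn't available before decoding starts,
# so iter_text() falls back to "utf-8" rather than autodetecting.
with client.stream("GET", "https://www.example.org/") as response:
    for chunk in response.iter_text():
        print(chunk[:80])
        break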

tomchristie (Member, Author) commented

Now updated based on @adriangb's feedback.

tomchristie merged commit 1c33a28 into master May 23, 2022
tomchristie deleted the add-default-encoding-support branch May 23, 2022 15:27
archlinux-github pushed a commit to archlinux/svntogit-community that referenced this pull request May 23, 2022
…d upstream

This is pushed to [community-testing] due to behavior changes [1] in
this version. More testing needed.

* charset-normalizer is no longer needed since [1]
* rich is optional - used for CLI only
* Fill optdepends per namcap reports
* Remove the CVE fix, which is included in this version
* Workaround test failures from newer pytest-asyncio

[1] encode/httpx#2165


git-svn-id: file:///srv/repos/svn-community/svn@1210123 9fca08f4-af9d-4005-b8df-a31f2cc04f65