Making charset auto-detection strictly opt-in. #2152
…ient(default_encoding='charset_normalizer')
(Self review)
I think the core "make charset auto-detection strictly opt-in" is a good plan.
I also think that adding support for `default_encoding=...` (and `encoding_errors=...`) makes sense.
However, I'm not convinced by having installable codecs for "charset_normalizer" and "chardet". It's a bit magic. Here's a couple of different alternatives, that seem simpler to me...
1. Add support for `default_encoding=...`, and allow it to either be a string, or a callable.

    import charset_normalizer
    import httpx

    # A client with utf-8 as the fallback encoding.
    client = httpx.Client(default_encoding="utf-8")

    # A client with autodetect as the fallback encoding.
    def autodetect(content: bytes) -> str:
        # detect() reports an encoding of None when it cannot guess, so fall back to utf-8.
        return charset_normalizer.detect(content)["encoding"] or "utf-8"

    client = httpx.Client(default_encoding=autodetect)

2. Add support for `default_encoding=...`, and use the string "autodetect" if you want to switch to auto detection.

    # A client with utf-8 as the fallback encoding.
    client = httpx.Client(default_encoding="utf-8")

    # A client with autodetect as the fallback encoding.
    client = httpx.Client(default_encoding="autodetect")
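To make the difference between the two alternatives concrete, here is a minimal, standard-library-only sketch of how the fallback encoding could be resolved under each. `resolve_callable_style` and `resolve_sentinel_style` are hypothetical helpers, not part of httpx, and `guess_encoding` is a crude stand-in for a real detector such as `charset_normalizer`.

```python
from typing import Callable, Union


def guess_encoding(content: bytes) -> str:
    """Hypothetical stand-in for a real detector: utf-8 if it decodes, else latin-1."""
    try:
        content.decode("utf-8")
    except UnicodeDecodeError:
        return "latin-1"
    return "utf-8"


def resolve_callable_style(
    default_encoding: Union[str, Callable[[bytes], str]], content: bytes
) -> str:
    """First alternative: the argument is either an encoding name or a callable."""
    if callable(default_encoding):
        return default_encoding(content)
    return default_encoding


def resolve_sentinel_style(default_encoding: str, content: bytes) -> str:
    """Second alternative: always a string, with "autodetect" treated specially."""
    if default_encoding == "autodetect":
        return guess_encoding(content)
    return default_encoding


body = "café".encode("latin-1")  # b"caf\xe9", which is not valid utf-8
print(resolve_callable_style("utf-8", body))
print(resolve_callable_style(guess_encoding, body))
print(resolve_sentinel_style("autodetect", body))
```

The second helper keeps a single argument type, at the cost of reserving one string value that is not a real codec name.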
My preference out of these is probably (2). I prefer that it doesn't overload the `default_encoding` type into two different cases, and it's obvious to me from reading the code what the intent is.
Closing this off in favour of #2165
From discussion #2083
Remove our auto-charset-guessing behaviour for cases where `response.text` is accessed, but the response has no explicit charset. The behaviour is too fuzzy and inconsistent to support as a default.

Instead we allow explicitly setting the default encoding. The full set of changes:

- Drop `response.apparent_encoding` in favour of an explicit `response.default_encoding`.
- `response.encoding` becomes `str`, not `Optional[str]`.
- Add `Response(default_encoding=...)`.
- Add `Client(default_encoding=...)`.
- Drop the `charset_normalizer` dependency.
- Add `default_encoding="charset_normalizer"` and `default_encoding="chardet"` to allow for auto-detection behaviours.

Optional follow-up that'd also fit neatly into this...

- Add `Response(encoding_errors=...)`.
- Add `Client(encoding_errors=...)`.

For example, in order to switch the default encoding errors from `"replace"` to `"strict"`.
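As an illustration of what an `encoding_errors` setting would control, the sketch below uses plain `bytes.decode()` from the standard library rather than httpx itself. `"replace"` and `"strict"` are Python's standard codec error handlers; how httpx would pass them through is an assumption here.

```python
# b"caf\xe9" is "café" in latin-1, but it is not valid utf-8.
content = b"caf\xe9"

# "replace" substitutes U+FFFD for the undecodable byte instead of failing.
lenient = content.decode("utf-8", errors="replace")
print(repr(lenient))

# "strict" raises UnicodeDecodeError on the same input.
try:
    content.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
print(strict_failed)
```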
Documentation
The `httpx` package includes two optionally installable codecs, which provide support for character-set autodetection.

This can be useful for cases where you need the textual content of responses, rather than the raw bytewise content, if the Content-Type does not include a `charset` value, and the character set of the responses is unknown.

There are two commonly used packages for this in the Python ecosystem: `chardet` and `charset_normalizer`.
Using the default encoding.
To understand this better let's start by looking at the default behaviour without character-set auto-detection...
This is normally absolutely fine. Most servers will respond with a properly formatted Content-Type header, including a charset encoding. And in most cases where no charset encoding is included, UTF-8 is very likely to be used, since it is now so widely adopted.
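The default behaviour described above can be sketched with the standard library alone. This illustrates the fallback rule and is not httpx's implementation: `pick_encoding` is a hypothetical helper, and `email.message.Message` is used only as a convenient Content-Type parser.

```python
import codecs
from email.message import Message
from typing import Optional


def pick_encoding(content_type: Optional[str], default_encoding: str = "utf-8") -> str:
    """Return the charset from a Content-Type header, else the default encoding."""
    if content_type is None:
        return default_encoding
    message = Message()
    message["Content-Type"] = content_type
    charset = message.get_content_charset()  # None when there is no charset parameter
    if charset is None:
        return default_encoding
    try:
        codecs.lookup(charset)  # Reject charset names Python does not know.
    except LookupError:
        return default_encoding
    return charset


print(pick_encoding("text/html; charset=iso-8859-1"))
print(pick_encoding("text/html"))
print(pick_encoding(None))
```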
Using an explicit encoding.
In some cases we might be making requests to a site, where no character set information is being set explicitly by the server, but we know what the encoding is. In this case it's best to set the default encoding explicitly on the client.
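To see why an explicit default matters, consider a body known to be Shift JIS but served without a charset. The sketch below uses only `bytes.decode()` from the standard library. Under this proposal the httpx setting would be `Client(default_encoding=...)`, and the `"shift_jis"` value is just an example.

```python
# A Japanese greeting as a server might send it, encoded as Shift JIS.
original = "こんにちは"
content = original.encode("shift_jis")

# Falling back to utf-8 mangles the text, because these bytes are not valid utf-8.
as_utf8 = content.decode("utf-8", errors="replace")

# Falling back to the encoding we know the site uses recovers the text exactly.
as_shift_jis = content.decode("shift_jis")

print(as_utf8 == original)
print(as_shift_jis == original)
```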
Using character set auto-detection.
In cases where the server is not reliably including character set information, and where we don't know what encoding is being used, we can enable auto-detection to make a best-guess attempt when decoding from bytes to text.
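A best-guess detector can be as simple as trying candidate encodings in order. The sketch below is a deliberately crude, standard-library-only stand-in for what `chardet` or `charset_normalizer` do statistically. `best_guess` and its candidate list are assumptions for illustration.

```python
from typing import Sequence


def best_guess(
    content: bytes,
    candidates: Sequence[str] = ("utf-8", "shift_jis", "latin-1"),
) -> str:
    """Return the first candidate encoding that decodes the content without errors."""
    for encoding in candidates:
        try:
            content.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    # Nothing decoded cleanly; fall back to utf-8 and leave error handling to the caller.
    return "utf-8"


print(best_guess("héllo".encode("utf-8")))
print(best_guess("こんにちは".encode("shift_jis")))
print(best_guess(b"caf\xe9 au lait"))
```

Because `latin-1` accepts every byte sequence, it only makes sense as the last candidate. A real detector weighs how plausible the decoded text is, rather than just whether decoding succeeds.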
Changelog
- The default encoding for responses without an explicit charset is now `utf-8`.
- The `charset_normalizer` package is no longer a required dependency.
- The `response.apparent_encoding` property is no longer supported.
- The `response.encoding` property now returns a string, rather than an optional string.
- `Client(default_encoding=...)`
- `Client(default_encoding="chardet")` for enabling character-set autodetection with the `chardet` package.
- `Client(default_encoding="charset_normalizer")` for enabling character-set autodetection with the `charset_normalizer` package.
- `codecs.register(httpx.charset_autodetect)`