Making charset auto-detection strictly opt-in. #2152

Closed
wants to merge 14 commits into from
2 changes: 1 addition & 1 deletion README.md
@@ -128,7 +128,6 @@ The HTTPX project relies on these excellent libraries:
* `httpcore` - The underlying transport implementation for `httpx`.
* `h11` - HTTP/1.1 support.
* `certifi` - SSL certificates.
* `charset_normalizer` - Charset auto-detection.
* `rfc3986` - URL parsing & normalization.
* `idna` - Internationalized domain name support.
* `sniffio` - Async library autodetection.
@@ -140,6 +139,7 @@ As well as these optional installs:
* `rich` - Rich terminal support. *(Optional, with `httpx[cli]`)*
* `click` - Command line client support. *(Optional, with `httpx[cli]`)*
* `brotli` or `brotlicffi` - Decoding for "brotli" compressed responses. *(Optional, with `httpx[brotli]`)*
* `chardet` or `charset_normalizer` - Optional charset auto-detection.

A huge amount of credit is due to `requests` for the API layout that
much of this work follows, as well as to `urllib3` for plenty of design
79 changes: 79 additions & 0 deletions docs/advanced.md
@@ -145,6 +145,85 @@ URL('http://httpbin.org/headers')

For a list of all available client parameters, see the [`Client`](api.md#client) API reference.

---

## Character set encodings and auto-detection

The `httpx` package includes two optionally installable codecs, which provide support for character-set auto-detection.

This can be useful when you need the textual content of responses rather than the raw bytewise content, but the Content-Type header does not include a `charset` value and the character set of the responses is unknown.

There are two commonly used packages for this in the Python ecosystem:

* [chardet](https://chardet.readthedocs.io/)
* [charset_normalizer](https://charset-normalizer.readthedocs.io/)
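
As a quick illustration of what these packages provide, `chardet.detect()` takes raw bytes and returns a best guess together with a confidence score. A minimal sketch, assuming the optional `chardet` package is installed:

```python
import chardet

data = "こんにちは、世界".encode("shift-jis")
print(chardet.detect(data))
# A best-guess result, for example:
# {'encoding': 'SHIFT_JIS', 'confidence': 0.99, 'language': 'Japanese'}
```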

### Using the default encoding

To understand this better, let's start by looking at the default behaviour without character-set auto-detection...

```python
import httpx

# Instantiate a client with the default configuration.
client = httpx.Client()

# Using the client...
response = client.get(...)
print(response.encoding) # This will either print the charset given in
                         # the Content-Type header, or else "utf-8".
print(response.text)     # The text will either be decoded with the charset
                         # from the Content-Type header, or using "utf-8".
```

This is normally fine. Most servers respond with a properly formatted Content-Type header, including a charset encoding. And in most cases where no charset is included, UTF-8 is very likely to be in use, since it is now so widely adopted.
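
For instance, here's a minimal, self-contained sketch of that precedence, constructing a response directly rather than making a network request:

```python
import httpx

# A response whose Content-Type header declares an explicit charset.
response = httpx.Response(
    200,
    headers={"Content-Type": "text/plain; charset=iso-8859-1"},
    content="Café".encode("iso-8859-1"),
)

print(response.encoding)  # "iso-8859-1", taken from the Content-Type header.
print(response.text)      # "Café"
```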

### Using an explicit encoding

In some cases we might be making requests to a site where no character set information is being set explicitly by the server, but we know what the encoding is. In this case it's best to set the default encoding explicitly on the client.

```python
import httpx

# Instantiate a client with a Japanese character set as the default encoding.
client = httpx.Client(default_encoding="shift-jis")

# Using the client...
response = client.get(...)
print(response.encoding) # This will either print the charset given in
                         # the Content-Type header, or else "shift-jis".
print(response.text)     # The text will either be decoded with the charset
                         # from the Content-Type header, or using "shift-jis".
```
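
If you only need this for a single response, you can instead set the encoding directly on the response before accessing `.text`. This is standard `httpx` behaviour, independent of the `default_encoding` option:

```python
import httpx

# A Shift-JIS encoded body, with no charset in the Content-Type header.
response = httpx.Response(200, content="こんにちは".encode("shift-jis"))
response.encoding = "shift-jis"  # Override before accessing `.text`.

print(response.text)  # "こんにちは"
```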

### Using character set auto-detection

In cases where the server is not reliably including character set information, and where we don't know what encoding is being used, we can enable auto-detection to make a best-guess attempt when decoding from bytes to text.

```python
import codecs
import httpx


# Register the custom charset autodetect codecs.
# These codecs are then available as "chardet" and "charset_normalizer".
codecs.register(httpx.charset_autodetect)

# Instantiate a client using "chardet" character set autodetection.
# When no explicit charset information is present on the response,
# the chardet package will be used to make a best-guess attempt.
client = httpx.Client(default_encoding="chardet")

# Using the client with character-set autodetection enabled.
response = client.get(...)
print(response.encoding) # This will either print the charset given in
                         # the Content-Type header, or else "chardet".
print(response.text)     # The text will either be decoded with the charset
                         # from the Content-Type header, or using "chardet"
                         # autodetection.
```
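
Since the codecs are registered through Python's standard `codecs` machinery, they can also be used directly on raw bytes. A short sketch, again assuming the optional `chardet` package is installed:

```python
import codecs
import httpx

codecs.register(httpx.charset_autodetect)

data = "Été à Paris".encode("latin-1")
print(data.decode("chardet"))  # A best-guess decode, here via chardet.
```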

---

## Calling into Python Web Apps

You can configure an `httpx` client to call directly into a Python web application using the WSGI protocol.
2 changes: 1 addition & 1 deletion docs/index.md
@@ -109,7 +109,6 @@ The HTTPX project relies on these excellent libraries:
* `httpcore` - The underlying transport implementation for `httpx`.
* `h11` - HTTP/1.1 support.
* `certifi` - SSL certificates.
* `charset_normalizer` - Charset auto-detection.
* `rfc3986` - URL parsing & normalization.
* `idna` - Internationalized domain name support.
* `sniffio` - Async library autodetection.
@@ -121,6 +120,7 @@ As well as these optional installs:
* `rich` - Rich terminal support. *(Optional, with `httpx[cli]`)*
* `click` - Command line client support. *(Optional, with `httpx[cli]`)*
* `brotli` or `brotlicffi` - Decoding for "brotli" compressed responses. *(Optional, with `httpx[brotli]`)*
* `chardet` or `charset_normalizer` - Optional charset auto-detection.

A huge amount of credit is due to `requests` for the API layout that
much of this work follows, as well as to `urllib3` for plenty of design
2 changes: 1 addition & 1 deletion docs/quickstart.md
@@ -73,7 +73,7 @@ You can inspect what encoding will be used to decode the response.
```

In some cases the response may not contain an explicit encoding, in which case HTTPX
will attempt to automatically determine an encoding to use.
will default to using "utf-8".

```pycon
>>> r.encoding
2 changes: 2 additions & 0 deletions httpx/__init__.py
@@ -2,6 +2,7 @@
from ._api import delete, get, head, options, patch, post, put, request, stream
from ._auth import Auth, BasicAuth, DigestAuth
from ._client import USE_CLIENT_DEFAULT, AsyncClient, Client
from ._codecs import charset_autodetect
from ._config import Limits, Proxy, Timeout, create_ssl_context
from ._content import ByteStream
from ._exceptions import (
@@ -72,6 +73,7 @@ def main() -> None: # type: ignore
"BaseTransport",
"BasicAuth",
"ByteStream",
"charset_autodetect",
"Client",
"CloseError",
"codes",
4 changes: 4 additions & 0 deletions httpx/_client.py
@@ -166,6 +166,7 @@ def __init__(
event_hooks: typing.Mapping[str, typing.List[typing.Callable]] = None,
base_url: URLTypes = "",
trust_env: bool = True,
default_encoding: str = "utf-8",
):
event_hooks = {} if event_hooks is None else event_hooks

@@ -183,6 +184,7 @@
"response": list(event_hooks.get("response", [])),
}
self._trust_env = trust_env
self._default_encoding = default_encoding
self._netrc = NetRCInfo()
self._state = ClientState.UNOPENED

@@ -997,6 +999,7 @@ def _send_single_request(self, request: Request) -> Response:
response.stream = BoundSyncStream(
response.stream, response=response, timer=timer
)
response.default_encoding = self._default_encoding
self.cookies.extract_cookies(response)

status = f"{response.status_code} {response.reason_phrase}"
@@ -1701,6 +1704,7 @@ async def _send_single_request(self, request: Request) -> Response:
response.stream = BoundAsyncStream(
response.stream, response=response, timer=timer
)
response.default_encoding = self._default_encoding
self.cookies.extract_cookies(response)

status = f"{response.status_code} {response.reason_phrase}"
153 changes: 153 additions & 0 deletions httpx/_codecs.py
@@ -0,0 +1,153 @@
"""
The `httpx` package includes two optionally installable codecs,
which provide support for character-set auto-detection.

This can be useful when you need the textual content of responses
rather than the raw bytewise content, but the Content-Type header
does not include a `charset` value and the character set of the
responses is unknown.

There are two commonly used packages for this in the Python ecosystem:

* chardet: https://chardet.readthedocs.io/
* charset_normalizer: https://charset-normalizer.readthedocs.io/

---

## Using the default encoding.

To understand this better, let's start by looking at the default behaviour
without character-set auto-detection...

```python
import httpx

# Instantiate a client with the default configuration.
client = httpx.Client()

# Using the client...
response = client.get(...)
print(response.encoding) # This will either print the charset given in
                         # the Content-Type header, or else "utf-8".
print(response.text)     # The text will either be decoded with the charset
                         # from the Content-Type header, or using "utf-8".
```

This is normally fine. Most servers respond with a properly formatted
Content-Type header, including a charset encoding. And in most cases
where no charset is included, UTF-8 is very likely to be in use,
since it is now so widely adopted.

## Using an explicit encoding.

In some cases we might be making requests to a site where no character
set information is being set explicitly by the server, but we know what
the encoding is. In this case it's best to set the default encoding
explicitly on the client.

```python
import httpx

# Instantiate a client with a Japanese character set as the default encoding.
client = httpx.Client(default_encoding="shift-jis")

# Using the client...
response = client.get(...)
print(response.encoding) # This will either print the charset given in
                         # the Content-Type header, or else "shift-jis".
print(response.text)     # The text will either be decoded with the charset
                         # from the Content-Type header, or using "shift-jis".
```

## Using character set auto-detection.

In cases where the server is not reliably including character set information,
and where we don't know what encoding is being used, we can enable auto-detection
to make a best-guess attempt when decoding from bytes to text.

```python
import codecs
import httpx


# Register the custom charset autodetect codecs.
# These codecs are then available as "chardet" and "charset_normalizer".
codecs.register(httpx.charset_autodetect)

# Instantiate a client using "chardet" character set autodetection.
# When no explicit charset information is present on the response,
# the chardet package will be used to make a best-guess attempt.
client = httpx.Client(default_encoding="chardet")

# Using the client with character-set autodetection enabled.
response = client.get(...)
print(response.encoding) # This will either print the charset given in
                         # the Content-Type header, or else "chardet".
print(response.text)     # The text will either be decoded with the charset
                         # from the Content-Type header, or using "chardet"
                         # autodetection.
```
"""
import codecs
import typing


class ChardetCodec(codecs.Codec):
def encode(self, input, errors="strict"): # type: ignore
raise RuntimeError(
"The 'chardet' codec does not support encoding."
) # pragma: nocover

def decode(self, input, errors="strict"): # type: ignore
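        # The import is deferred so that the optional `chardet` dependency
        # is only required if this codec is actually used.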
import chardet

content: bytes = bytes(input)
info: dict = chardet.detect(content)
encoding: str = info.get("encoding") or "utf-8"
return content.decode(encoding, errors=errors), len(content)


class CharsetNormalizerCodec(codecs.Codec):
def encode(self, input, errors="strict"): # type: ignore
raise RuntimeError(
"The 'charset_normalizer' codec does not support encoding."
) # pragma: nocover

def decode(self, input, errors="strict"): # type: ignore
import charset_normalizer

content: bytes = bytes(input)
info: dict = charset_normalizer.detect(content)
encoding: str = info.get("encoding") or "utf-8"
return content.decode(encoding, errors=errors), len(content)


class NullIncrementalEncoder(codecs.IncrementalEncoder):
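    # Incremental encoding is never supported; this stub exists only so
    # that `codecs.CodecInfo` can be constructed with a complete interface.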
def encode(self, input, final=False): # type: ignore
raise RuntimeError("This codec does not support encoding.") # pragma: nocover


def charset_autodetect(encoding_name: str) -> typing.Optional[codecs.CodecInfo]:
if encoding_name == "chardet":
return codecs.CodecInfo(
name="chardet",
encode=ChardetCodec().encode, # type: ignore
decode=ChardetCodec().decode, # type: ignore
incrementalencoder=NullIncrementalEncoder,
# Note that for iter_text/aiter_text we *always* just fallback
# to using utf-8. Attempting character set autodetection in the
# incremental case can cause large amounts of buffering.
incrementaldecoder=codecs.getincrementaldecoder("utf-8"),
)

elif encoding_name == "charset_normalizer":
return codecs.CodecInfo(
name="charset_normalizer",
encode=CharsetNormalizerCodec().encode, # type: ignore
decode=CharsetNormalizerCodec().decode, # type: ignore
incrementalencoder=NullIncrementalEncoder,
# Note that for iter_text/aiter_text we *always* just fallback
# to using utf-8. Attempting character set autodetection in the
# incremental case can cause large amounts of buffering.
incrementaldecoder=codecs.getincrementaldecoder("utf-8"),
)

return None