Making charset auto-detection strictly opt-in. #2152

Closed
wants to merge 14 commits into from
2 changes: 1 addition & 1 deletion README.md
@@ -128,7 +128,6 @@ The HTTPX project relies on these excellent libraries:
* `httpcore` - The underlying transport implementation for `httpx`.
* `h11` - HTTP/1.1 support.
* `certifi` - SSL certificates.
* `charset_normalizer` - Charset auto-detection.
* `rfc3986` - URL parsing & normalization.
* `idna` - Internationalized domain name support.
* `sniffio` - Async library autodetection.
@@ -140,6 +139,7 @@ As well as these optional installs:
* `rich` - Rich terminal support. *(Optional, with `httpx[cli]`)*
* `click` - Command line client support. *(Optional, with `httpx[cli]`)*
* `brotli` or `brotlicffi` - Decoding for "brotli" compressed responses. *(Optional, with `httpx[brotli]`)*
* `chardet` or `charset_normalizer` - Optional charset auto-detection.

A huge amount of credit is due to `requests` for the API layout that
much of this work follows, as well as to `urllib3` for plenty of design
79 changes: 79 additions & 0 deletions docs/advanced.md
@@ -145,6 +145,85 @@ URL('http://httpbin.org/headers')

For a list of all available client parameters, see the [`Client`](api.md#client) API reference.

---

## Character set encodings and auto-detection

The `httpx` package includes two optionally installable codecs, which provide support for character-set auto-detection.

This can be useful when you need the textual content of responses rather than the raw bytewise content, but the Content-Type header does not include a `charset` value and the character set of the responses is unknown.

There are two commonly used packages for this in the Python ecosystem:

* [chardet](https://chardet.readthedocs.io/)
* [charset_normalizer](https://charset-normalizer.readthedocs.io/)
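
As a quick illustration of what these packages provide, `chardet.detect()` takes raw bytes and returns a best guess together with a confidence score. A minimal sketch, assuming the optional `chardet` package is installed:

```python
import chardet

data = "こんにちは、世界".encode("shift-jis")
print(chardet.detect(data))
# A best-guess result, for example:
# {'encoding': 'SHIFT_JIS', 'confidence': 0.99, 'language': 'Japanese'}
```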

### Using the default encoding

To understand this better, let's start by looking at the default behaviour without character-set auto-detection...

```python
import httpx

# Instantiate a client with the default configuration.
client = httpx.Client()

# Using the client...
response = client.get(...)
print(response.encoding) # This will either print the charset given in
                         # the Content-Type header, or else "utf-8".
print(response.text)     # The text will either be decoded with the charset
                         # from the Content-Type header, or using "utf-8".
```

This is normally fine. Most servers respond with a properly formatted Content-Type header, including a charset encoding. And in most cases where no charset is included, UTF-8 is very likely to be in use, since it is now so widely adopted.
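
For instance, here's a minimal, self-contained sketch of that precedence, constructing a response directly rather than making a network request:

```python
import httpx

# A response whose Content-Type header declares an explicit charset.
response = httpx.Response(
    200,
    headers={"Content-Type": "text/plain; charset=iso-8859-1"},
    content="Café".encode("iso-8859-1"),
)

print(response.encoding)  # "iso-8859-1", taken from the Content-Type header.
print(response.text)      # "Café"
```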

### Using an explicit encoding

In some cases we might be making requests to a site where no character set information is being set explicitly by the server, but we know what the encoding is. In this case it's best to set the default encoding explicitly on the client.

```python
import httpx

# Instantiate a client with a Japanese character set as the default encoding.
client = httpx.Client(default_encoding="shift-jis")

# Using the client...
response = client.get(...)
print(response.encoding) # This will either print the charset given in
                         # the Content-Type header, or else "shift-jis".
print(response.text)     # The text will either be decoded with the charset
                         # from the Content-Type header, or using "shift-jis".
```
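
If you only need this for a single response, you can instead set the encoding directly on the response before accessing `.text`. This is standard `httpx` behaviour, independent of the `default_encoding` option:

```python
import httpx

# A Shift-JIS encoded body, with no charset in the Content-Type header.
response = httpx.Response(200, content="こんにちは".encode("shift-jis"))
response.encoding = "shift-jis"  # Override before accessing `.text`.

print(response.text)  # "こんにちは"
```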

### Using character set auto-detection

In cases where the server is not reliably including character set information, and where we don't know what encoding is being used, we can enable auto-detection to make a best-guess attempt when decoding from bytes to text.

```python
import codecs
import httpx


# Register the custom charset autodetect codecs.
# These codecs are then available as "chardet" and "charset_normalizer".
codecs.register(httpx.charset_autodetect)

# Instantiate a client using "chardet" character set autodetection.
# When no explicit charset information is present on the response,
# the chardet package will be used to make a best-guess attempt.
client = httpx.Client(default_encoding="chardet")

# Using the client with character-set autodetection enabled.
response = client.get(...)
print(response.encoding) # This will either print the charset given in
                         # the Content-Type header, or else "chardet".
print(response.text)     # The text will either be decoded with the charset
                         # from the Content-Type header, or using "chardet"
                         # autodetection.
```
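
Since the codecs are registered through Python's standard `codecs` machinery, they can also be used directly on raw bytes. A short sketch, again assuming the optional `chardet` package is installed:

```python
import codecs
import httpx

codecs.register(httpx.charset_autodetect)

data = "Été à Paris".encode("latin-1")
print(data.decode("chardet"))  # A best-guess decode, here via chardet.
```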

---

## Calling into Python Web Apps

You can configure an `httpx` client to call directly into a Python web application using the WSGI protocol.
2 changes: 1 addition & 1 deletion docs/index.md
@@ -109,7 +109,6 @@ The HTTPX project relies on these excellent libraries:
* `httpcore` - The underlying transport implementation for `httpx`.
* `h11` - HTTP/1.1 support.
* `certifi` - SSL certificates.
* `charset_normalizer` - Charset auto-detection.
* `rfc3986` - URL parsing & normalization.
* `idna` - Internationalized domain name support.
* `sniffio` - Async library autodetection.
@@ -121,6 +120,7 @@ As well as these optional installs:
* `rich` - Rich terminal support. *(Optional, with `httpx[cli]`)*
* `click` - Command line client support. *(Optional, with `httpx[cli]`)*
* `brotli` or `brotlicffi` - Decoding for "brotli" compressed responses. *(Optional, with `httpx[brotli]`)*
* `chardet` or `charset_normalizer` - Optional charset auto-detection.

A huge amount of credit is due to `requests` for the API layout that
much of this work follows, as well as to `urllib3` for plenty of design
2 changes: 1 addition & 1 deletion docs/quickstart.md
@@ -73,7 +73,7 @@ You can inspect what encoding will be used to decode the response.
```

In some cases the response may not contain an explicit encoding, in which case HTTPX
will attempt to automatically determine an encoding to use.
will default to using "utf-8".

```pycon
>>> r.encoding
2 changes: 2 additions & 0 deletions httpx/__init__.py
@@ -2,6 +2,7 @@
from ._api import delete, get, head, options, patch, post, put, request, stream
from ._auth import Auth, BasicAuth, DigestAuth
from ._client import USE_CLIENT_DEFAULT, AsyncClient, Client
from ._codecs import charset_autodetect
from ._config import Limits, Proxy, Timeout, create_ssl_context
from ._content import ByteStream
from ._exceptions import (
@@ -72,6 +73,7 @@ def main() -> None: # type: ignore
"BaseTransport",
"BasicAuth",
"ByteStream",
"charset_autodetect",
"Client",
"CloseError",
"codes",
4 changes: 4 additions & 0 deletions httpx/_client.py
@@ -166,6 +166,7 @@ def __init__(
event_hooks: typing.Mapping[str, typing.List[typing.Callable]] = None,
base_url: URLTypes = "",
trust_env: bool = True,
default_encoding: str = "utf-8",
):
event_hooks = {} if event_hooks is None else event_hooks

@@ -183,6 +184,7 @@
"response": list(event_hooks.get("response", [])),
}
self._trust_env = trust_env
self._default_encoding = default_encoding
self._netrc = NetRCInfo()
self._state = ClientState.UNOPENED

@@ -997,6 +999,7 @@ def _send_single_request(self, request: Request) -> Response:
response.stream = BoundSyncStream(
response.stream, response=response, timer=timer
)
response.default_encoding = self._default_encoding
self.cookies.extract_cookies(response)

status = f"{response.status_code} {response.reason_phrase}"
@@ -1701,6 +1704,7 @@ async def _send_single_request(self, request: Request) -> Response:
response.stream = BoundAsyncStream(
response.stream, response=response, timer=timer
)
response.default_encoding = self._default_encoding
self.cookies.extract_cookies(response)

status = f"{response.status_code} {response.reason_phrase}"
153 changes: 153 additions & 0 deletions httpx/_codecs.py
@@ -0,0 +1,153 @@
"""
The `httpx` package includes two optionally installable codecs,
which provide support for character-set auto-detection.

This can be useful when you need the textual content of responses
rather than the raw bytewise content, but the Content-Type header
does not include a `charset` value and the character set of the
responses is unknown.

There are two commonly used packages for this in the Python ecosystem:

* chardet: https://chardet.readthedocs.io/
* charset_normalizer: https://charset-normalizer.readthedocs.io/

---

## Using the default encoding.

To understand this better, let's start by looking at the default behaviour
without character-set auto-detection...

```python
import httpx

# Instantiate a client with the default configuration.
client = httpx.Client()

# Using the client...
response = client.get(...)
print(response.encoding) # This will either print the charset given in
                         # the Content-Type header, or else "utf-8".
print(response.text)     # The text will either be decoded with the charset
                         # from the Content-Type header, or using "utf-8".
```

This is normally fine. Most servers respond with a properly formatted
Content-Type header, including a charset encoding. And in most cases
where no charset is included, UTF-8 is very likely to be in use,
since it is now so widely adopted.

## Using an explicit encoding.

In some cases we might be making requests to a site where no character
set information is being set explicitly by the server, but we know what
the encoding is. In this case it's best to set the default encoding
explicitly on the client.

```python
import httpx

# Instantiate a client with a Japanese character set as the default encoding.
client = httpx.Client(default_encoding="shift-jis")

# Using the client...
response = client.get(...)
print(response.encoding) # This will either print the charset given in
                         # the Content-Type header, or else "shift-jis".
print(response.text)     # The text will either be decoded with the charset
                         # from the Content-Type header, or using "shift-jis".
```

## Using character set auto-detection.

In cases where the server is not reliably including character set information,
and where we don't know what encoding is being used, we can enable auto-detection
to make a best-guess attempt when decoding from bytes to text.

```python
import codecs
import httpx


# Register the custom charset autodetect codecs.
# These codecs are then available as "chardet" and "charset_normalizer".
codecs.register(httpx.charset_autodetect)

# Instantiate a client using "chardet" character set autodetection.
# When no explicit charset information is present on the response,
# the chardet package will be used to make a best-guess attempt.
client = httpx.Client(default_encoding="chardet")

# Using the client with character-set autodetection enabled.
response = client.get(...)
print(response.encoding) # This will either print the charset given in
                         # the Content-Type header, or else "chardet".
print(response.text)     # The text will either be decoded with the charset
                         # from the Content-Type header, or using "chardet"
                         # autodetection.
```
"""
import codecs
import typing


class ChardetCodec(codecs.Codec):
def encode(self, input, errors="strict"): # type: ignore
raise RuntimeError(
"The 'chardet' codec does not support encoding."
) # pragma: nocover

def decode(self, input, errors="strict"): # type: ignore
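        # The import is deferred so that the optional `chardet` dependency
        # is only required if this codec is actually used.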
import chardet

content: bytes = bytes(input)
info: dict = chardet.detect(content)
encoding: str = info.get("encoding") or "utf-8"
return content.decode(encoding, errors=errors), len(content)


class CharsetNormalizerCodec(codecs.Codec):
def encode(self, input, errors="strict"): # type: ignore
raise RuntimeError(
"The 'charset_normalizer' codec does not support encoding."
) # pragma: nocover

def decode(self, input, errors="strict"): # type: ignore
import charset_normalizer

content: bytes = bytes(input)
info: dict = charset_normalizer.detect(content)
encoding: str = info.get("encoding") or "utf-8"
return content.decode(encoding, errors=errors), len(content)


class NullIncrementalEncoder(codecs.IncrementalEncoder):
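    # Incremental encoding is never supported; this stub exists only so
    # that `codecs.CodecInfo` can be constructed with a complete interface.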
def encode(self, input, final=False): # type: ignore
raise RuntimeError("This codec does not support encoding.") # pragma: nocover


def charset_autodetect(encoding_name: str) -> typing.Optional[codecs.CodecInfo]:
if encoding_name == "chardet":
return codecs.CodecInfo(
name="chardet",
encode=ChardetCodec().encode, # type: ignore
decode=ChardetCodec().decode, # type: ignore
incrementalencoder=NullIncrementalEncoder,
# Note that for iter_text/aiter_text we *always* just fallback
# to using utf-8. Attempting character set autodetection in the
# incremental case can cause large amounts of buffering.
incrementaldecoder=codecs.getincrementaldecoder("utf-8"),
)

elif encoding_name == "charset_normalizer":
return codecs.CodecInfo(
name="charset_normalizer",
encode=CharsetNormalizerCodec().encode, # type: ignore
decode=CharsetNormalizerCodec().decode, # type: ignore
incrementalencoder=NullIncrementalEncoder,
# Note that for iter_text/aiter_text we *always* just fallback
# to using utf-8. Attempting character set autodetection in the
# incremental case can cause large amounts of buffering.
incrementaldecoder=codecs.getincrementaldecoder("utf-8"),
)

return None