Make charset auto-detection optional. (#2165)
* Add Response(..., default_encoding=...)

* Add tests for Response(..., default_encoding=...)

* Add Client(..., default_encoding=...)

* Switch default encoding to 'utf-8' instead of 'autodetect'

* Make charset_normalizer an optional dependency, not a mandatory one.

* Documentation

* Use callable for default_encoding

* Update tests for new charset autodetection API

* Update docs for new charset autodetection API

* Update requirements

* Drop charset_normalizer from requirements
tomchristie committed May 23, 2022
1 parent 940d61b commit 1c33a28
Showing 11 changed files with 245 additions and 54 deletions.
1 change: 0 additions & 1 deletion README.md
@@ -128,7 +128,6 @@ The HTTPX project relies on these excellent libraries:
 * `httpcore` - The underlying transport implementation for `httpx`.
 * `h11` - HTTP/1.1 support.
 * `certifi` - SSL certificates.
-* `charset_normalizer` - Charset auto-detection.
 * `rfc3986` - URL parsing & normalization.
 * `idna` - Internationalized domain name support.
 * `sniffio` - Async library autodetection.
1 change: 0 additions & 1 deletion README_chinese.md
@@ -129,7 +129,6 @@ HTTPX项目依赖于这些优秀的库:
 * `h11` - HTTP/1.1 support.
 * `h2` - HTTP/2 support. *(Optional, with `httpx[http2]`)*
 * `certifi` - SSL certificates.
-* `charset_normalizer` - Charset auto-detection.
 * `rfc3986` - URL parsing & normalization.
 * `idna` - Internationalized domain name support.
 * `sniffio` - Async library autodetection.
82 changes: 82 additions & 0 deletions docs/advanced.md
@@ -145,6 +145,88 @@ URL('http://httpbin.org/headers')

For a list of all available client parameters, see the [`Client`](api.md#client) API reference.

---

## Character set encodings and auto-detection

When accessing `response.text`, we need to decode the response bytes into a Unicode text representation.

By default `httpx` will use `"charset"` information included in the response `Content-Type` header to determine how the response bytes should be decoded into text.

In cases where no charset information is included on the response, the default behaviour is to assume "utf-8" encoding, which is by far the most widely used text encoding on the internet.
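
The lookup order described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not `httpx`'s actual implementation: the helper names `charset_from_content_type`, `is_known_encoding` and `resolve_encoding` are hypothetical, and the header parsing uses the standard library's `email.message.Message` purely for convenience.

```python
import codecs
from email.message import Message
from typing import Callable, Optional, Union

# Either a fixed encoding name, or a callable that inspects the bytes.
DefaultEncoding = Union[str, Callable[[bytes], Optional[str]]]


def charset_from_content_type(content_type: Optional[str]) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, if any."""
    if content_type is None:
        return None
    message = Message()
    message["Content-Type"] = content_type
    return message.get_content_charset()


def is_known_encoding(name: str) -> bool:
    """Return True if Python has a codec registered under this name."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


def resolve_encoding(
    content_type: Optional[str],
    content: bytes,
    default_encoding: DefaultEncoding = "utf-8",
) -> str:
    """Pick an encoding: header charset first, then the default, then UTF-8."""
    encoding = charset_from_content_type(content_type)
    if encoding is None or not is_known_encoding(encoding):
        if isinstance(default_encoding, str):
            encoding = default_encoding
        else:
            encoding = default_encoding(content)
    return encoding or "utf-8"
```

Note that an unrecognised charset in the header is treated the same way as a missing one, so the default still applies in that case.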

### Using the default encoding

To understand this better let's start by looking at the default behaviour for text decoding...

```python
import httpx

# Instantiate a client with the default configuration.
client = httpx.Client()

# Using the client... (placeholder URL; substitute your own)
response = client.get("https://www.example.com/")
# Prints the charset given in the Content-Type header, or else "utf-8".
print(response.encoding)
# The text is decoded with the Content-Type charset, or else as "utf-8".
print(response.text)
```

This is normally fine. Most servers respond with a properly formatted `Content-Type` header that includes a charset. Where no charset is included, UTF-8 is very likely to be the encoding in use, since it is so widely adopted.

### Using an explicit encoding

In some cases we might be making requests to a site where no character set information is being set explicitly by the server, but we know what the encoding is. In this case it's best to set the default encoding explicitly on the client.

```python
import httpx

# Instantiate a client with a Japanese character set as the default encoding.
client = httpx.Client(default_encoding="shift-jis")

# Using the client... (placeholder URL; substitute your own)
response = client.get("https://www.example.com/")
# Prints the charset given in the Content-Type header, or else "shift-jis".
print(response.encoding)
# The text is decoded with the Content-Type charset, or else as "shift-jis".
print(response.text)
```
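
To see why the explicit default matters, here is a small standard-library sketch of what happens when Shift JIS bytes are decoded with the wrong encoding. The sample string is made up for illustration, and the use of `errors="replace"` is an assumption about how undecodable bytes are handled, not something taken from the text above.

```python
# Hypothetical response body: Japanese text encoded as Shift JIS,
# served without any charset in the Content-Type header.
raw = "こんにちは".encode("shift-jis")

# Decoding with the UTF-8 default produces replacement characters.
wrong = raw.decode("utf-8", errors="replace")

# Decoding with the encoding we know the server uses recovers the text.
right = raw.decode("shift-jis")

print(repr(wrong))
print(repr(right))
```

With `default_encoding="shift-jis"` set on the client, `response.text` would correspond to the second case.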

### Using character set auto-detection

In cases where the server is not reliably including character set information, and where we don't know what encoding is being used, we can enable auto-detection to make a best-guess attempt when decoding from bytes to text.

To use auto-detection you need to set the `default_encoding` argument to a callable instead of a string. This callable should be a function which takes the input bytes as an argument and returns the character set to use for decoding those bytes to text.
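
As a rough illustration of the callable's shape, here is a hand-written detector that needs no extra packages. It is only a sketch: the function name `sniff_encoding` and its heuristics (check for a byte-order mark, try strict UTF-8, fall back to ISO-8859-1) are assumptions chosen for brevity, and they are far cruder than a real detection library.

```python
import codecs


def sniff_encoding(content: bytes) -> str:
    """Guess an encoding from the raw bytes using a few cheap checks."""
    # A byte-order mark is the strongest hint available.
    if content.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if content.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # Valid UTF-8 is unlikely to occur by accident in other encodings.
    try:
        content.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 can decode any byte sequence, so it never fails.
        return "iso-8859-1"
    return "utf-8"
```

Any function with this signature can be passed as `default_encoding`, for example `httpx.Client(default_encoding=sniff_encoding)`.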

There are two widely used Python packages which both handle this functionality:

* [`chardet`](https://chardet.readthedocs.io/) - This is a well established package, and is a port of [the auto-detection code in Mozilla](https://www-archive.mozilla.org/projects/intl/chardet.html).
* [`charset-normalizer`](https://charset-normalizer.readthedocs.io/) - A newer package, inspired by `chardet`, with a different approach.

Let's take a look at enabling auto-detection using one of these packages...

```shell
$ pip install httpx
$ pip install chardet
```

Once `chardet` is installed, we can configure a client to use character-set autodetection.

```python
import httpx
import chardet


def autodetect(content):
    # chardet.detect() returns a dict; its "encoding" entry may be None
    # when no guess can be made.
    return chardet.detect(content).get("encoding")


# Using a client with character-set autodetection enabled.
client = httpx.Client(default_encoding=autodetect)
response = client.get("https://www.example.com/")  # placeholder URL
# Prints the charset given in the Content-Type header, or else
# the auto-detected character set.
print(response.encoding)
print(response.text)
```
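
The same pattern should work with `charset-normalizer`. The sketch below is an assumption-laden variant rather than documented usage: it reuses the `from_bytes(...).best()` call and the 32-byte cutoff from the `apparent_encoding` property that this commit removes, and it guards the import so that the function simply returns `None` when the package is not installed. The `encoding` property in this commit falls back to `"utf-8"` when the callable returns `None`.

```python
try:
    import charset_normalizer
except ImportError:  # the package is optional
    charset_normalizer = None


def autodetect(content):
    """Return a guessed encoding name, or None if no guess is available."""
    if charset_normalizer is None:
        return None
    if len(content) < 32:
        # Very short inputs give the detector too little to work with.
        return None
    match = charset_normalizer.from_bytes(content).best()
    return None if match is None else match.encoding
```

It is passed to the client in the same way as the `chardet` version: `httpx.Client(default_encoding=autodetect)`.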

---

## Calling into Python Web Apps

You can configure an `httpx` client to call directly into a Python web application using the WSGI protocol.
1 change: 0 additions & 1 deletion docs/index.md
@@ -109,7 +109,6 @@ The HTTPX project relies on these excellent libraries:
 * `httpcore` - The underlying transport implementation for `httpx`.
 * `h11` - HTTP/1.1 support.
 * `certifi` - SSL certificates.
-* `charset_normalizer` - Charset auto-detection.
 * `rfc3986` - URL parsing & normalization.
 * `idna` - Internationalized domain name support.
 * `sniffio` - Async library autodetection.
14 changes: 14 additions & 0 deletions httpx/_client.py
@@ -168,6 +168,7 @@ def __init__(
         ] = None,
         base_url: URLTypes = "",
         trust_env: bool = True,
+        default_encoding: typing.Union[str, typing.Callable[[bytes], str]] = "utf-8",
     ):
         event_hooks = {} if event_hooks is None else event_hooks

@@ -185,6 +186,7 @@ def __init__(
             "response": list(event_hooks.get("response", [])),
         }
         self._trust_env = trust_env
+        self._default_encoding = default_encoding
         self._netrc = NetRCInfo()
         self._state = ClientState.UNOPENED

@@ -611,6 +613,9 @@ class Client(BaseClient):
     rather than sending actual network requests.
     * **trust_env** - *(optional)* Enables or disables usage of environment
     variables for configuration.
+    * **default_encoding** - *(optional)* The default encoding to use for decoding
+    response text, if no charset information is included in a response Content-Type
+    header. Set to a callable for automatic character set detection. Default: "utf-8".
     """

     def __init__(
@@ -637,6 +642,7 @@ def __init__(
         transport: typing.Optional[BaseTransport] = None,
         app: typing.Optional[typing.Callable] = None,
         trust_env: bool = True,
+        default_encoding: typing.Union[str, typing.Callable[[bytes], str]] = "utf-8",
     ):
         super().__init__(
             auth=auth,
@@ -649,6 +655,7 @@ def __init__(
             event_hooks=event_hooks,
             base_url=base_url,
             trust_env=trust_env,
+            default_encoding=default_encoding,
         )

         if http2:
@@ -1002,6 +1009,7 @@ def _send_single_request(self, request: Request) -> Response:
             response.stream, response=response, timer=timer
         )
         self.cookies.extract_cookies(response)
+        response.default_encoding = self._default_encoding

         status = f"{response.status_code} {response.reason_phrase}"
         response_line = f"{response.http_version} {status}"
@@ -1326,6 +1334,9 @@ class AsyncClient(BaseClient):
     rather than sending actual network requests.
     * **trust_env** - *(optional)* Enables or disables usage of environment
     variables for configuration.
+    * **default_encoding** - *(optional)* The default encoding to use for decoding
+    response text, if no charset information is included in a response Content-Type
+    header. Set to a callable for automatic character set detection. Default: "utf-8".
     """

     def __init__(
@@ -1352,6 +1363,7 @@ def __init__(
         transport: typing.Optional[AsyncBaseTransport] = None,
         app: typing.Optional[typing.Callable] = None,
         trust_env: bool = True,
+        default_encoding: typing.Union[str, typing.Callable[[bytes], str]] = "utf-8",
     ):
         super().__init__(
             auth=auth,
@@ -1364,6 +1376,7 @@ def __init__(
             event_hooks=event_hooks,
             base_url=base_url,
             trust_env=trust_env,
+            default_encoding=default_encoding,
         )

         if http2:
@@ -1708,6 +1721,7 @@ async def _send_single_request(self, request: Request) -> Response:
             response.stream, response=response, timer=timer
         )
         self.cookies.extract_cookies(response)
+        response.default_encoding = self._default_encoding

         status = f"{response.status_code} {response.reason_phrase}"
         response_line = f"{response.http_version} {status}"
30 changes: 11 additions & 19 deletions httpx/_models.py
@@ -7,8 +7,6 @@
 from collections.abc import MutableMapping
 from http.cookiejar import Cookie, CookieJar

-import charset_normalizer
-
 from ._content import ByteStream, UnattachedStream, encode_request, encode_response
 from ._decoders import (
     SUPPORTED_DECODERS,
@@ -445,6 +443,7 @@ def __init__(
         request: typing.Optional[Request] = None,
         extensions: typing.Optional[dict] = None,
         history: typing.Optional[typing.List["Response"]] = None,
+        default_encoding: typing.Union[str, typing.Callable[[bytes], str]] = "utf-8",
     ):
         self.status_code = status_code
         self.headers = Headers(headers)
@@ -461,6 +460,8 @@ def __init__(
         self.is_closed = False
         self.is_stream_consumed = False

+        self.default_encoding = default_encoding
+
         if stream is None:
             headers, stream = encode_response(content, text, html, json)
             self._prepare(headers)
@@ -569,14 +570,18 @@ def encoding(self) -> typing.Optional[str]:
         * `.encoding = <>` has been set explicitly.
         * The encoding as specified by the charset parameter in the Content-Type header.
-        * The encoding as determined by `charset_normalizer`.
-        * UTF-8.
+        * The encoding as determined by `default_encoding`, which may either be
+          a string like "utf-8" indicating the encoding to use, or may be a callable
+          which enables charset autodetection.
         """
         if not hasattr(self, "_encoding"):
             encoding = self.charset_encoding
             if encoding is None or not is_known_encoding(encoding):
-                encoding = self.apparent_encoding
-            self._encoding = encoding
+                if isinstance(self.default_encoding, str):
+                    encoding = self.default_encoding
+                elif hasattr(self, "_content"):
+                    encoding = self.default_encoding(self._content)
+            self._encoding = encoding or "utf-8"
         return self._encoding

     @encoding.setter
@@ -598,19 +603,6 @@ def charset_encoding(self) -> typing.Optional[str]:

         return params["charset"].strip("'\"")

-    @property
-    def apparent_encoding(self) -> typing.Optional[str]:
-        """
-        Return the encoding, as determined by `charset_normalizer`.
-        """
-        content = getattr(self, "_content", b"")
-        if len(content) < 32:
-            # charset_normalizer will issue warnings if we run it with
-            # fewer bytes than this cutoff.
-            return None
-        match = charset_normalizer.from_bytes(self.content).best()
-        return None if match is None else match.encoding
-
     def _get_content_decoder(self) -> ContentDecoder:
         """
         Returns a decoder instance which can be used to decode the raw byte
5 changes: 4 additions & 1 deletion requirements.txt
@@ -4,7 +4,10 @@
 # Reference: https://github.com/encode/httpx/pull/1721#discussion_r661241588
 -e .[brotli,cli,http2,socks]

-charset-normalizer==2.0.6
+# Optional charset auto-detection
+# Used in our test cases
+chardet==4.0.0
+types-chardet==4.0.4

 # Documentation
 mkdocs==1.3.0
1 change: 0 additions & 1 deletion setup.py
@@ -57,7 +57,6 @@ def get_packages(package):
     zip_safe=False,
     install_requires=[
         "certifi",
-        "charset_normalizer",
         "sniffio",
         "rfc3986[idna2008]>=1.3,<2",
         "httpcore>=0.15.0,<0.16.0",
62 changes: 61 additions & 1 deletion tests/client/test_client.py
@@ -1,11 +1,16 @@
 import typing
 from datetime import timedelta

+import chardet
 import pytest

 import httpx


+def autodetect(content):
+    return chardet.detect(content).get("encoding")
+
+
 def test_get(server):
     url = server.url
     with httpx.Client(http2=True) as http:
@@ -15,7 +20,7 @@ def test_get(server):
     assert response.content == b"Hello, world!"
     assert response.text == "Hello, world!"
     assert response.http_version == "HTTP/1.1"
-    assert response.encoding is None
+    assert response.encoding == "utf-8"
     assert response.request.url == url
     assert response.headers
     assert response.is_redirect is False
@@ -398,3 +403,58 @@ def test_server_extensions(server):
         response = client.get(url)
     assert response.status_code == 200
     assert response.extensions["http_version"] == b"HTTP/1.1"
+
+
+def test_client_decode_text_using_autodetect():
+    # Ensure that a 'default_encoding=autodetect' on the client allows for
+    # encoding autodetection to be used when no "Content-Type: text/plain; charset=..."
+    # info is present.
+    #
+    # Here we have some French text encoded with ISO-8859-1, rather than UTF-8.
+    text = (
+        "Non-seulement Despréaux ne se trompait pas, mais de tous les écrivains "
+        "que la France a produits, sans excepter Voltaire lui-même, imprégné de "
+        "l'esprit anglais par son séjour à Londres, c'est incontestablement "
+        "Molière ou Poquelin qui reproduit avec l'exactitude la plus vive et la "
+        "plus complète le fond du génie français."
+    )
+
+    def cp1252_but_no_content_type(request):
+        content = text.encode("ISO-8859-1")
+        return httpx.Response(200, content=content)
+
+    transport = httpx.MockTransport(cp1252_but_no_content_type)
+    with httpx.Client(transport=transport, default_encoding=autodetect) as client:
+        response = client.get("http://www.example.com")
+
+        assert response.status_code == 200
+        assert response.reason_phrase == "OK"
+        assert response.encoding == "ISO-8859-1"
+        assert response.text == text
+
+
+def test_client_decode_text_using_explicit_encoding():
+    # Ensure that a 'default_encoding="..."' on the client is used for text decoding
+    # when no "Content-Type: text/plain; charset=..." info is present.
+    #
+    # Here we have some French text encoded with ISO-8859-1, rather than UTF-8.
+    text = (
+        "Non-seulement Despréaux ne se trompait pas, mais de tous les écrivains "
+        "que la France a produits, sans excepter Voltaire lui-même, imprégné de "
+        "l'esprit anglais par son séjour à Londres, c'est incontestablement "
+        "Molière ou Poquelin qui reproduit avec l'exactitude la plus vive et la "
+        "plus complète le fond du génie français."
+    )
+
+    def cp1252_but_no_content_type(request):
+        content = text.encode("ISO-8859-1")
+        return httpx.Response(200, content=content)
+
+    transport = httpx.MockTransport(cp1252_but_no_content_type)
+    # Pass the encoding as a string, so that this test exercises the explicit
+    # default rather than the autodetect callable.
+    with httpx.Client(transport=transport, default_encoding="ISO-8859-1") as client:
+        response = client.get("http://www.example.com")
+
+        assert response.status_code == 200
+        assert response.reason_phrase == "OK"
+        assert response.encoding == "ISO-8859-1"
+        assert response.text == text
