urllib3 with version 2.0.* treats control character differently #3053

XiangFrank · 2023-05-30T22:35:33Z

Subject

The problem is after version 2.0.0, if you have control characters in your request body, urllib3 will count/treat them differently than previous version.

For control characters like '\x08', '\x01', urllib3 after version 2.0.0 and before will count them into different bytes, which might cause the Content-Length mismatch in some cases.

Environment

The environment I have is Python3.10, and I found this bug exists in all 2.0.* versions.

Steps to Reproduce

This bug is easy to be reproduced. Here is a sample code I used:

import urllib3

poolmanager = urllib3.PoolManager(num_pools=10, maxsize=3, block=False)
conn = poolmanager.connection_from_url("https://httpbin.org")
resp = conn.urlopen(method="POST", url="/post", body='test\x80\x80\x01\x01\x81',  headers={
                                 'User-Agent': 'python-api/2', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'})
print(resp._body)

Run this code under different version of urllib3 (in my case, I used ver 1.26.6 and 2.0.2 for testing) you will find the 'Content-Length' of the response is totally different.

Expected Behavior

The response body of the request under two different versions of urllib3 should be the same.

Actual Behavior

The 'Content-Length' for ver 1.26.6 is 9 bytes, but for ver 2.0.2, it is 12 bytes.

sethmlarson · 2023-05-31T03:02:56Z

You're right that something changed, I believe this is caused by us favoring utf-8 encoding of strings in v2.0 instead of latin-1 in v1.26.x.

Ousret · 2023-05-31T03:17:21Z

About that,
There is a miss at

urllib3/src/urllib3/util/request.py

Line 230 in b63cc4c

datablock = datablock.encode("iso-8859-1")

StringIO objects still are Latin-1 encoded.
We want to consider adding a proper Content-Type when someone passes str as a body. Or the opposite, someone passes str but precise the Content-Type with specific valid encoding.

pquentin · 2023-05-31T13:30:10Z

Indeed, \x80 and \x81 are outside of the 7-bit range, so they get represented using two bytes with UTF-8:

>>> len("test\x80\x80\x01\x01\x81".encode("utf-8"))
12
>>> len("test\x80\x80\x01\x01\x81".encode("latin1"))
9

1.26.x vs 2.0

urllib3 1.26.x relies on the standard library to encode str bodies, see https://github.com/python/cpython/blob/03ad6624c2b4a30cccf6d5b7e2b9999e104444ae/Lib/http/client.py#L1362-L1365. It uses the iso-8859-1 encoding.

urllib3 2.0 does its own encoding:

urllib3/src/urllib3/util/request.py

Lines 214 to 217 in b63cc4c

    
           # Bytes or strings become bytes 
        
           elif isinstance(body, (str, bytes)): 
        
               chunks = (to_bytes(body),) 
        
               content_length = len(chunks[0])

Since no encoding is provided to to_bytes, it will call str.encode(), and the default encoding there is UTF-8: https://docs.python.org/3/library/stdtypes.html#str.encode. (This seems quite implicit, it would be clearer to call to_bytes with the utf-8 encoding).

RFC 2616 vs RFC 91110

The standard library cites RFC 2616 Section 3.7.1 as a justification for using ISO-8859-1. Indeed, it states:

The "charset" parameter is used with some media types to define the character set (section 3.4) of the data. When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value. See section 3.4.1 for compatibility problems.

RFC 2616 is now obsolete, I believe the relevant RFC is RFC 9110. Per https://www.rfc-editor.org/rfc/rfc9110#name-content-type, here is the current recommendation:

The "Content-Type" header field indicates the media type of the associated representation: either the representation enclosed in the message content or the selected representation, as determined by the message semantics. The indicated media type defines both the data format and how that data is intended to be processed by a recipient, within the scope of the received message semantics, after any content codings indicated by Content-Encoding are decoded.
Content-Type = media-type
Media types are defined in Section 8.3.1. An example of the field is
Content-Type: text/html; charset=ISO-8859-4
A sender that generates a message containing content SHOULD generate a Content-Type header field in that message unless the intended media type of the enclosed representation is unknown to the sender. If a Content-Type header field is not present, the recipient MAY either assume a media type of "application/octet-stream" ([RFC2046], Section 4.5.1) or examine the data to determine its type.

The media type is indeed unknown to us so we should not send a Content-Type. And nothing specifies the default encoding anymore, so I suppose using UTF-8 is OK, and it's up to the recipient to realize that I think? And of course UTF-8 is just the better encoding here. It is well designed, can encode all of Unicode and is the dominant encoding on the web.

But that's still a breaking change. If we want to keep it, we should mention it in the release notes. (And be consistent, as mentioned in #3053 (comment) we sometimes continue using ISO-8859-1.)

@sethmlarson @Ousret Thoughts?

Ousret · 2023-06-01T09:04:02Z

Hello @pquentin

Even by RFC 2616, we are in the wrong.

The standard library cites RFC 2616 Section 3.7.1 as a justification for using ISO-8859-1. Indeed, it states: ...... When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined...

urllib3 does not declare Content-Type. So pushing with a "default" charset is out of the specifications here.

RFC 2616 is now obsolete, I believe the relevant RFC is RFC 9110.

Agreed!

The media type is indeed unknown to us so we should not send a Content-Type.

Not necessarily. str is to be considered plain text, this is not an objective way to pass data.
So, there, when the server receives something that it must label to application/octet-stream it breaks the initial intent.

For me it is the most important point, initial intent, users without intermediary knowledge on this may be surprised to see what the server saves or yield afterward. Linked to #3045

And nothing specifies the default encoding anymore, so I suppose using UTF-8 is OK

So this is counterintuitive, we are making the very same mistake as the early days. Not specifying the charset explicitly by looking at our current era. Who can tell how long before websites switch massively to UTF-16 or... something else entirely?
It would not cost us much to be explicit there.

Also, the And nothing specifies the default encoding anymore can be interpreted as it does not exist, therefor no default, no implicit, always explicit.

But that's still a breaking change. If we want to keep it, we should mention it in the release notes. (And be consistent, as mentioned in #3053 (comment) we sometimes continue using ISO-8859-1.)

Agreed! 👌 Should be mentioned.

sethmlarson · 2023-06-01T19:54:09Z

Thinking about the long-term vision for urllib3, I don't think there's a "right" default for us in this situation. Historically we've done latin-1, we know the web is almost all utf-8, but in general even if we do provide an encoding default we still don't want users to lean on it at all and instead we'd prefer they provide us bytes (or some other native type that we can make assumptions around like json= or fields=).

Having thought about it for a bit, I'm wondering if we should consider reverting back to latin-1 and deprecating sending non-bytes in the body parameter (either a single string, StringIO, or in an iterable). For reverting, I'm thinking that we didn't intend for this to happen since we didn't document it in the changelog, so might catch more folks off-guard. The warning message provides us a chance to get folks to do the right thing instead and weans us off this ambiguous situation.

pquentin · 2023-06-02T11:20:56Z

I'm happy to revert back to latin-1 in the interest of not breaking people needlessly.

A warning sounds a bit heavy-handed however. For example, today, if you have not converted to urllib3.request(json=...), the obvious way to send json is with body=json.dumps(data), which is going to emit an ASCII string, and is perfectly safe. We could call str.isascii (which is O(1) and thus fast in Python 3.7+) to only emit a warning if the result would have been different with latin-1 vs utf-8, however, but that sounds a bit complicated and surprising.

oinopion · 2023-10-02T12:46:31Z

I've just hit this problem on Friday when upgrading my dependencies. I'm using urllib3 via requests, so for me 1.x -> 2.x was just a matter of transitive dependencies upgrade, but I did notice urllib3 and I checked the release notes for mentions of breaking changes. In our case, the end service actually expects latin1, but since this was default, we never realised that the code didn't explicitly encode the body.

I've actually found it very surprising that a str body was default-encoded using latin1, especially that a quick browse of the old docs doesn't reveal it (but maybe I didn't search well enough). If anything, I would expect it to call str.encode(), which defaults to UTF-8 or maybe ASCII as a safe "base" encoding.

Some things that would improve the situation for me:

since this is a known thing, mentioning it on the 2.0 migration page would have saved me couple hours of head scratching
I would appreciate having a runtime warning to tell me that I'm defaulting to non-UTF-8 encoding, something that would prompt me to encode it on my side explicitly
I would prefer the default to stay as UTF-8, just additionally mentioned in the docs 🙏

Hope this feedback is helpful and thanks for your hard work on this incredibly useful tool 🙌

sethmlarson · 2024-01-09T04:06:05Z

I agree completely with @oinopion's proposal, let's do that if no one else objects? Going to require @pquentin's review on any potential PRs since he was involved in the discussion.

pquentin mentioned this issue Jun 7, 2023

Encode str bodies to Latin-1 instead of UTF-8 #3063

Open

nenb mentioned this issue Oct 9, 2023

Bound urllib3<2 for consistent CPython3.X behaviour SuperDuperDB/superduperdb#1062

Merged

5 tasks

sethmlarson added Contributor Friendly ♥ 💰 Bounty $100 If you complete this issue we'll pay you $100 on OpenCollective! labels Jan 9, 2024

zawan-ila linked a pull request Jan 21, 2024 that will close this issue

encode stringIO bodies to utf-8 #3296

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

urllib3 with version 2.0.* treats control character differently #3053

urllib3 with version 2.0.* treats control character differently #3053

XiangFrank commented May 30, 2023

sethmlarson commented May 31, 2023

Ousret commented May 31, 2023 •

edited

pquentin commented May 31, 2023 •

edited

Ousret commented Jun 1, 2023 •

edited

sethmlarson commented Jun 1, 2023 •

edited

pquentin commented Jun 2, 2023

oinopion commented Oct 2, 2023 •

edited

sethmlarson commented Jan 9, 2024

urllib3 with version 2.0.* treats control character differently #3053

urllib3 with version 2.0.* treats control character differently #3053

Comments

XiangFrank commented May 30, 2023

Subject

Environment

Steps to Reproduce

Expected Behavior

Actual Behavior

sethmlarson commented May 31, 2023

Ousret commented May 31, 2023 • edited

pquentin commented May 31, 2023 • edited

1.26.x vs 2.0

RFC 2616 vs RFC 91110

Ousret commented Jun 1, 2023 • edited

sethmlarson commented Jun 1, 2023 • edited

pquentin commented Jun 2, 2023

oinopion commented Oct 2, 2023 • edited

sethmlarson commented Jan 9, 2024

Ousret commented May 31, 2023 •

edited

pquentin commented May 31, 2023 •

edited

Ousret commented Jun 1, 2023 •

edited

sethmlarson commented Jun 1, 2023 •

edited

oinopion commented Oct 2, 2023 •

edited