Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

urllib3 with version 2.0.* treats control character differently #3053

Open
XiangFrank opened this issue May 30, 2023 · 8 comments · May be fixed by #3296
Open

urllib3 with version 2.0.* treats control character differently #3053

XiangFrank opened this issue May 30, 2023 · 8 comments · May be fixed by #3296
Labels
💰 Bounty $100 If you complete this issue we'll pay you $100 on OpenCollective! Contributor Friendly ♥

Comments

@XiangFrank
Copy link

Subject

The problem is after version 2.0.0, if you have control characters in your request body, urllib3 will count/treat them differently than previous version.

For control characters like '\x08', '\x01', urllib3 after version 2.0.0 and before will count them into different bytes, which might cause the Content-Length mismatch in some cases.

Environment

The environment I have is Python3.10, and I found this bug exists in all 2.0.* versions.

Steps to Reproduce

This bug is easy to be reproduced. Here is a sample code I used:

import urllib3

poolmanager = urllib3.PoolManager(num_pools=10, maxsize=3, block=False)
conn = poolmanager.connection_from_url("https://httpbin.org")
resp = conn.urlopen(method="POST", url="/post", body='test\x80\x80\x01\x01\x81',  headers={
                                 'User-Agent': 'python-api/2', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'})
print(resp._body)

Run this code under different version of urllib3 (in my case, I used ver 1.26.6 and 2.0.2 for testing) you will find the 'Content-Length' of the response is totally different.

Expected Behavior

The response body of the request under two different versions of urllib3 should be the same.

Actual Behavior

The 'Content-Length' for ver 1.26.6 is 9 bytes, but for ver 2.0.2, it is 12 bytes.

@sethmlarson
Copy link
Member

You're right that something changed, I believe this is caused by us favoring utf-8 encoding of strings in v2.0 instead of latin-1 in v1.26.x.

@Ousret
Copy link
Contributor

Ousret commented May 31, 2023

About that,
There is a miss at

datablock = datablock.encode("iso-8859-1")

StringIO objects still are Latin-1 encoded.
We want to consider adding a proper Content-Type when someone passes str as a body. Or the opposite, someone passes str but precise the Content-Type with specific valid encoding.

@pquentin
Copy link
Member

pquentin commented May 31, 2023

Indeed, \x80 and \x81 are outside of the 7-bit range, so they get represented using two bytes with UTF-8:

>>> len("test\x80\x80\x01\x01\x81".encode("utf-8"))
12
>>> len("test\x80\x80\x01\x01\x81".encode("latin1"))
9

1.26.x vs 2.0

urllib3 1.26.x relies on the standard library to encode str bodies, see https://github.com/python/cpython/blob/03ad6624c2b4a30cccf6d5b7e2b9999e104444ae/Lib/http/client.py#L1362-L1365. It uses the iso-8859-1 encoding.

urllib3 2.0 does its own encoding:

# Bytes or strings become bytes
elif isinstance(body, (str, bytes)):
chunks = (to_bytes(body),)
content_length = len(chunks[0])

Since no encoding is provided to to_bytes, it will call str.encode(), and the default encoding there is UTF-8: https://docs.python.org/3/library/stdtypes.html#str.encode. (This seems quite implicit, it would be clearer to call to_bytes with the utf-8 encoding).

RFC 2616 vs RFC 91110

The standard library cites RFC 2616 Section 3.7.1 as a justification for using ISO-8859-1. Indeed, it states:

The "charset" parameter is used with some media types to define the character set (section 3.4) of the data. When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value. See section 3.4.1 for compatibility problems.

RFC 2616 is now obsolete, I believe the relevant RFC is RFC 9110. Per https://www.rfc-editor.org/rfc/rfc9110#name-content-type, here is the current recommendation:

The "Content-Type" header field indicates the media type of the associated representation: either the representation enclosed in the message content or the selected representation, as determined by the message semantics. The indicated media type defines both the data format and how that data is intended to be processed by a recipient, within the scope of the received message semantics, after any content codings indicated by Content-Encoding are decoded.

Content-Type = media-type

Media types are defined in Section 8.3.1. An example of the field is

Content-Type: text/html; charset=ISO-8859-4

A sender that generates a message containing content SHOULD generate a Content-Type header field in that message unless the intended media type of the enclosed representation is unknown to the sender. If a Content-Type header field is not present, the recipient MAY either assume a media type of "application/octet-stream" ([RFC2046], Section 4.5.1) or examine the data to determine its type.

The media type is indeed unknown to us so we should not send a Content-Type. And nothing specifies the default encoding anymore, so I suppose using UTF-8 is OK, and it's up to the recipient to realize that I think? And of course UTF-8 is just the better encoding here. It is well designed, can encode all of Unicode and is the dominant encoding on the web.

But that's still a breaking change. If we want to keep it, we should mention it in the release notes. (And be consistent, as mentioned in #3053 (comment) we sometimes continue using ISO-8859-1.)

@sethmlarson @Ousret Thoughts?

@Ousret
Copy link
Contributor

Ousret commented Jun 1, 2023

Hello @pquentin

Even by RFC 2616, we are in the wrong.

The standard library cites RFC 2616 Section 3.7.1 as a justification for using ISO-8859-1. Indeed, it states: ...... When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined...

urllib3 does not declare Content-Type. So pushing with a "default" charset is out of the specifications here.

RFC 2616 is now obsolete, I believe the relevant RFC is RFC 9110.

Agreed!

The media type is indeed unknown to us so we should not send a Content-Type.

Not necessarily. str is to be considered plain text, this is not an objective way to pass data.
So, there, when the server receives something that it must label to application/octet-stream it breaks the initial intent.

For me it is the most important point, initial intent, users without intermediary knowledge on this may be surprised to see what the server saves or yield afterward. Linked to #3045

And nothing specifies the default encoding anymore, so I suppose using UTF-8 is OK

So this is counterintuitive, we are making the very same mistake as the early days. Not specifying the charset explicitly by looking at our current era. Who can tell how long before websites switch massively to UTF-16 or... something else entirely?
It would not cost us much to be explicit there.

Also, the And nothing specifies the default encoding anymore can be interpreted as it does not exist, therefor no default, no implicit, always explicit.

But that's still a breaking change. If we want to keep it, we should mention it in the release notes. (And be consistent, as mentioned in #3053 (comment) we sometimes continue using ISO-8859-1.)

Agreed! 👌 Should be mentioned.

@sethmlarson
Copy link
Member

sethmlarson commented Jun 1, 2023

Thinking about the long-term vision for urllib3, I don't think there's a "right" default for us in this situation. Historically we've done latin-1, we know the web is almost all utf-8, but in general even if we do provide an encoding default we still don't want users to lean on it at all and instead we'd prefer they provide us bytes (or some other native type that we can make assumptions around like json= or fields=).

Having thought about it for a bit, I'm wondering if we should consider reverting back to latin-1 and deprecating sending non-bytes in the body parameter (either a single string, StringIO, or in an iterable). For reverting, I'm thinking that we didn't intend for this to happen since we didn't document it in the changelog, so might catch more folks off-guard. The warning message provides us a chance to get folks to do the right thing instead and weans us off this ambiguous situation.

@pquentin
Copy link
Member

pquentin commented Jun 2, 2023

I'm happy to revert back to latin-1 in the interest of not breaking people needlessly.

A warning sounds a bit heavy-handed however. For example, today, if you have not converted to urllib3.request(json=...), the obvious way to send json is with body=json.dumps(data), which is going to emit an ASCII string, and is perfectly safe. We could call str.isascii (which is O(1) and thus fast in Python 3.7+) to only emit a warning if the result would have been different with latin-1 vs utf-8, however, but that sounds a bit complicated and surprising.

@oinopion
Copy link

oinopion commented Oct 2, 2023

I've just hit this problem on Friday when upgrading my dependencies. I'm using urllib3 via requests, so for me 1.x -> 2.x was just a matter of transitive dependencies upgrade, but I did notice urllib3 and I checked the release notes for mentions of breaking changes. In our case, the end service actually expects latin1, but since this was default, we never realised that the code didn't explicitly encode the body.

I've actually found it very surprising that a str body was default-encoded using latin1, especially that a quick browse of the old docs doesn't reveal it (but maybe I didn't search well enough). If anything, I would expect it to call str.encode(), which defaults to UTF-8 or maybe ASCII as a safe "base" encoding.

Some things that would improve the situation for me:

  • since this is a known thing, mentioning it on the 2.0 migration page would have saved me couple hours of head scratching
  • I would appreciate having a runtime warning to tell me that I'm defaulting to non-UTF-8 encoding, something that would prompt me to encode it on my side explicitly
  • I would prefer the default to stay as UTF-8, just additionally mentioned in the docs 🙏

Hope this feedback is helpful and thanks for your hard work on this incredibly useful tool 🙌

@sethmlarson
Copy link
Member

I agree completely with @oinopion's proposal, let's do that if no one else objects? Going to require @pquentin's review on any potential PRs since he was involved in the discussion.

@sethmlarson sethmlarson added Contributor Friendly ♥ 💰 Bounty $100 If you complete this issue we'll pay you $100 on OpenCollective! labels Jan 9, 2024
@zawan-ila zawan-ila linked a pull request Jan 21, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
💰 Bounty $100 If you complete this issue we'll pay you $100 on OpenCollective! Contributor Friendly ♥
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants