urllib3 with version 2.0.* treats control character differently #3053
Comments
You're right that something changed. I believe this is caused by us favoring
About that, urllib3/src/urllib3/util/request.py Line 230 in b63cc4c
StringIO objects are still encoded as Latin-1. We may want to consider adding a proper Content-Type when someone passes a str as the body. Or the opposite: someone passes a str and specifies a Content-Type with a valid encoding.
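One way a caller can sidestep the ambiguity today is to encode the text explicitly and declare the charset. The sketch below is only an illustration: `prepare_text_body` is a hypothetical helper, not part of urllib3, and the `text/plain` media type is an assumption.

```python
def prepare_text_body(text, charset="utf-8", media_type="text/plain"):
    """Hypothetical helper (not a urllib3 API): encode text explicitly
    and build a Content-Type header that declares the same charset."""
    body = text.encode(charset)
    headers = {"Content-Type": f"{media_type}; charset={charset}"}
    return body, headers


body, headers = prepare_text_body("test\x80\x80\x01\x01\x81")
print(len(body), headers)
```

Because `body` is already bytes, the client has no text left to encode, so the default encoding of either urllib3 version should no longer matter for this request.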
Indeed,
>>> len("test\x80\x80\x01\x01\x81".encode("utf-8"))
12
>>> len("test\x80\x80\x01\x01\x81".encode("latin1"))
9
1.26.x vs 2.0
urllib3 1.26.x relies on the standard library for the encoding. urllib3 2.0 does its own encoding: urllib3/src/urllib3/util/request.py Lines 214 to 217 in b63cc4c
Since no encoding is provided to
RFC 2616 vs RFC 9110
The standard library cites RFC 2616 Section 3.7.1 as a justification for using ISO-8859-1. Indeed, it states:
RFC 2616 is now obsolete; I believe the relevant RFC is RFC 9110. Per https://www.rfc-editor.org/rfc/rfc9110#name-content-type, here is the current recommendation:
The media type is indeed unknown to us, so we should not send a Content-Type. And nothing specifies the default encoding anymore, so I suppose using UTF-8 is OK, and it's up to the recipient to work out the encoding, I think? And of course UTF-8 is just the better encoding here: it is well designed, can encode all of Unicode, and is the dominant encoding on the web.
But that's still a breaking change. If we want to keep it, we should mention it in the release notes. (And be consistent: as mentioned in #3053 (comment), we sometimes continue using ISO-8859-1.)
@sethmlarson @Ousret Thoughts?
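To make the trade-off between the two encodings concrete, here is a small standard-library sketch. The sample strings and the `try_encode` helper are made up for illustration. It shows that Latin-1 fails on text outside U+0000 to U+00FF, while UTF-8 encodes everything but uses more bytes for non-ASCII characters.

```python
def try_encode(text, encoding):
    """Return the encoded bytes, or None if the text cannot be encoded."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


# Illustrative samples only; none of them come from the issue itself.
samples = ["café", "price: 5 €", "日本語"]

for sample in samples:
    latin1 = try_encode(sample, "latin-1")
    utf8 = try_encode(sample, "utf-8")
    print(repr(sample), "latin-1:", latin1, "utf-8:", utf8)
```

Only the first sample survives Latin-1 encoding; the other two contain characters that Latin-1 cannot represent.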
Hello @pquentin Even by RFC 2616, we are in the wrong.
urllib3 does not declare a Content-Type, so sending the body with a "default" charset is outside the specification here.
Agreed!
Not necessarily. For me, the most important point is the initial intent: users without intermediate knowledge of this may be surprised by what the server saves or yields afterward. Linked to #3045
So this is counterintuitive: we are making the very same mistake as in the early days, leaving the charset unspecified because of what is common in our current era. Who can tell how long it will be before websites switch en masse to UTF-16 or... something else entirely? Also, the
Agreed! 👌 Should be mentioned.
Thinking about the long-term vision for urllib3, I don't think there's a "right" default for us in this situation. Historically we've done Having thought about it for a bit, I'm wondering if we should consider reverting back to
I'm happy to revert back to latin-1 in the interest of not breaking people needlessly. A warning sounds a bit heavy-handed, however. For example, today, if you have not converted to
I just hit this problem on Friday when upgrading my dependencies. I'm using urllib3 via requests, so for me 1.x -> 2.x was just a transitive dependency upgrade, but I did notice urllib3 and checked the release notes for mentions of breaking changes. In our case, the end service actually expects I've actually found it very surprising that a Some things that would improve the situation for me:
Hope this feedback is helpful and thanks for your hard work on this incredibly useful tool 🙌
Subject
Since version 2.0.0, urllib3 treats control characters in the request body differently from previous versions.
For control characters like '\x08' and '\x01', urllib3 versions before and after 2.0.0 count them as different numbers of bytes, which can cause a Content-Length mismatch in some cases.
Environment
I am using Python 3.10, and I found that this bug exists in all 2.0.* versions.
Steps to Reproduce
This bug is easy to reproduce. Here is the sample code I used:
Run this code under different versions of urllib3 (I used 1.26.6 and 2.0.2 for testing) and you will find that the 'Content-Length' of the response is completely different.
Expected Behavior
The response body for the request should be the same under both versions of urllib3.
Actual Behavior
The 'Content-Length' is 9 bytes with version 1.26.6, but 12 bytes with version 2.0.2.
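The 9 versus 12 byte difference can be checked without any network access. The sketch below uses only the standard library and the string that appears in the comments on this issue; `byte_widths` is a made-up helper name. It suggests that the extra bytes come from characters in the U+0080 to U+00FF range ('\x80', '\x81'), while ASCII control characters such as '\x01' and '\x08' take one byte in both encodings.

```python
text = "test\x80\x80\x01\x01\x81"


def byte_widths(s, encoding):
    """Number of bytes each character of s occupies in the given encoding."""
    return [len(ch.encode(encoding)) for ch in s]


latin1_widths = byte_widths(text, "latin-1")
utf8_widths = byte_widths(text, "utf-8")

# Characters whose encoded size differs between the two encodings.
differing = [
    repr(ch)
    for ch, a, b in zip(text, latin1_widths, utf8_widths)
    if a != b
]

print(sum(latin1_widths), sum(utf8_widths))  # 9 12
print(differing)
```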