Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encode str bodies to Latin-1 instead of UTF-8 #3063

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

pquentin
Copy link
Member

@pquentin pquentin commented Jun 7, 2023

Relates #3053 by fixing the immediate regression. We may introduce warnings in the future to help users not rely on this behavior. The changes are split in 3 commits (but squashing is fine if we merge this):

  1. I've changed the default encoding of simple bodies from UTF-8 to Latin-1, and made the code more explicit and tested
  2. Even though the formal name of the encoding is ISO-8859-1, we used latin-1 in enough places that I preferred to be consistent and updated the remaining occurrences of iso-8859-1 to latin-1.
  3. Chunked encoding already used Latin-1, but I've changed the test to ensure it stays that way in the future.

@pquentin pquentin changed the title Encode string bodies to Latin-1 isntead of UTF-8 Encode str bodies to Latin-1 instead of UTF-8 Jun 7, 2023
@@ -65,7 +65,7 @@ def _test_body(self, data: bytes | str | None) -> None:

assert b"Transfer-Encoding: chunked" in header.split(b"\r\n")
if data:
bdata = data if isinstance(data, bytes) else data.encode("utf-8")
bdata = data if isinstance(data, bytes) else data.encode("latin-1")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually suspicious because 1.26.x used utf-8 here:

def _test_body(self, data):
self.start_chunked_handler()
with HTTPConnectionPool(self.host, self.port, retries=False) as pool:
pool.urlopen("GET", "/", data, chunked=True)
header, body = self.buffer.split(b"\r\n\r\n", 1)
assert b"Transfer-Encoding: chunked" in header.split(b"\r\n")
if data:
bdata = data if isinstance(data, bytes) else data.encode("utf-8")
assert b"\r\n" + bdata + b"\r\n" in body
assert body.endswith(b"\r\n0\r\n\r\n")
len_str = body.split(b"\r\n", 1)[0]
stated_len = int(len_str, 16)
assert stated_len == len(bdata)
else:
assert body == b"0\r\n\r\n"
def test_bytestring_body(self):
self._test_body(b"thisshouldbeonechunk\r\nasdf")
def test_unicode_body(self):
self._test_body(u"thisshouldbeonechunk\r\näöüß")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weird, but there was no test case for non-ASCII characters?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few non-ASCII characters on line 75

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And changing data.encode("utf-8") to data.encode("latin-1") breaks the test in 1.26.x

test_unicode_body failed; it passed 0 out of the required 1 times.
        <class 'AssertionError'>
        assert ((b'\r\n' + b'thisshouldbeonechunk\r\n\xe4\xf6\xfc\xdf') + b'\r\n') in b'1e\r\nthisshouldbeonechunk\r\n\xc3\xa4\xc3\xb6\xc3\xbc\xc3\x9f\r\n0\r\n\r\n'

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related encoding happens on this line in 1.26.x

chunk = chunk.encode("utf8")

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, good find! So we were encoding chunks with UTF-8 and other bodies with Latin-1, and 2.0 reversed that unintentionally. I'll fix that too.

@sethmlarson
Copy link
Member

Adding my two cents, if it's encoding as UTF-8 and the world isn't broken we may want to call this a "feature" and encode every body as UTF-8. That's what the world is using nowadays anyways, so we're likely to be causing more issues by using latin-1 instead?

@pquentin
Copy link
Member Author

pquentin commented Nov 15, 2023

I like the idea, but note that chunks are encoded as UTF-8 in 1.26.x but encoded as Latin-1 in 2.x. So we should only fix that part: encoding chunks as UTF-8?

@sethmlarson
Copy link
Member

@pquentin Oh gotcha, yeah let's UTF-8 all the things!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants