Encode str bodies to Latin-1 instead of UTF-8 #3063

pquentin · 2023-06-07T07:04:28Z

Relates to #3053 by fixing the immediate regression. We may introduce warnings in the future to help users not rely on this behavior. The changes are split into 3 commits (but squashing is fine if we merge this):

- I've changed the default encoding of simple bodies from UTF-8 to Latin-1, and made the code more explicit and tested.
- Even though the formal name of the encoding is ISO-8859-1, we used latin-1 in enough places that I preferred to be consistent and updated the remaining occurrences of iso-8859-1 to latin-1.
- Chunked encoding already used Latin-1, but I've changed the test to ensure it stays that way in the future.
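
To make the difference between the two codecs concrete, here is a small sketch using only the standard library (it is not urllib3 code). The euro sign is just an illustrative character that falls outside Latin-1.

```python
text = "äöüß"

latin1 = text.encode("latin-1")  # one byte per character
utf8 = text.encode("utf-8")      # two bytes per character for these code points

print(latin1)  # b'\xe4\xf6\xfc\xdf'
print(utf8)    # b'\xc3\xa4\xc3\xb6\xc3\xbc\xc3\x9f'

# Latin-1 only covers U+0000..U+00FF, so anything beyond that range
# cannot be encoded at all, while UTF-8 handles it.
latin1_error = None
try:
    "€".encode("latin-1")
except UnicodeEncodeError as exc:
    latin1_error = exc
    print("cannot encode as latin-1:", exc.reason)
```

So the choice of default changes both the bytes on the wire and which str bodies are accepted in the first place.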

pquentin · 2023-06-07T09:38:06Z

test/with_dummyserver/test_chunked_transfer.py

@@ -65,7 +65,7 @@ def _test_body(self, data: bytes | str | None) -> None:

            assert b"Transfer-Encoding: chunked" in header.split(b"\r\n")
            if data:
-                bdata = data if isinstance(data, bytes) else data.encode("utf-8")
+                bdata = data if isinstance(data, bytes) else data.encode("latin-1")


This is actually suspicious because 1.26.x used utf-8 here:

urllib3/test/with_dummyserver/test_chunked_transfer.py

Lines 53 to 75 in 3c01480

    def _test_body(self, data):
        self.start_chunked_handler()
        with HTTPConnectionPool(self.host, self.port, retries=False) as pool:
            pool.urlopen("GET", "/", data, chunked=True)
            header, body = self.buffer.split(b"\r\n\r\n", 1)

            assert b"Transfer-Encoding: chunked" in header.split(b"\r\n")
            if data:
                bdata = data if isinstance(data, bytes) else data.encode("utf-8")
                assert b"\r\n" + bdata + b"\r\n" in body
                assert body.endswith(b"\r\n0\r\n\r\n")

                len_str = body.split(b"\r\n", 1)[0]
                stated_len = int(len_str, 16)
                assert stated_len == len(bdata)
            else:
                assert body == b"0\r\n\r\n"

    def test_bytestring_body(self):
        self._test_body(b"thisshouldbeonechunk\r\nasdf")

    def test_unicode_body(self):
        self._test_body(u"thisshouldbeonechunk\r\näöüß")

sethmlarson · 2023-06-07

Weird, but there was no test case for non-ASCII characters?

illia-v · 2023-06-27

There are a few non-ASCII characters on line 75.

illia-v · 2023-06-27

And changing data.encode("utf-8") to data.encode("latin-1") breaks the test in 1.26.x:

test_unicode_body failed; it passed 0 out of the required 1 times. <class 'AssertionError'> assert ((b'\r\n' + b'thisshouldbeonechunk\r\n\xe4\xf6\xfc\xdf') + b'\r\n') in b'1e\r\nthisshouldbeonechunk\r\n\xc3\xa4\xc3\xb6\xc3\xbc\xc3\x9f\r\n0\r\n\r\n'

illia-v · 2023-06-27

Related encoding happens on this line in 1.26.x:

urllib3/src/urllib3/connection.py

Line 274 in 57181d6

chunk = chunk.encode("utf8")

pquentin · 2023-07-24

Thanks, good find! So we were encoding chunks with UTF-8 and other bodies with Latin-1, and 2.0 reversed that unintentionally. I'll fix that too.
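
The body in the assertion message above already shows which encoding 1.26.x used for the chunk. As a quick check (a standalone sketch, with the bytes copied from that message), the stated chunk size can be compared against the length of the string under each encoding:

```python
# Body bytes copied from the assertion message in this thread.
body = b"1e\r\nthisshouldbeonechunk\r\n\xc3\xa4\xc3\xb6\xc3\xbc\xc3\x9f\r\n0\r\n\r\n"
text = "thisshouldbeonechunk\r\näöüß"

# The first line of a chunk is its size in hexadecimal.
size_line, rest = body.split(b"\r\n", 1)
stated_len = int(size_line, 16)
chunk = rest[:stated_len]

print(stated_len)                      # 30
print(len(text.encode("utf-8")))       # 30
print(len(text.encode("latin-1")))     # 26
print(chunk == text.encode("utf-8"))   # True
```

The size line `1e` (30) only matches the UTF-8 length, which is consistent with the `chunk.encode("utf8")` line quoted from connection.py.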

sethmlarson · 2023-11-15T03:25:23Z

Adding my two cents: if it's encoding as UTF-8 and the world isn't broken, we may want to call this a "feature" and encode every body as UTF-8. That's what the world is using nowadays anyway, so we're likely to cause more issues by using Latin-1 instead?

pquentin · 2023-11-15T07:42:38Z

I like the idea, but note that chunks are encoded as UTF-8 in 1.26.x but encoded as Latin-1 in 2.x. So we should only fix that part: encoding chunks as UTF-8?

sethmlarson · 2023-11-16T05:20:18Z

@pquentin Oh gotcha, yeah let's UTF-8 all the things!
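
As a rough sketch of what "encoding chunks as UTF-8" means on the wire, the helper below frames a single chunk for `Transfer-Encoding: chunked`. Note that `frame_chunk` and `LAST_CHUNK` are hypothetical names used for illustration only; this is not urllib3's implementation, and the UTF-8 default is the assumption under discussion here.

```python
from __future__ import annotations


def frame_chunk(chunk: bytes | str) -> bytes:
    """Hypothetical helper: frame one chunk for chunked transfer encoding."""
    if isinstance(chunk, str):
        # Assumption for this sketch: str chunks are encoded as UTF-8,
        # matching the 1.26.x behavior discussed in this thread.
        chunk = chunk.encode("utf-8")
    # Chunk size in hex, CRLF, the chunk data, CRLF.
    return b"%x\r\n%b\r\n" % (len(chunk), chunk)


# A zero-length chunk terminates the body.
LAST_CHUNK = b"0\r\n\r\n"

body = frame_chunk("thisshouldbeonechunk\r\näöüß") + LAST_CHUNK
print(body)
```

With these assumptions, the result is the same byte string that appears as the actual body in the 1.26.x assertion failure quoted earlier, and bytes chunks pass through unchanged.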

pquentin added 4 commits June 7, 2023 10:57

Encode string bodies with latin-1, not utf-8

80ad241

Use latin-1 instead of iso-8859-1 for consistency

d97db4e

Test that chunks are encoded as latin-1 too

712392c

Fix test_unicode_body

df16e3b

pquentin changed the title ~~Encode string bodies to Latin-1 isntead of UTF-8~~ Encode str bodies to Latin-1 instead of UTF-8 Jun 7, 2023

Merge branch 'main' into latin-1-default

ad344ff

sethmlarson requested a review from illia-v November 15, 2023 03:24
