encode stringIO bodies to utf-8 #3296

zawan-ila · 2024-01-21T20:44:08Z

Closes #3053

I am not sure if I should add a warning.

pquentin

Thanks! That's a good step in the right direction. We still need a few changes. Most of them are mentioned inline. In addition, we need a test for the encoding of body, since we don't have one. See https://github.com/urllib3/urllib3/pull/3063/files for inspiration, where you can copy test_encode_body_latin_1 (but rename it to test_encode_body_utf8 and use UTF-8).

pquentin · 2024-01-22T07:47:53Z

docs/v2-migration-guide.rst

@@ -55,6 +55,7 @@ Here's a short summary of which changes in urllib3 v2.0 are most important:
 - Changed the default minimum TLS version to TLS 1.2 (previously was TLS 1.0).
 - Removed support for verifying certificate hostnames via ``commonName``, now only ``subjectAltName`` is used.
 - Removed the default set of TLS ciphers, instead now urllib3 uses the list of ciphers configured by the system.
+- Changed the default encoding for string bodies to ``utf-8``.


@sethmlarson What do you think of a more detailed entry? Should it be more detailed only in CHANGES.rst (in the 2.0.0 release notes, at the beginning of the "Changed" section), or also detailed in the v2 migration guide?

Suggested change

-- Changed the default encoding for string bodies to ``utf-8``.
+- Changed encoding of ``str`` body chunks from UTF-8 to ISO-8859-1, and encoding of ``str`` body from ISO-8859-1 to UTF-8. This change was accidental. In an upcoming release, body chunks will also be encoded as UTF-8, to ensure all ``str`` bodies get encoded to UTF-8, for consistency. If you need a specific encoding, use ``str.encode`` to pass already-encoded bytes to urllib3 (`#3053 <https://github.com/urllib3/urllib3/issues/3053>`__).

"str body chunks" seems somewhat unclear. For example, does it include string bodies sent with chunked encoding? I feel that "StringIO bodies" or "string streams" would be clearer.
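To illustrate the last sentence of the suggested entry, here is a minimal sketch of encoding up front with the standard library only. The sample text and variable names are assumptions for illustration, not taken from the PR. The resulting bytes (or binary stream) would then be passed as the request body.

```python
import io

text = "caf\u00e9"  # illustrative sample text, not from the PR

# Encode explicitly so the bytes on the wire do not depend on a library default.
body_utf8 = text.encode("utf-8")
body_latin1 = text.encode("iso-8859-1")

# For a stream, wrap already-encoded bytes instead of using io.StringIO.
stream = io.BytesIO(body_latin1)

print(body_utf8)    # b'caf\xc3\xa9'
print(body_latin1)  # b'caf\xe9'
```

Passing bytes or a binary stream sidesteps the question of which default applies to which kind of str body.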

pquentin · 2024-01-22T07:49:37Z

src/urllib3/util/request.py

@@ -227,7 +227,7 @@ def chunk_readable() -> typing.Iterable[bytes]:
                if not datablock:


I can't comment on unchanged lines, but please also pass the utf-8 encoding explicitly to the to_bytes call. Explicit is better than implicit.
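A rough sketch of what the explicit-encoding request could look like for a readable body. This is not the code in src/urllib3/util/request.py: the `to_bytes` below is a hypothetical local stand-in, and the `blocksize` parameter and its value are assumptions chosen to keep the example small.

```python
import io
import typing


def to_bytes(x: typing.Union[str, bytes], encoding: str = "utf-8") -> bytes:
    # Hypothetical stand-in for the library helper; bytes pass through unchanged.
    if isinstance(x, bytes):
        return x
    return x.encode(encoding)


def chunk_readable(
    body: typing.IO[typing.Any], blocksize: int = 4
) -> typing.Iterable[bytes]:
    # Read fixed-size blocks; text blocks are encoded with an explicit encoding.
    while True:
        datablock = body.read(blocksize)
        if not datablock:
            break
        yield to_bytes(datablock, "utf-8")


chunks = list(chunk_readable(io.StringIO("x" * 9 + "\x80")))
binary_chunks = list(chunk_readable(io.BytesIO(b"abcdef")))
```

Naming the encoding at the call site means the behavior does not silently change if the helper's default ever does.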

pquentin · 2024-01-22T10:34:54Z

test/with_dummyserver/test_socketlevel.py

@@ -2380,7 +2380,7 @@ def body_generator() -> typing.Generator[bytes, None, None]:
            body.seek(0, 0)
            should_be_chunked = True
        elif body_type == "file_text":
-            body = io.StringIO("x" * 10)
+            body = io.StringIO("x\x80\x81")


We want all chunked bodies to be encoded to UTF-8, not only StringIO bodies. Can you please use "x" * 9 + "\x80" everywhere?
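A small standard-library sketch of why a body ending in \x80 makes a useful test input, while the old all-ASCII body did not. The variable names are illustrative only.

```python
body = "x" * 9 + "\x80"

utf8 = body.encode("utf-8")
latin1 = body.encode("iso-8859-1")

# U+0080 is one byte in ISO-8859-1 but two bytes in UTF-8.
print(len(latin1), latin1[-1:])  # 10 b'\x80'
print(len(utf8), utf8[-2:])      # 11 b'\xc2\x80'

# A pure-ASCII body cannot tell the two encodings apart.
ascii_body = "x" * 10
same = ascii_body.encode("utf-8") == ascii_body.encode("iso-8859-1")
print(same)  # True
```

Because the encoded lengths differ, a test can check either the received bytes or the received length to detect the wrong encoding.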

encode stringIO bodies to utf-8

f41ac89

pquentin reviewed Jan 22, 2024

ecerulm mentioned this pull request Jan 23, 2024

Pin to pypy-3.9-v7.3.13 in CI #3308

Merged

add and improve tests

c01de23

zawan-ila requested a review from pquentin January 25, 2024 21:55
