Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

encode stringIO bodies to utf-8 #3296

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

zawan-ila
Copy link
Contributor

Closes 3053

I am not sure if I should add a warning.

Copy link
Member

@pquentin pquentin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! That's a good step in the right direction. We still need to make a few changes. While most of them are mentioned inline, another one is that we need a test for the encoding of body too, since we don't have one. See https://github.com/urllib3/urllib3/pull/3063/files for inspiration, where you can copy test_encode_body_latin_1 (but rename it test_encode_body_utf8 and use UTF-8).

@@ -55,6 +55,7 @@ Here's a short summary of which changes in urllib3 v2.0 are most important:
- Changed the default minimum TLS version to TLS 1.2 (previously was TLS 1.0).
- Removed support for verifying certificate hostnames via ``commonName``, now only ``subjectAltName`` is used.
- Removed the default set of TLS ciphers, instead now urllib3 uses the list of ciphers configured by the system.
- Changed the default encoding for string bodies to ``utf-8``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sethmlarson What do you think of a more detailed entry? Should it be more detailed only in CHANGES.rst (in the 2.0.0 release notes, at the beginning of the "Changed" section), or also detailed in the v2 migration guide?

Suggested change
- Changed the default encoding for string bodies to ``utf-8``.
- Changed encoding of ``str`` body chunks from UTF-8 to ISO-8859-1, and encoding of ``str`` body from ISO-8859-1 to UTF-8. This change was accidental. In an upcoming release, body chunks will also be encoded as UTF-8, to ensure all ``str`` bodies get encoded to UTF-8, for consistency. If you need a specific encoding, use ``str.encode`` to pass already-encoded bytes to urllib3 (`#3053 <https://github.com/urllib3/urllib3/issues/3053>`__).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

str body chunks seems somewhat unclear. e.g Does it include string bodies sent as chunked encoding? I feel that using StringIO bodies or string streams would be clearer?

@@ -227,7 +227,7 @@ def chunk_readable() -> typing.Iterable[bytes]:
if not datablock:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't comment unchanged lines, but please also specify the utf-8 encoding to the to_bytes call. Explicit is better than implicit.

@@ -2380,7 +2380,7 @@ def body_generator() -> typing.Generator[bytes, None, None]:
body.seek(0, 0)
should_be_chunked = True
elif body_type == "file_text":
body = io.StringIO("x" * 10)
body = io.StringIO("x\x80\x81")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want all chunked bodies to be encoded to UTF-8, not only StringIO. Can you please use x * 9 + \x80 everywhere?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

urllib3 with version 2.0.* treats control character differently
2 participants