New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
encode stringIO bodies to utf-8 #3296
base: main
Are you sure you want to change the base?
encode stringIO bodies to utf-8 #3296
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! That's a good step in the right direction. We still need to make a few changes. While most of them are mentioned inline, another one is that we need a test for the encoding of body
too, since we don't have one. See https://github.com/urllib3/urllib3/pull/3063/files for inspiration, where you can copy test_encode_body_latin_1
(but rename it test_encode_body_utf8
and use UTF-8).
@@ -55,6 +55,7 @@ Here's a short summary of which changes in urllib3 v2.0 are most important: | |||
- Changed the default minimum TLS version to TLS 1.2 (previously was TLS 1.0). | |||
- Removed support for verifying certificate hostnames via ``commonName``, now only ``subjectAltName`` is used. | |||
- Removed the default set of TLS ciphers, instead now urllib3 uses the list of ciphers configured by the system. | |||
- Changed the default encoding for string bodies to ``utf-8``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sethmlarson What do you think of a more detailed entry? Should it be more detailed only in CHANGES.rst (in the 2.0.0 release notes, at the beginning of the "Changed" section), or also detailed in the v2 migration guide?
- Changed the default encoding for string bodies to ``utf-8``. | |
- Changed encoding of ``str`` body chunks from UTF-8 to ISO-8859-1, and encoding of ``str`` body from ISO-8859-1 to UTF-8. This change was accidental. In an upcoming release, body chunks will also be encoded as UTF-8, to ensure all ``str`` bodies get encoded to UTF-8, for consistency. If you need a specific encoding, use ``str.encode`` to pass already-encoded bytes to urllib3 (`#3053 <https://github.com/urllib3/urllib3/issues/3053>`__). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
str body chunks
seems somewhat unclear. e.g Does it include string bodies sent as chunked encoding? I feel that using StringIO bodies
or string streams
would be clearer?
@@ -227,7 +227,7 @@ def chunk_readable() -> typing.Iterable[bytes]: | |||
if not datablock: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't comment unchanged lines, but please also specify the utf-8
encoding to the to_bytes
call. Explicit is better than implicit.
@@ -2380,7 +2380,7 @@ def body_generator() -> typing.Generator[bytes, None, None]: | |||
body.seek(0, 0) | |||
should_be_chunked = True | |||
elif body_type == "file_text": | |||
body = io.StringIO("x" * 10) | |||
body = io.StringIO("x\x80\x81") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want all chunked bodies to be encoded to UTF-8, not only StringIO. Can you please use x * 9 + \x80
everywhere?
Closes 3053
I am not sure if I should add a warning.