New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
allow byte values in HTTPHeaderDict #3279
base: main
Are you sure you want to change the base?
allow byte values in HTTPHeaderDict #3279
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I realize we have a bounty for this issue, I'm not sure that blindly assuming latin-1
in 2024 is the right thing to do. In RFC 9110 (started in 2018 and published in 2022 and which applies to all HTTP versions), Section 5.5. Field Values states (emphasis mine):
Field values are usually constrained to the range of US-ASCII characters [USASCII]. Fields needing a greater range of characters can use an encoding, such as the one defined in [RFC8187]. Historically, HTTP allowed field content with text in the ISO-8859-1 charset [ISO-8859-1], supporting other charsets only through use of [RFC2047] encoding. Specifications for newly defined fields SHOULD limit their values to visible US-ASCII octets (VCHAR), SP, and HTAB. A recipient SHOULD treat other allowed octets in field content (i.e., obs-text) as opaque data.
What we could do, if the user insists in sending bytes, is to join values using b", "
, which I guess could work? Or we could refuse bytes altogether. (The current exception is not helpful and I'd like to fix that somehow.)
Also, would you mind checking what encoding is used when "Schönefeld/1.18.0" is passed as a string?
Do you mean
What do you suggest here? I feel like It'd be better to be able to send bytes. However, it seems like it is not possible right now to access raw response headers? Because we rely on http.client for them, which appears to decode them as 'latin-1'. This state seems slightly odd. i.e to be able to send raw byte values but not be able to receive them. [*] Edited for clarity |
Yes, As pointed out in #2397 (comment), it makes sense to be able to send headers as bytes, because the user may not want to send latin-1. As mentioned in the HTTP spec referenced above, So my current thinking is that we should indeed join byte values using @sethmlarson By the way, did your opinion change since 2021? Should we explicitly allow bytes in the API as was done in this pull request? |
04da19e
to
4000221
Compare
@pquentin I have updated the PR. What do you think now? PS: There's some lint issues. Plus we need to update some type annotations. Do let me know if they are blockers for a review. |
@zawan-ila Yes, type annotation failures usually block review, especially for a typing change and with so many errors:
That said, #3343 made this pull request much more important so I'll take a look anyway! Sorry for not answering before. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this work!
Can you please fix the mypy errors before I can review in more detail? I suspect this will involve explicitly typing self._container
as Mapping[str, str | bytes]
.
You will also need to add a changelog entry.
And finally, we need to test with -bb
to check for string/bytes comparisons, but the use of the h2 library makes that more difficult.
src/urllib3/_collections.py
Outdated
assert type(vals[-1]) is type( | ||
val | ||
), "Can not mix strings and bytes in header values" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should not be an assert. asserts are reserved for logic error in our own code and can be eliminated in production. You should raise a TypeError
here for mixed values.
sock.sendall( | ||
b"HTTP/1.1 200 OK\r\n" | ||
b"Server: example.com\r\n" | ||
b"Content-Length: 0\r\n" + request_headers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please send request headers in the body to avoid the decoding that is done by httplib on responses.
# Note that b"\xc2\x80" is Â\x80 when decoded as latin-1 | ||
assert r.headers["rawcookie"] == "Â\x80" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the above change, you should ultimately be able to write assert r.data == b"rawcookie: \xc2\x80"
@pquentin I've updated the types. However I am not too happy with my work in this regard. There are too many ignores. The main difficulties appear to me to be the following:-
It would be really nice if you could have a look and provide some helpful suggestions. [PS] I also now understand why type annotations ought to be a blocker for review :-) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While request header values can be bytes, response header values are not.
Saying that all header values can be bytes is not a problem in itself, but if it requires adding type ignores everywhere, then we could reconsider. I think this depends on what we want to do with HTTP/2 headers, which we haven't defined yet.
HTTP/2 headers passed to the h2 library have to be bytes, but we can ask configure h2 to decode our headers. If we do that, then we could use generics to have one HTTPHeaderDict for requests (support str | bytes) and one for responses (only supporting str), which would help.
@sethmlarson @illia-v What do you think?
The values in self._container are lists such that the first value is the header string and the rest are the values. Hence one is forced to type them as list[str | bytes]
Yeah, that's a hack to keep the original case of the key. But at least it's self-contained in HTTPHeaderDictView. We could solve this by introducing a second dict, but I would need to see microbenchmarks and a memory profile to see the performance impact.
This is my first time using mypy.
That's OK! We're here to help, and you're doing great anyway. I'm happy that you're also trying to find a clean solution. Also, sorry for the delay between reviews.
[PS] I also now understand why type annotations ought to be a blocker for review :-)
:-)
@@ -155,7 +155,7 @@ def keys(self) -> set[_KT]: # type: ignore[override] | |||
return set(self._container.keys()) | |||
|
|||
|
|||
class HTTPHeaderDictItemView(typing.Set[typing.Tuple[str, str]]): | |||
class HTTPHeaderDictItemView(typing.Set[typing.Tuple[str, typing.Union[str, bytes]]]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What error do you get when using str | bytes
? Union types are supported by Python 3.10+ (https://docs.python.org/3/library/stdtypes.html#types-union), and I can see other places (line 240) where they appear to work just fine.
@@ -629,7 +629,7 @@ def test_request_with_json(self, headers: HTTPHeaderDict) -> None: | |||
content_type = HTTPHeaderDict(old_headers).get( | |||
"Content-Type", "application/json" | |||
) | |||
assert content_type in r.headers["Content-Type"].replace(" ", "").split(",") | |||
assert content_type in r.headers["Content-Type"].replace(" ", "").split(",") # type: ignore[arg-type] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the mypy error here?
@@ -339,7 +341,7 @@ def __init__( | |||
self.chunked = False | |||
tr_enc = self.headers.get("transfer-encoding", "").lower() | |||
# Don't incur the penalty of creating a list and then discarding it | |||
encodings = (enc.strip() for enc in tr_enc.split(",")) | |||
encodings = (enc.strip() for enc in tr_enc.split(",")) # type: ignore[arg-type] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the mypy error here?
Closes 3072
Overview
I've followed the same path that supports byte keys. Open to any other suggestions.