Can't modify type of headers object within HTTPConnection.request() #2021

sethmlarson · 2020-10-09T14:47:43Z

Captured from these two comments:

Follow this logic:

Get the list of headers
If they're all bytes return b", ".join(val[1:])
Otherwise return ", ".join(val[1:])

This way if a user uses mixed types they still get the same explosion? I'd like a little poking around with what happens when httplib is passed both bytes and str headers on Python 2.x since it had strange behavior when doing so with HTTP method and URL.

cc @jalopezsilva

The text was updated successfully, but these errors were encountered:

sigmavirus24 · 2020-10-09T22:35:46Z

Where do we then normalize things and what if someone adds a value that's bytes and another that's a str?

Edit to add an example:

headers = HTTPHeaderDict()
headers.add("Cookie", "sessionId=abc1234")
headers.add("Cookie", b"lang=en-US")

I don't think headers["Cookie"] should raise an exception there other than KeyError. If we want to raise exceptions like that, I'd strongly prefer deprecating the __getitem__, __setitem__, etc. methods because we really shouldn't be using the dictionary interface at that point. Also, not all headers are valid as joined strings like that, so if someone instead did

headers = HTTPHeaderDict()
headers.add("Content-Length", b"7")
headers.add("Content-Length", "8")

We shouldn't be emitting Content-Length: 7, 8 but I think we would do that here today. I'd rather us have something like .get() which raises that exception leaving getlist as well to get the full list (and their partners)

sigmavirus24 · 2020-10-10T18:10:00Z

I've been mulling over this more than I probably should. I, for one, would rather see us pick a "right" way to pass data to urllib3. In other words, I'd rather see us standardize (somehow) on bytes. I find that trying to use text can be a bit too tricky in something like this especially if we take the stance of utf-8 is default. That'll break horribly for some people and we need to give them an escape hatch. Meanwhile, if we build everything as bytes, then we can build something on top to convert to text by default or let users do that when they need it.

An alternative, if we want to provide everything, which I'm less fond of us taking a tact like the standard library. urllib.parse has both ParseResult and ParseResultBytes (rfc3986 copied this for API compatibility). It means that within one class you can be confident in what you're using and reject the input you don't want confidently. We could have HTTPHeaderDict and HTTPHeaderBytesDict (although, if we go with my preference in my last comment, I'd rather this just be HTTPHeaders and HTTPHeadersBytes dropping "Dict" because we won't support the dictionary access interface).

sethmlarson · 2020-10-11T22:18:49Z

Just putting some thoughts down, here's the state of things:

botocore reimplements HTTPConnection._send_request() which calls headers.get("Expect", b"") == b"100-continue"
botocore sends request(..., headers={"Expect": b"100-continue"}). Note this is a dict.
urllib3 1.25.10 doesn't touch headers at all on the way down to http.client
urllib3 master HTTPConnection.request() changes headers into a HTTPHeaderDict so once it reaches the .get() call in botocore it explodes because of the ", ".join(val[1:]) call
(Note that HTTPConnection.request_chunked() converts headers to HTTPHeaderDict and has been doing so for 5 years but doesn't hit ._send_request() because it only calls putheader(), endheaders(), and send())

Given this information the change that allows skipping user-agent would need to be tweaked to account for this and pass headers as the same type as input to HTTPConnection._send_request() in v1.26.0.

For v2:

I'm not sure how much value versus we get by changing much of what HTTPHeaderDict does currently (beyond maybe normalizing input key/val to str (via encoding key in ascii and val in latin-1) which is how most users use the interface?

Regarding the other points raised:

Definitely agree that the mapping interfaces are not optimal, however users most likely prefer these and they're what http.client expects.
Maybe we can add a function get_folded() which does what __getitem__() does currently?
Accessing "Cookie" should raise an exception when accessed with a folding method
Receiving Content-Length: 6 and Content-Length: 7 should raise InvalidHeader
I like the idea of normalizing Content-Length multiples (ie Content-Length: 1,1,1 -> Content-Length: 1)

sigmavirus24 · 2020-10-11T23:17:53Z

Definitely agree that the mapping interfaces are not optimal, however users most likely prefer these and they're what http.client expects.

We're going to likely store a dict under the covers regardless, can we not simply do a translation?

Maybe we can add a function get_folded() which does what __getitem__() does currently?

Then what does __getitem__ do when there are multiples? ValueError? Something else?

Accessing "Cookie" should raise an exception when accessed with a folding method

Huh? Why? Cookie headers to servers can be safely folded just not with , but with ;. It's folding just not with ,.

Receiving Content-Length: 6 and Content-Length: 7 should raise InvalidHeader

Depends on how you receive it? If we're keeping dictionaries then I'd argue __setitem__ should just override it.

I like the idea of normalizing Content-Length multiples (ie Content-Length: 1,1,1 -> Content-Length: 1)

So then we're introducing a special case where we don't raise InvalidHeader? How about an alternate idea:

We can classify headers (more or less):

Standard/Common RFC Defined and Foldable (e.g., Cookie, Content-Encoding, etc.)
Standard/Common RFC Defined and Not Foldable (e.g., Content-Length, Content-Type, Host)
"Custom" and Probably Foldable (e.g., Request-Id, Server-Timing, etc.)
(Maybe others I'm missing)

It shouldn't be hard to have a DuplicateHeaderValue (or ExtraHeaderValue) exception for something that's not foldable (or thought about differently, shouldn't have multiple values). If we weren't committed to a dictionary interface here, we could have add() which sets or extends a header's value or raises DuplicateHeaderValue (or whatever) if the key shouldn't have more than one value. We could provide .override()/.replace() for folks wanting to replace that.

I think that over the years, a big problem people run into (not just using requests/urllib3 but in many languages) is duplicating headers they shouldn't be able to. In requests I think this is most common when someone sets Host and something below requests duplicates that regardless of the setting in headers={}. I've seen Jetty servers send Transfer-Encoding: chunked, chunked. That caused a tonne of trouble for my friend using requests because we (requests+urllib3) didn't handle that well. Yeah, if we de-duplicated that, nothing would have been an issue, but I think if we change the interface to HTTPHeaders we eliminate a footgun.

sethmlarson · 2020-10-12T02:12:24Z

Just an FYI I'm going to respond out-of-order from your reply.

Maybe we can add a function get_folded() which does what __getitem__() does currently?

Then what does __getitem__ do when there are multiples? ValueError? Something else?

Currently it raises a TypeError if the multiples are different types. What it doesn't do is join

Accessing "Cookie" should raise an exception when accessed with a folding method

Huh? Why? Cookie headers to servers can be safely folded just not with , but with ;. It's folding just not with ,.

I was confusing Cookie behavior with Set-Cookie behavior, sorry for the confusion.

Definitely agree that the mapping interfaces are not optimal, however users most likely prefer these and they're what http.client expects.

We're going to likely store a dict under the covers regardless, can we not simply do a translation?

Receiving Content-Length: 6 and Content-Length: 7 should raise InvalidHeader

Depends on how you receive it? If we're keeping dictionaries then I'd argue __setitem__ should just override it.

I like the idea of normalizing Content-Length multiples (ie Content-Length: 1,1,1 -> Content-Length: 1)

So then we're introducing a special case where we don't raise InvalidHeader? How about an alternate idea:

We can classify headers (more or less):
* Standard/Common RFC Defined and Foldable (e.g., `Cookie`, `Content-Encoding`, etc.)

* Standard/Common RFC Defined and Not Foldable (e.g., `Content-Length`, `Content-Type`, `Host`)

* "Custom" and Probably Foldable (e.g., `Request-Id`, `Server-Timing`, etc.)
  (Maybe others I'm missing)
It shouldn't be hard to have a DuplicateHeaderValue (or ExtraHeaderValue) exception for something that's not foldable (or thought about differently, shouldn't have multiple values). If we weren't committed to a dictionary interface here, we could have add() which sets or extends a header's value or raises DuplicateHeaderValue (or whatever) if the key shouldn't have more than one value. We could provide .override()/.replace() for folks wanting to replace that.

I think that over the years, a big problem people run into (not just using requests/urllib3 but in many languages) is duplicating headers they shouldn't be able to. In requests I think this is most common when someone sets Host and something below requests duplicates that regardless of the setting in headers={}. I've seen Jetty servers send Transfer-Encoding: chunked, chunked. That caused a tonne of trouble for my friend using requests because we (requests+urllib3) didn't handle that well. Yeah, if we de-duplicated that, nothing would have been an issue, but I think if we change the interface to HTTPHeaders we eliminate a footgun.

So with v2 we can work a lot more on request headers. Currently HTTPHeaderDict is being used for HTTPResponse and as a case-insensitive dict within request_chunked() but it's use is incidental/doesn't impact any other APIs unlike .request(). It being used within HTTPConnection.request() was a mistake and will be fixed. I think there'd be value in opening up a separate discussion issue specifically for request headers. This will be especially important for trying to keep functional API compatibility and being strict about what types the headers parameter receives at each stage.

sigmavirus24 · 2020-10-12T14:30:02Z

One last thought, because some headers can be repeated, I'm genuinely of the opinion that dictionaries are the wrong way to think about headers. They really are a sequence of names and values. Accessing them like a dictionary is convenient, but part of me wishes we could standardize on a list of tuples (which would be hard for deduplication, etc.)

sethmlarson mentioned this issue Oct 9, 2020

Add SKIP_HEADER for skipping automatically added headers #2018

Merged

sethmlarson added this to the v1.26 milestone Oct 9, 2020

sethmlarson changed the title ~~HTTPHeaderDict.__getitem__() should return same type as input if possible~~ Can't modify type of headers object within HTTPConnection.request() Oct 11, 2020

sethmlarson closed this as completed in #2018 Oct 27, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Can't modify type of headers object within HTTPConnection.request() #2021

Can't modify type of headers object within HTTPConnection.request() #2021

sethmlarson commented Oct 9, 2020 •

edited

sigmavirus24 commented Oct 9, 2020 •

edited

sigmavirus24 commented Oct 10, 2020

sethmlarson commented Oct 11, 2020

sigmavirus24 commented Oct 11, 2020

sethmlarson commented Oct 12, 2020

sigmavirus24 commented Oct 12, 2020

Can't modify type of headers object within HTTPConnection.request() #2021

Can't modify type of headers object within HTTPConnection.request() #2021

Comments

sethmlarson commented Oct 9, 2020 • edited

sigmavirus24 commented Oct 9, 2020 • edited

sigmavirus24 commented Oct 10, 2020

sethmlarson commented Oct 11, 2020

sigmavirus24 commented Oct 11, 2020

sethmlarson commented Oct 12, 2020

sigmavirus24 commented Oct 12, 2020

sethmlarson commented Oct 9, 2020 •

edited

sigmavirus24 commented Oct 9, 2020 •

edited