Change headers to a dict that parses comma-separated values #7679

Draft
wants to merge 2 commits into
base: master

Dreamsorcerer
Member

@Dreamsorcerer Dreamsorcerer commented Oct 8, 2023

This is a proposal to change the headers from a CIMultiDict to a more regular dict (in v4). The problem with the multidict approach is that list headers (i.e. headers that can have multiple values) can have values combined in single headers and/or split over multiple headers.


Basically, these 2 payloads should be considered equivalent:

Foo: 1
Foo: 2

Foo: 1, 2

But, currently aiohttp will produce in the first case:

headers["Foo"]  # "1"
headers.getall("Foo")  # ["1", "2"]

And, in the second case:

headers["Foo"]  # "1, 2"
headers.getall("Foo")  # ["1, 2"]

The spec recommends concatenating duplicate headers together with ", ". This is also what the vast majority of existing software does (including requests).
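A minimal sketch of the joining behaviour described above, not the code in this PR. The helper name `combine_field_lines` and the choice to key the result by lowercased name are assumptions made for illustration only:

```python
from collections import defaultdict


def combine_field_lines(lines):
    """Join repeated field lines into one ", "-separated value per name.

    `lines` is an iterable of (name, value) pairs in the order received.
    Names are lowercased here because field names are case-insensitive.
    """
    grouped = defaultdict(list)
    for name, value in lines:
        grouped[name.lower()].append(value.strip())
    return {name: ", ".join(values) for name, values in grouped.items()}


# The two payloads from the description end up identical once combined.
split_payload = [("Foo", "1"), ("Foo", "2")]
joined_payload = [("Foo", "1, 2")]
combined_split = combine_field_lines(split_payload)
combined_joined = combine_field_lines(joined_payload)
```

Note that RFC 9110 singles out Set-Cookie as a field that cannot be combined this way, so a real implementation would need to special-case it.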

The only problem with this is that if the user wants the values as a list, they are left to parse the value themselves, which becomes quite complex once quoted values are taken into account and is easy to get wrong on edge cases. So, in this proposal I've concatenated the values as recommended, but added a .getall() method which parses the final value to get the list.

With the code in this PR, both of the previous payloads produce the same output:

headers["Foo"]  # "1, 2"
headers.getall("Foo")  # ["1", "2"]
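To illustrate why splitting the combined value is harder than `value.split(",")`, here is a rough sketch of a quote-aware splitter. The name `split_list_value` is hypothetical, this is not the PR's implementation, and it makes no attempt to match the PR's handling of unbalanced quotes (here an unterminated quote simply runs to the end of the value, and quotes are kept in the returned items):

```python
def split_list_value(value):
    """Split a combined field value on commas that are outside quoted strings."""
    items = []
    current = []
    in_quotes = False
    escaped = False
    for ch in value:
        if escaped:
            # Character following a backslash inside a quoted string.
            current.append(ch)
            escaped = False
        elif in_quotes and ch == "\\":
            current.append(ch)
            escaped = True
        elif ch == '"':
            in_quotes = not in_quotes
            current.append(ch)
        elif ch == "," and not in_quotes:
            items.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    items.append("".join(current).strip())
    # Empty list elements (e.g. "a,,b") are dropped.
    return [item for item in items if item]
```

With this sketch, `split_list_value("1, 2")` gives `["1", "2"]`, while a comma inside a quoted string, as in `'"a, b", c'`, stays within its item.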

"Field value" now refers to the value after multiple field lines are combined with commas -- by far the most common use.
From RFC 9110, Appendix B.2

From kenballus's testing:

Servers that join the duplicate headers by default: Apache httpd, Caddy, Gunicorn, H2O, IIS, Lighttpd, Nginx, Node.js, Puma.
Servers that accept the duplicate headers without joining them: aiohttp (currently), Boost::Beast, Mongoose, Tornado.

I've also seen that requests combines into a regular dict. The only other library I've seen that uses a multidict for this is Starlette.

Also: https://www.rfc-editor.org/rfc/rfc9110.html#name-recipient-requirements

@Dreamsorcerer Dreamsorcerer added the backport:skip Skip backport bot label Oct 8, 2023
@Dreamsorcerer Dreamsorcerer added this to the 4.0 milestone Oct 8, 2023
@steverep
Contributor

Some questions to consider here:

  1. Should the getall() list be converted to lowercase when the values are case-insensitive (which I think is true for most, if not all)?
  2. If yes, should the list then also be de-duplicated?
  3. If yes, should content negotiation headers that can have quality values per RFC 9110 be parsed and assigned, e.g. by returning a dictionary instead of a list or tuple as {"<value>": <quality>, ...}?

I think all of these being done directly in the parsers is best for performance, and would make for easier and less error-prone usage.

HEXDIGIT = re.compile(rb"[0-9a-fA-F]+")


class HeadersDictProxy(Mapping[str, str]):
Member

Have you compared the performance against inheriting from multidict and overriding how the data is stored and how the combined values are output?

Member Author

Implementation details are not a concern yet, so I will come back to that. First, I want to get consensus that this is the correct approach and should be changed in v4, given that it is likely to cause backwards-compatibility breakages for at least a small proportion of users.

@Dreamsorcerer
Member Author

> Some questions to consider here:

If you've got any information in the specs to answer those questions, that'd be great to have.

@steverep
Contributor

steverep commented Feb 3, 2024

> If you've got any information in the specs to answer those questions, that'd be great to have.

Just by their nature, I think it's certainly safe to deduplicate the content negotiation fields defined in Section 12.5 of RFC 9110. However, AFAICT, the RFC has no guidance on what quality value to assign if the duplicates happen to disagree. Seems like server's choice in that edge case would be conformant.

Other list headers should not be deduplicated because duplicates can actually mean something. For example, Content-Encoding: "gzip, gzip" means the content was double compressed with gzip.
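As a rough illustration of that distinction (a sketch only, not something this PR implements), de-duplication could be opt-in per field name. The `NEGOTIATION_FIELDS` set and the helper `maybe_deduplicate` are hypothetical names, and the set below is an assumed selection of request fields from Section 12.5:

```python
# Assumed selection of content negotiation request fields (RFC 9110, Section 12.5).
NEGOTIATION_FIELDS = frozenset(
    {"accept", "accept-charset", "accept-encoding", "accept-language"}
)


def maybe_deduplicate(name, values):
    """Drop case-insensitive duplicates, but only for negotiation fields.

    The first spelling of each value wins and order is preserved. Items
    that differ in their parameters (e.g. disagreeing q-values) are NOT
    merged here, since the spec gives no guidance for that case.
    """
    if name.lower() not in NEGOTIATION_FIELDS:
        return list(values)
    seen = {}
    for value in values:
        seen.setdefault(value.lower(), value)
    return list(seen.values())
```

Here `maybe_deduplicate("Content-Encoding", ["gzip", "gzip"])` keeps both entries, because the repetition is meaningful, while the same list under `Accept-Encoding` collapses to a single `gzip`.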

@Dreamsorcerer
Member Author

OK, now having had time to think this over, I think it's all a level above what we should be doing here. I think we're just providing a list based on the definition of a general HTTP field. So, I feel the answer is no to all 3 questions. Maintaining logic for all the different kinds of headers seems out of scope to me (and if we did, it'd be through dedicated attributes, like cookies).

@steverep
Contributor

steverep commented Feb 4, 2024

> OK, now having had time to think this over, I think it's all a level above what we should be doing here. I think we're just providing a list based on the definition of a general HTTP field. So, I feel the answer is no to all 3 questions. Maintaining logic for all the different kinds of headers seems out of scope to me (and if we did, it'd be through dedicated attributes, like cookies).

After reading the spec a bit more, I guess I agree with you on the first two, but parameters are actually generically defined in section 5.6.6. The syntax and case-insensitivity of parameter names is defined there (the content negotiation headers just happen to use "q" as the name). I think the parsers should provide a way to access them as a dictionary (maybe by returning a list subclass?).
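A rough sketch of what generic parameter access could look like for a single list item. The helper names are hypothetical and this is not part of the PR. It assumes the `;`-separated `name=value` form, lowercases parameter names because they are case-insensitive, and unquotes quoted-string values:

```python
import re


def split_outside_quotes(text, sep):
    """Split `text` on `sep`, ignoring separators inside quoted strings."""
    parts, current = [], []
    in_quotes = escaped = False
    for ch in text:
        if escaped:
            escaped = False
        elif in_quotes and ch == "\\":
            escaped = True
        elif ch == '"':
            in_quotes = not in_quotes
        elif ch == sep and not in_quotes:
            parts.append("".join(current))
            current = []
            continue
        current.append(ch)
    parts.append("".join(current))
    return parts


def unquote(value):
    """Strip surrounding double quotes and resolve backslash escapes."""
    if len(value) >= 2 and value[0] == value[-1] == '"':
        return re.sub(r"\\(.)", r"\1", value[1:-1])
    return value


def parse_parameters(item):
    """Return (value, {param_name: param_value}) for one list item."""
    head, *raw_params = split_outside_quotes(item, ";")
    params = {}
    for raw in raw_params:
        name, eq, value = raw.strip().partition("=")
        if not eq:
            continue  # Empty or malformed parameter; skipped in this sketch.
        params[name.strip().lower()] = unquote(value.strip())
    return head.strip(), params
```

For example, `parse_parameters("text/html;q=0.8")` returns `("text/html", {"q": "0.8"})`. Values are left as strings, so interpreting `q` as a number would remain the caller's job.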

(('"applebanna, this',), ('"applebanna', "this")),
(('fooo", "bar"',), ('fooo"', "bar")),
((" spam , eggs ",), ("spam", "eggs")),
((" spam ", " eggs "), ("spam", "eggs")),
Member Author

TODO: Add tests for escaped quotes (e.g. "foo\"bar"), maybe also escaped backslash, if that's valid (e.g. "foo\\" or "foo\\\"").


3 participants