Change headers to a dict that parses comma-separated values #7679

Dreamsorcerer · 2023-10-08T18:15:38Z

This is a proposal to change the headers from a CIMultiDict to a more regular dict (in v4). The problem with the multidict approach is that list headers (i.e. headers that can have multiple values) can have values combined in single headers and/or split over multiple headers.

Basically, these 2 payloads should be considered equivalent:

Foo: 1
Foo: 2

Foo: 1, 2

Currently, however, aiohttp produces this in the first case:

headers["Foo"]  # "1"
headers.getall("Foo")  # ["1", "2"]

And, in the second case:

headers["Foo"]  # "1, 2"
headers.getall("Foo")  # ["1, 2"]

The spec recommends concatenating duplicate headers together with ", ". This is also what the vast majority of existing software does (including requests).
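
To illustrate the combining step, here is a minimal sketch. It is not the code in this PR: `combine_field_lines` is a hypothetical helper, and it assumes the field lines arrive as plain `(name, value)` pairs.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def combine_field_lines(lines: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Join repeated field lines into one value per (lowercased) field name."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for name, value in lines:
        # Field names are case-insensitive, so group on the lowercased name.
        grouped[name.lower()].append(value.strip())
    # Values are joined in the order the field lines were received.
    return {name: ", ".join(values) for name, values in grouped.items()}


split_payload = [("Foo", "1"), ("Foo", "2")]
joined_payload = [("Foo", "1, 2")]
assert combine_field_lines(split_payload) == combine_field_lines(joined_payload)
print(combine_field_lines(split_payload))  # {'foo': '1, 2'}
```

One caveat the sketch ignores: RFC 9110 calls out Set-Cookie as a field that cannot be combined this way, so a real implementation would presumably need to treat it separately.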

The only problem with this is that a user who wants the values as a list is left to parse the combined value themselves. Once quoted values are accounted for, that parsing becomes quite complex and easy to get wrong in edge cases. So, in this proposal I've concatenated the values as recommended, but added a .getall() method which parses the final value to produce the list.

With the code in this PR, both of the previous payloads produce the same output:

headers["Foo"]  # "1, 2"
headers.getall("Foo")  # ["1", "2"]
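
As a rough idea of why the list parsing is fiddly, here is a sketch of a quote-aware splitter. It is an illustration only, not the implementation in this PR: `split_list_value` is a hypothetical name, it leaves the surrounding quotes on each item, and its handling of malformed input (e.g. an unterminated quote is kept as one item) is a choice made for the sketch that may differ from the PR's tests.

```python
from typing import List


def split_list_value(value: str) -> List[str]:
    """Split a combined field value on commas that sit outside quoted strings."""
    items: List[str] = []
    current: List[str] = []
    in_quotes = False
    escaped = False
    for char in value:
        if escaped:
            # Character following a backslash inside a quoted string.
            current.append(char)
            escaped = False
        elif in_quotes and char == "\\":
            current.append(char)
            escaped = True
        elif char == '"':
            current.append(char)
            in_quotes = not in_quotes
        elif char == "," and not in_quotes:
            items.append("".join(current).strip())
            current = []
        else:
            current.append(char)
    items.append("".join(current).strip())
    # Empty list elements (e.g. from "a,,b") are dropped.
    return [item for item in items if item]


assert split_list_value("1, 2") == ["1", "2"]
assert split_list_value('"a, b", c') == ['"a, b"', "c"]
```

A naive `value.split(",")` would break the second case into three pieces, which is the kind of edge case users are otherwise left to handle themselves.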

"Field value" now refers to the value after multiple field lines are combined with commas -- by far the most common use.
From RFC 9110, Appendix B.2

From kenballus's testing:

Servers that join the duplicate headers by default: Apache httpd, Caddy, Gunicorn, H2O, IIS, Lighttpd, Nginx, Node.js, Puma.
Servers that accept the duplicate headers without joining them: aiohttp (currently), Boost::Beast, Mongoose, Tornado.

I've also seen that requests combines them into a regular dict. The only other library I've seen that uses a multidict for this is Starlette.

Also: https://www.rfc-editor.org/rfc/rfc9110.html#name-recipient-requirements

steverep · 2024-01-31T21:50:09Z

Some questions to consider here:

Should the getall() list be converted to lowercase when the values are case-insensitive (which I think is true for most, if not all)?
If yes, should the list then also be de-duplicated?
If yes, should content negotiation headers that can have quality values per RFC 9110 be parsed and assigned, e.g. by returning a dictionary instead of a list or tuple as {"<value>": <quality>, ...}?

I think all of these being done directly in the parsers is best for performance, and would make for easier and less error-prone usage.

webknjaz · 2024-02-01T00:10:31Z

aiohttp/http_parser.py

 HEXDIGIT = re.compile(rb"[0-9a-fA-F]+")


+class HeadersDictProxy(Mapping[str, str]):


Have you compared the performance of this against inheriting from multidict and overriding how the data is stored and how the combined values are output?

Dreamsorcerer · 2024-02-02

Implementation details are not a concern yet, so I will come back to that. First, I want to get consensus that this is the correct approach and should be changed in v4, given that it is likely to cause backwards-compatibility breakages for at least a small proportion of users.

Dreamsorcerer · 2024-02-02T13:29:30Z

> Some questions to consider here:

If you've got any information in the specs to answer those questions, that'd be great to have.

steverep · 2024-02-03T23:27:47Z

> If you've got any information in the specs to answer those questions, that'd be great to have.

Just by their nature, I think it's certainly safe to deduplicate the content negotiation fields defined in Section 12.5 of RFC 9110. However, AFAICT, the RFC has no guidance on what quality value to assign if the duplicates happen to disagree. It seems that leaving it to the server's choice in that edge case would be conformant.

Other list headers should not be deduplicated because duplicates can actually mean something. For example, Content-Encoding: "gzip, gzip" means the content was double compressed with gzip.
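
To make the Content-Encoding point concrete, here is a small sketch using the standard-library `gzip` module. `decode_body` is a hypothetical helper that only understands gzip, and it assumes the codings are listed in the order they were applied, so they are undone in reverse.

```python
import gzip
from typing import List


def decode_body(body: bytes, content_encoding: List[str]) -> bytes:
    """Undo content codings in reverse of the order they were applied."""
    for coding in reversed(content_encoding):
        if coding.lower() == "gzip":
            body = gzip.decompress(body)
        else:
            raise ValueError(f"unsupported coding in this sketch: {coding}")
    return body


original = b"hello"
twice = gzip.compress(gzip.compress(original))
assert decode_body(twice, ["gzip", "gzip"]) == original
# De-duplicating to ["gzip"] would leave one layer of compression in place.
assert decode_body(twice, ["gzip"]) != original
```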

Dreamsorcerer · 2024-02-04T00:19:42Z

OK, now having had time to think this over, I think it's all a level above what we should be doing here. I think we're just providing a list based on the definition of a general HTTP field. So, I feel the answer is no to all 3 questions. Maintaining logic for all the different kinds of headers seems out of scope to me (and if we did, it'd be through dedicated attributes, like cookies).

steverep · 2024-02-04T05:16:03Z

> OK, now having time to think over this, I think it's all a level above what we should be doing here. I think we're just providing a list based on the definition of a general HTTP field. So, I feel the answer is no to all 3 questions. Maintaining logic for all the different kinds of headers seems out of scope to me (and if we did, it'd be through dedicated attributes, like cookies).

After reading the spec a bit more, I guess I agree with you on the first two, but parameters are actually generically defined in Section 5.6.6 of RFC 9110. The syntax and case-insensitivity of parameter names are defined there (the content negotiation headers just happen to use "q" as the name). I think the parsers should provide a way to access them as a dictionary (maybe by returning a list subclass?).
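
As a sketch of what dictionary access to parameters could look like for a single list item: `parse_parameters` is a hypothetical helper, not an existing aiohttp API. It lowercases parameter names, strips one pair of surrounding quotes from values, and naively splits on ";", so it does not handle semicolons or backslash escapes inside quoted strings.

```python
from typing import Dict, Tuple


def parse_parameters(item: str) -> Tuple[str, Dict[str, str]]:
    """Split one list item into its value and a dict of its parameters."""
    value, *raw_params = item.split(";")
    params: Dict[str, str] = {}
    for raw in raw_params:
        name, sep, param_value = raw.partition("=")
        if not sep:
            continue  # Skip fragments that have no "=".
        param_value = param_value.strip()
        if len(param_value) >= 2 and param_value[0] == param_value[-1] == '"':
            # Drop one pair of surrounding double quotes.
            param_value = param_value[1:-1]
        # Parameter names are case-insensitive, so store them lowercased.
        params[name.strip().lower()] = param_value
    return value.strip(), params


assert parse_parameters("gzip;Q=0.8") == ("gzip", {"q": "0.8"})
assert parse_parameters('text/html; charset="utf-8"') == ("text/html", {"charset": "utf-8"})
```

Whether something like this belongs in the generic headers object or in dedicated attributes is exactly the scope question being discussed here.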

Dreamsorcerer · 2024-02-04T13:26:35Z

tests/test_http_parser.py

+        (('"applebanna, this',), ('"applebanna', "this")),
+        (('fooo", "bar"',), ('fooo"', "bar")),
+        ((" spam , eggs ",), ("spam", "eggs")),
+        ((" spam ", " eggs "), ("spam", "eggs")),


TODO: Add tests for escaped quotes (e.g. "foo\"bar"), maybe also escaped backslash, if that's valid (e.g. "foo\\" or "foo\\\"").

Change headers to a dict that parses comma-separated values

4f0d53e

Dreamsorcerer added the backport:skip Skip backport bot label Oct 8, 2023

Dreamsorcerer added this to the 4.0 milestone Oct 8, 2023

[pre-commit.ci] auto fixes from pre-commit.com hooks

b2225ce

for more information, see https://pre-commit.ci

steverep mentioned this pull request Jan 31, 2024

Accept-Encoding header parsing and interpretation #8104

Open

4 tasks

webknjaz reviewed Feb 1, 2024
