Allow response headers to contain array of values #1598

Closed
ioquatix opened this issue Feb 11, 2020 · 34 comments

@ioquatix
Member

A while ago I discussed this with @tenderlove - we currently use \n characters to separate multiple header values, e.g.

{"set-cookie" => "foo=x\nbar=y"}

Internally HeaderHash uses an Array I believe, e.g. something like:

{"set-cookie" => ["foo=x", "bar=y"]}

Personally I like this formulation better: it's more efficient for the server implementation, and it's easier for the user to deal with too. It seems like less of a hack, and it should minimise the amount of string garbage generated.

So, I vote to deprecate/remove the "\n" representation and prefer the Array representation.

The only limitation is going to be if people start using it everywhere, e.g.:

{"content-type" => ["text/html"]}

There are different solutions to this problem: e.g., we could define that only certain headers can work like this (notably set-cookie would make sense), or we could mark certain headers as NOT being allowed to be an array (e.g. content-length).

(Epic #1593)
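To make the difference concrete, here is a minimal, hypothetical sketch (the helper name is made up; it is not from Rack or any particular server) of how a server would write out both representations as individual header lines:

require 'stringio'

# Hypothetical helper: write a response header whose value is either a
# "\n"-joined String (the current convention) or an Array (the proposed
# representation) as individual header lines.
def write_header_lines(io, name, value)
  case value
  when Array
    value.each { |v| io << "#{name}: #{v}\r\n" }
  when String
    value.split("\n").each { |v| io << "#{name}: #{v}\r\n" }
  end
end

io = StringIO.new
write_header_lines(io, "set-cookie", "foo=x\nbar=y")     # current "\n" convention
write_header_lines(io, "set-cookie", ["foo=x", "bar=y"]) # proposed Array form
io.string.lines
# => ["set-cookie: foo=x\r\n", "set-cookie: bar=y\r\n",
#     "set-cookie: foo=x\r\n", "set-cookie: bar=y\r\n"]

In the Array case the server never has to scan the value for a delimiter; in the String case it must split every value just in case.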

@boazsegev

boazsegev commented Feb 12, 2020

I vote to deprecate/remove the "\n" representation and prefer the Array representation.

What an amazing coincidence, I almost posted an issue about this a few weeks ago.

However, after writing it down and adding benchmarks to show why it's faster (by ~40% in simple cases), I realized that servers will have to support both variants (the \n and the Array value) due to backwards compatibility.

In the end, the positive change in the SPECs might not translate to a positive change in practice.

For me (iodine), the \n approach promises cache locality when parsing the response. However, given that the application has to concatenate strings, which requires memory reallocation and data copying, switching to an Array is much better.

The only limitation is going to be if people start using it everywhere

This is the same as the current situation with \n, which can be used anywhere (including invalid places such as content-length).

I think this part of header "correctness" could be left to common sense.

@ioquatix
Member Author

ioquatix commented Feb 12, 2020

Most headers can be split by comma. The only one which does not follow this rule that I'm aware of is set-cookie. In Protocol::HTTP::Headers this is encoded specifically in the representation. I followed this model from the Mozilla implementation: https://github.com/socketry/protocol-http/blob/master/lib/protocol/http/headers.rb#L159-L161

One way to deal with this is to simply allow set-cookie to be the only array, i.e. if defined it must be an array of zero or more values which get written out as individual headers.

Even though it's kind of ugly to specify custom behaviour, at least it would make it:

  • Easy for servers to do the right thing
  • Avoid supporting "\n" in headers where it doesn't make sense (which is every other header AFAIK).
  • Predictable: user code can just do the right thing, treating general headers and set-cookie differently.
  • In terms of compatibility, most code is using Rack::Response or the Utils to add cookies to the response headers anyway. So I don't think this would be a source of major problems.

In fact, considering HTTP headers as general metadata is generally a mistake, in my experience. There are too many things which have specific semantic behaviour which requires specific things to handle.

@boazsegev

Most headers can be split by comma.

They may, but they don't have to.

Servers (and clients) are both allowed to send headers on multiple lines rather than as comma-separated values, e.g. transfer-encoding, which may appear more than once and may include chunked, gzip, or both.

I don't think Rack should dictate this decision.

There are too many things which have specific semantic behaviour which requires specific things to handle.

Yes, there are, and every day new standards or proposals might pop up.

The wonderful thing about specifying a unified behavior for all HTTP headers is that the specification isn't fragile - it will still apply to any future change or update.

By leaving it possible for developers and servers to make these "mistakes" (e.g., multiple content-length headers), we are actually keeping Rack more resilient to changes.

For example, what if multiple content-length headers became a way to indicate multipart messages and their sections (it's easier and safer than writing MIME boundaries)...?

Easy for servers to do the right thing

IMHO, I don't think we need to fool-proof everything. The Rack specification isn't a server or a framework. It doesn't need to require response validation.

Some servers could offer security (at the cost of performance) while testing response validity, others could allow mistakes to pass through.

@ioquatix
Member Author

Even puma stuffed this up in the latest release. So, I think we should consider what we do here.

https://twitter.com/nateberkopec/status/1233172403389259782

I don't think Rack should dictate this decision.

Actually, the generic representation of headers is an array of key, value pairs. Rack, by using a hash, is already imposing a data structure which can't represent all possible sets of headers and retain the original order. In order to allow for multiple values, Rack specifies that value strings can contain "\n". But this only really works in the case of set-cookie. If you wanted to implement a proxy, you actually need to correctly handle headers as an array, and correctly deal with hop headers, etc. So, Rack is already making a big assumption about headers, but the implementation is kind of confusing and prone to breakage, as shown by Puma's implementation above.
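As an illustration of that point (plain Ruby, not Rack or falcon code), the fully general representation is just an ordered list of name/value pairs, and a Hash cannot round-trip it:

# The most general representation of HTTP headers: an ordered list of
# [name, value] pairs. It preserves order and duplicate field names,
# which a transparent proxy needs and a plain Hash cannot express.
headers = [
  ["content-type", "text/html"],
  ["set-cookie", "foo=x"],
  ["set-cookie", "bar=y"],   # duplicate field name, order preserved
]

# Collapsing into a Hash silently drops one of the duplicates:
headers.to_h # => {"content-type"=>"text/html", "set-cookie"=>"bar=y"}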

@boazsegev

@ioquatix ,

It is clear to me that adding array support could result in an annoying mix between new and old behavior, such as:

{'set-cookie' => [ "cookie1=foo\ncookie2=bar", "cookie3=baz"]}

I believe that your idea is wise, but I also believe it's impractical unless we make breaking changes that prevent the old behavior as a whole.

Even puma stuffed this up in the latest release...

The fact that the Puma "patch" was, IMHO, ill-considered and may have ignored the Rack specification does not make the specification "bad", nor does it mean that other servers now need to add support for more variations.

@boazsegev

P.S.

I understand that Header values are more comfortable to find when stored in a Hash (incoming headers).

However, I wonder if header output should be editable after being set. Does anyone "fix" their headers after setting them?

Changing the Header response to an Array of Arrays might both improve performance and solve this design issue.

I know, it's a backwards compatibility breaking change, but I think breaking backwards compatibility is the only way to update this part of the specification without adding complexity.

🤔

@ioquatix
Member Author

Array of Arrays

This is the model I use in protocol-http and it's good.

@jeremyevans
Contributor

However, I wonder if header output should be editable after being set. Does anyone "fix" their headers after setting them?

In terms of middleware accessing or modifying response headers, yes, it is very common. For some examples, look at middleware that ships with Rack.

Changing the Header response to an Array of Arrays might both improve performance and solve this design issue.

This turns accessing or modifying response headers from an O(1) into an O(N) operation. Granted, N in this case is not usually that large.

For the types of operations usually performed on response headers by middleware, using a hash would generally result in simpler code than using an array of arrays. Many middleware assume a hash-like interface for response headers already and would break with a switch to array of arrays.
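A small sketch of the ergonomic difference being described here (hypothetical middleware-style code, not from Rack):

# With a Hash, a middleware can read or overwrite a header in O(1):
headers = { "content-type" => "text/html", "cache-control" => "no-cache" }
headers["cache-control"] = "max-age=60"

# With an array of [name, value] pairs, the same edit is an O(N) scan,
# and the middleware must decide how to handle duplicates itself:
pairs = [["content-type", "text/html"], ["cache-control", "no-cache"]]
if (entry = pairs.find { |name, _| name == "cache-control" })
  entry[1] = "max-age=60"
else
  pairs << ["cache-control", "max-age=60"]
end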

@boazsegev

using a hash would generally result in simpler code than using an array of arrays.

I agree with this reasoning 100% - development and maintenance ease is super important. There's so much to do. Keeping developers happy and development time short is a very strong argument.

In terms of middleware accessing or modifying response headers, yes, it is very common. For some examples, look at middleware that ships with Rack.

You're probably right 😅, though I wish you weren't 🙈

Funny enough, I looked at some of these examples when I was designing iodine. I think many of these concerns are server concerns that could (IMHO, should) be shifted to servers, e.g., Content-Length validation, date headers, chunked encoding, etc.

Not that it matters, as developers require these additional layers often enough.

@ioquatix
Member Author

ioquatix commented Feb 28, 2020

I'm very strongly of the opinion that content-length should not be modified or set by user code. As we saw, it breaks stuff. That's been my default position in falcon since day one, and content-length is ignored by falcon where possible, as are transfer-encoding, connection, etc., which also should not be set by user code. These are protocol level concerns.

All hop headers are ignored: https://github.com/socketry/falcon/blob/master/lib/falcon/adapters/response.rb#L61-L65

			HOP_HEADERS = [
				'connection',
				'keep-alive',
				'public',
				'proxy-authenticate',
				'transfer-encoding',
				'upgrade',
			]

content-length is ignored unless it can't be reliably calculated: https://github.com/socketry/falcon/blob/master/lib/falcon/adapters/output.rb#L34-L53

Falcon will also fail when the content-length and the content don't match up: https://github.com/socketry/protocol-http1/blob/19f55fb5d8bde238e8df7c9f302e86595cdb9ea8/lib/protocol/http1/connection.rb#L304-L306

@boazsegev

Sorry, I think I got us all sidetracked.

@ioquatix , @jeremyevans , do you agree with my previous assessment that this suggestion only adds complexity (as we will still need to support the existing approach)?

If so, then unless we forgo backwards compatibility, perhaps we could close this issue?

@jeremyevans
Contributor

Well, this definitely adds internal complexity. It can make some things simpler for the user, such as supporting multiple values for the same cookie. However, some things could be more complex for the user, since all middleware that needs to operate on header values would need to handle both the string case and the array case. Weighing both sides, I'm against allowing response header values to be arrays.

@ioquatix
Member Author

I think allowing set-cookie to be an array makes total sense. It's the only header where it makes sense to have multiple values in terms of how Rack implements response headers. If we define only that one header as an array, our internal implementation gets a bit faster and cleaner, and we also ensure that servers are efficient because no other header can contain "\n" and thus we don't need to process every response header expecting that it might contain multiple values.

@jeremyevans
Contributor

It may be simpler to make the spec be that only set-cookie can contain "\n", and all other headers cannot contain it. That simplifies response processing in servers to roughly the same degree, and also keeps the spec simpler by using a consistent data type. However, I have not read the RFCs enough to know if only set-cookie can show up multiple times. Unless only set-cookie is allowed to show up multiple times, I think it makes sense to keep the current behavior of allowing it for all headers.

It smells kind of funny to make SPEC different for specific response header keys, but considering we treat some request header names differently in SPEC, it isn't too much of a stretch.
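A sketch of the server-side simplification this variant would allow (hypothetical code, not taken from any server):

# If SPEC said only set-cookie may contain "\n", every other header value
# could be written through without scanning it for the delimiter.
def each_header_line(name, value)
  if name == "set-cookie"
    value.split("\n").each { |v| yield(name, v) }
  else
    yield(name, value)
  end
end

each_header_line("set-cookie", "foo=x\nbar=y") { |n, v| puts "#{n}: #{v}" }
each_header_line("content-type", "text/html") { |n, v| puts "#{n}: #{v}" }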

@ioquatix
Member Author

I would be okay with that but internally we already use an array so it still seems inefficient.

@jeremyevans
Contributor

I wasn't aware Rack used an array internally for set-cookie. Can you point me to where that happens? That may cause me to reconsider.

@ioquatix
Member Author

I haven't fully worked through it, but you can see here:

rack/lib/rack/utils.rb, lines 248 to 257 in 838ce3a:

case header
when nil, ''
  cookie
when String
  [header, cookie].join("\n")
when Array
  (header + [cookie]).join("\n")
else
  raise ArgumentError, "Unrecognized cookie header value. Expected String, Array, or nil, got #{header.inspect}"
end

header can be an array.
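For illustration, here is a standalone restatement of that case expression (my own snippet, not Rack's public API), showing what it returns for each input shape:

add_cookie = lambda do |header, cookie|
  case header
  when nil, '' then cookie
  when String  then [header, cookie].join("\n")
  when Array   then (header + [cookie]).join("\n")
  else raise ArgumentError, "Unrecognized cookie header value: #{header.inspect}"
  end
end

add_cookie.call(nil, "a=1")              # => "a=1"
add_cookie.call("a=1", "b=2")            # => "a=1\nb=2"
add_cookie.call(["a=1", "b=2"], "c=3")   # => "a=1\nb=2\nc=3"

Note that even when given an Array, it returns a "\n"-joined String, which is what keeps the result SPEC-compliant.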

HeaderHash internally allows values to be an array: https://github.com/rack/rack/blob/master/lib/rack/utils.rb#L440-L444

I can dig into it further if you need me to.

@jeremyevans
Contributor

Thanks, that is helpful. It seems both of those areas assume that callers may be violating SPEC. In both cases, the code may work with existing array values, but they generate string values and not arrays in order to comply with SPEC. So it looks like they are both following Postel's Law. It doesn't look like there is much if any inefficiency in these two cases.

I think I'm fine keeping that behavior in Utils for compatibility, but I still think we should not modify SPEC to allow response header values to be arrays.

@ioquatix
Member Author

HeaderHash does it as an optimisation to avoid generating a lot of garbage.

@boazsegev

internally we already use an array so it still seems inefficient.

IMHO, I would suggest that the internal workings of the Rack framework / middleware should not affect the Rack specification (server-side).

no other header can contain "\n"

This isn't correct.

According to the HTTP specification (section 4.2):

Multiple message-header fields with the same field-name MAY be present in a message if and only if the entire field-value for that header field is defined as a comma-separated list [i.e., #(values)]. It MUST be possible to combine the multiple header fields into one "field-name: field-value" pair, without changing the semantics of the message, by appending each subsequent field-value to the first, each separated by a comma. The order in which header fields with the same field-name are received is therefore significant to the interpretation of the combined field value, and thus a proxy MUST NOT change the order of these field values when a message is forwarded.

There are a number of headers that allow multiple values. HTTP allows all of these headers to be written using multiple (repeated) headers.

For example the transfer-encoding header may be written as:

transfer-encoding: gzip
transfer-encoding: chunked

The link header is another good example of a header that may be sent more than once.

link: styles.example.com
link: js.example.com

There was a similar discussion in the node community here, where the x-robots-tag header was discussed.

@ioquatix
Member Author

It MUST be possible to combine the multiple header fields into one "field-name: field-value" pair, without changing the semantics of the message, by appending each subsequent field-value to the first, each separated by a comma

set-cookie is the only header that I'm aware of that is an exception to this rule. Therefore, it's the only header where a comma is not a sufficient separator, and thus "\n" was chosen.
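For a concrete example of why set-cookie is the exception (a plain Ruby illustration, not from any library):

# Two cookies joined by a comma become ambiguous, because the Expires
# attribute's HTTP-date already contains a comma.
cookies = [
  "foo=x; Expires=Wed, 21 Oct 2026 07:28:00 GMT",
  "bar=y",
]
cookies.join(", ")
# => "foo=x; Expires=Wed, 21 Oct 2026 07:28:00 GMT, bar=y"
# A recipient cannot reliably tell where the first cookie ends.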

@boazsegev

@ioquatix , I apologize. I think I wasn't clear about what I meant to communicate, which may have caused us to go in a circle.

Yes, I agree with you that other headers could (probably) be unified to a single line.

However, right now, applications have a choice - applications are allowed to indicate to the server that they wish to send multiple values either as a comma-separated list or as separate headers.

Right now, servers also have a choice. They can join the \n-separated values with commas, or they can respect the application's preference and emit separate headers.

However, I believe that this proposed change will break existing code for applications that chose to use \n instead of commas in headers such as Link - which is currently a valid choice.

I also believe that the Rack specification should allow developers to make these choices.

@matthewd
Contributor

I also believe that the Rack specification should allow developers to make these choices.

+1, FWIW. To me Rack-the-spec is, hand-wavily, a fairly direct Ruby encoding of the HTTP protocol.. and it feels a bit too low-level & abstract to be forcing this sort of opinion ("there are two different ways to send this, but the spec says they're the same, so you may only do this one") onto applications.

The linked discussion around x-robots-tag seems to support the idea that while set-cookie might be the only header allowed to violate the "multiple headers == comma-separated values" rule, things are less clear-cut in the world of non-standard headers. Preventing an application from generating a valid HTTP response, by design, because the RFC says a different form is semantically equivalent... seems to stretch the bounds of reasonability.

@ioquatix
Member Author

The implications of the current implementation are that we:

  1. Produce a lot of string garbage when dealing with headers like set-cookie and,
  2. Servers must split all header values by "\n" even when it's irrelevant in practice, which is a non-trivial O(N) operation (where N is the byte size of the response headers), and thus inefficient.

It's also sufficiently complicated that Puma recently got the implementation wrong when trying to fix the HTTP/1 header vulnerability.

fairly direct Ruby encoding of the HTTP protocol..

At a high level, maybe, but the reality is a hash isn't sufficient to represent the headers. If you are creating a proxy, a hash will not allow you to transparently proxy the request.

I draw your attention to https://tools.ietf.org/html/rfc7230#section-3.2.2 which states:

A sender MUST NOT generate multiple header fields with the same field
name in a message unless either the entire field value for that
header field is defined as a comma-separated list [i.e., #(values)]
or the header field is a well-known exception (as noted below).

A recipient MAY combine multiple header fields with the same field
name into one "field-name: field-value" pair, without changing the
semantics of the message, by appending each subsequent field value to
the combined field value in order, separated by a comma. The order
in which header fields with the same field name are received is
therefore significant to the interpretation of the combined field
value; a proxy MUST NOT change the order of these field values when
forwarding a message.

  Note: In practice, the "Set-Cookie" header field ([RFC6265]) often
  appears multiple times in a response message and does not use the
  list syntax, violating the above requirements on multiple header
  fields with the same name.  Since it cannot be combined into a
  single field-value, recipients ought to handle "Set-Cookie" as a
  special case while processing header fields.  (See Appendix A.2.3
  of [Kri2001] for details.)

So what maps directly to the underlying protocol is:

  1. A header is a string.
  2. A header is an ordered list of strings, which are sent as individual values.

Cases in (1) don't benefit from newline encoding, semantically there is no need, and (2) needs some more advanced encoding. So we impose the cost of a badly specified set-cookie header (which is the only one specifically called out in the RFC) on every other header. It seems like a bad choice to me personally, if you care about efficient server implementation, both memory and execution time.

@boazsegev

Servers must split all header values by "\n" even if it's irrelevant in practice

I agree.

But this will also be the situation in the future - unless we are willing to introduce breaking changes.

Without breaking changes, servers will always need to test strings (even strings nested within arrays) for a hidden \n delimiter.

In fact, for security reasons, servers might always need to test for \n anyway.

@ioquatix
Member Author

ioquatix commented Feb 29, 2020

Header values can’t contain control characters according to the RFCs. So putting newlines should be invalid, although HTTP/2 HPACK defines header values as a series of octets so it at least won’t break the underlying protocol.

So, in a way, looking through the lens of HTTP/2, any string should be a valid value, even one containing newlines.

In fact, for security reasons, servers might always need to test for \n anyway.

That's the bug which was discussed earlier, but it only affects HTTP/1 in practice.
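A minimal sketch of the kind of check being referred to (hypothetical code; servers differ in whether and where they enforce it):

# Reject header values containing CR or LF to prevent HTTP/1 response
# splitting. Under the current Rack convention this would apply to each
# "\n"-separated line rather than to the joined value.
def assert_no_header_injection!(value)
  if value.match?(/[\r\n]/)
    raise ArgumentError, "header value contains CR/LF: #{value.inspect}"
  end
end

assert_no_header_injection!("text/html")          # passes
# assert_no_header_injection!("evil\r\ninjected") # would raise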

unless we are willing to introduce breaking changes.

Yes, the proposal here is to:

  • Allow header values to be an array.
  • Change HeaderHash to avoid calling .join("\n").

which is a breaking change. I want to actually test if common HTTP/2 servers allow octets or reject them.

@ioquatix
Member Author

ioquatix commented Mar 1, 2020

So, I threw together a quick test using HTTP/2:


require 'async'
require 'async/http/internet'

Async do
	internet = Async::HTTP::Internet.new
	
	(0..31).each do |i|
		Async.logger.info("Trying character code #{i}...")
		
		begin
			response = internet.get("https://www.google.com/search?q=kittens", [
				["accept", "text/html"],
				["accept", "bar#{i.chr}baz"]
			])
			
			response.finish
		
			unless response.success?
				Async.logger.warn(response) {"Got #{response.status} from character index #{i}"}
			end
		rescue
			Async.logger.error(self, $!)
		end
	end
ensure
	internet&.close
end

This tries character codes 0..31 in a header value. Google only rejected \r (13) and \n (10). All other values in the header were accepted. In addition, \r and \n caused stream errors which is a low level error, not a failed request (e.g. status code 4xx or 5xx).

That being said, falcon headers via HTTP/2 are binary safe, and it is okay to send a header containing \r or \n via HTTP/2, even if it could be considered against the spec (e.g. the HTTP/1 RFC).

a fairly direct Ruby encoding of the HTTP protocol..

So, coming back to this point, headers according to HTTP/1 don't allow "\n" so it was a suitable delimiter, but in HTTP/2, it is not in violation of the spec, since HPACK explicitly states it encodes octets. Therefore, I think our usage of it, while it might have been okay at one point, is not fully conforming to the features of HTTP/2 - however in practice it might still be okay.

@boazsegev

I think our usage of it, while it might have been okay at one point, is not fully conforming to the features of HTTP/2 - however in practice it might still be okay

I think that in practice, it should still be okay... also, I suspect '\n' and '\r' shouldn't be allowed even in HTTP/2.

Although the HTTP/2 transport encoding allows any octet to be securely encoded, IMHO, the octet restrictions still apply, as they are inherited from HTTP/1 according to these paragraphs in the HTTP/2 RFC 7540:

HTTP/2 is intended to be as compatible as possible with current uses of HTTP. This means that, from the application perspective, the features of the protocol are largely unchanged. To achieve this, all request and response semantics are preserved, although the syntax of conveying those semantics has changed.

Thus, the specification and requirements of HTTP/1.1 Semantics and Content [RFC7231], Conditional Requests [RFC7232], Range Requests [RFC7233], Caching [RFC7234], and Authentication [RFC7235] are applicable to HTTP/2. Selected portions of HTTP/1.1 Message Syntax and Routing [RFC7230], such as the HTTP and HTTPS URI schemes, are also applicable in HTTP/2, but the expression of those semantics for this protocol are defined in the sections below.

Also, AFAIK, HTTP/2 requests and responses should be "translatable" back to HTTP/1, which means that a request containing octets that prevent it from being translated to HTTP/1 is malformed.

Having said that, you raised a valid point that future versions of the Rack specification might want to switch to a different approach (one that might break existing middleware) 👍

@ioquatix
Member Author

ioquatix commented Mar 1, 2020

All good points.

The HPACK spec explicitly allows octets. The HTTP/2 RFC, as you state, suggests the semantics should be compatible with HTTP/1. So, yes, I don't think you are wrong, but I also don't think I'm wrong.

That being said, encoding headers using "\n" to separate fields is still inefficient and superfluous in almost all headers which support normal comma concatenation. So, I still think that we should change this behaviour.

Do you know how Node.JS handles this?

@boazsegev

Do you know how Node.JS handles this?

The node.js API has a response object that offers the setHeader member function, which accepts header name-value pairs where the value can be an array (if setting multiple values)...

But node.js doesn't have to deal with backwards compatibility.

Side-Note 🤔:

It might not be a bad idea to transition to a node.js compatible approach. This could allow some code porting and trans-compilations to leverage existing code in Ruby (or allow Ruby developers to code node.js applications using Ruby).

encoding headers using "\n" to separate fields is still inefficient and superfluous

I never disagreed. I think the Array is the better solution (it's what I use in the facil.io C framework)... my only concern is the end-result in a world that expects backwards compatibility.

@ioquatix
Member Author

ioquatix commented Mar 2, 2020

Well, according to the spec, the response headers should only respond to #each. So any middleware which is adjusting it must make an entirely new set of headers, or assume it's a hash. That being said, most middleware is using Rack::Utils::HeaderHash which hides the implementation details of the response headers. So, I don't think we'd be breaking backwards compatibility much, since users either need to construct a hash themselves or use a HeaderHash. We should probably consider promoting Rack::Utils::HeaderHash as the optimal way to modify/set/add headers. If we assume that's already the case in 2.x, I don't see it as much of a compatibility issue for 3.x.
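A sketch of what "only respond to #each" implies in practice (hypothetical middleware, not shipped with Rack):

# The downstream headers object is only guaranteed to respond to #each,
# so we materialize a Hash before modifying it (Rack::Utils::HeaderHash
# would serve the same purpose).
class AddXFrameOptions
  def initialize(app)
    @app = app
  end

  def call(env)
    status, headers, body = @app.call(env)

    hash = {}
    headers.each { |name, value| hash[name] = value }
    hash["x-frame-options"] = "SAMEORIGIN"

    [status, hash, body]
  end
end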

@boazsegev

users either need to construct a hash themselves or use a HeaderHash. We should probably consider promoting Rack::Utils::HeaderHash as the optimal way to modify/set/add headers.

As a server implementor, I would much prefer if the Rack specification would not require me to use the Rack gem.

Sure, applications might have Rack as a dependency - however, the server is currently blind to the Rack gem. The server only concerns itself with the details of the specification.

@ioquatix
Member Author

Thanks everyone for the great discussion. I'm going to have a go at implementing this as I think it will be a major advantage for server performance and simplicity. I agree we don't want users to depend on any specific implementation of Rack for this functionality.

@ioquatix
Member Author

ioquatix commented Mar 2, 2022

Okay, it was merged.

@ioquatix ioquatix closed this as completed Mar 2, 2022