Should we support looser header name validation? #113

Open
tomchristie opened this issue Oct 16, 2020 · 10 comments

Comments

@tomchristie
Contributor

tomchristie commented Oct 16, 2020

Closely related to #97.

Prompted by encode/httpx#1363 (comment)

So, h11 currently has stricter-than-urllib3 rules on header name validation...

>>> import httpx
>>> httpx.get("https://club.huawei.com/")
Traceback (most recent call last):
...
httpx.RemoteProtocolError: malformed data

Which is occurring because the response looks like this...

HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
Content-Security-Policy: base-uri
Content-Type: text/html; charset=utf-8
Date: Thu, 15 Oct 2020 13:19:33 GMT
Server: CloudWAF
Set-Cookie: HWWAFSESID=a74181602debc465809; path=/
Set-Cookie: HWWAFSESTIME=1602767969615; path=/
Set-Cookie: a3ps_2132_saltkey=yCXrVqdR06Nk5u2PrmLgs9eqlGIpQd9FogV2GL6bxGP3HH2XweRXIeCVny%2BrVDpoOYNLphTU9uVN1HP1%2Fav1bvV2Yrafq%2BXdJR%2BVAVPHizU92ISGAest0dKt7%2FIbdulNYXV0aGtleQ%3D%3D; path=/; secure; httponly
Set-Cookie: a3ps_2132_errorinfo=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; Max-Age=0; path=/; secure; httponly
Set-Cookie: a3ps_2132_errorcode=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; Max-Age=0; path=/; secure; httponly
Set-Cookie: a3ps_2132_auth=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; Max-Age=0; path=/; secure; httponly
Set-Cookie: a3ps_2132_lastvisit=1602764373; expires=Sat, 14-Nov-2020 13:19:33 GMT; Max-Age=2592000; path=/; secure; httponly
Set-Cookie: a3ps_2132_lastact=1602767973%09portal.php%09; expires=Fri, 16-Oct-2020 13:19:33 GMT; Max-Age=86400; path=/; secure; httponly
Set-Cookie: a3ps_2132_currentHwLoginUrl=http%3A%2F%2Fcn.club.vmall.com%2F; expires=Thu, 15-Oct-2020 15:19:33 GMT; Max-Age=7200; path=/; secure; httponly
Transfer-Encoding: chunked
X-XSS-Protection: 1; mode=block
banlist-ip: 0
banlist-uri: 0
get-ban-to-cache-result/portal.php: userdata not support
get-ban-to-cache-result62.31.28.214: userdata not support
result-ip: 0
result-uri: 0

That's not all that unexpected, since it's simply due to h11 being a wonderfully thoroughly engineered package that does a great job of following the relevant specs. However, we might(?) want to follow a path of being as-lax-as-possible-if-still-parsable with whatever comes in from the wire, while keeping the constraint that our outputs are always spec-compliant. (?)
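
For concreteness, here's a minimal standalone check of the RFC 7230 token rule (just an illustration, not h11's actual code), showing which of the header names above trips it:

# field-name = token, where tchar is "!#$%&'*+-.^_`|~" plus DIGIT / ALPHA
# (RFC 7230, section 3.2.6). "/" is not a tchar, so the WAF's header fails.
import re

TOKEN = re.compile(r"[-!#$%&'*+.^_`|~0-9A-Za-z]+")

for name in ("Content-Type", "get-ban-to-cache-result/portal.php"):
    print(name, "->", "valid" if TOKEN.fullmatch(name) else "invalid")
# Content-Type -> valid
# get-ban-to-cache-result/portal.php -> invalid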

In practice, if httpx is failing to parse responses like this, then at least some subset of users are going to see behaviour that from their perspective is a regression vs. other HTTP tooling.

What are our thoughts here?

@sigmavirus24

So in general, the maxim is "Be liberal in what you accept and conservative in what you send" (or something close to that).

I definitely think there's value in not raising an exception here; that said, these should probably be "quarantined", for lack of a better turn of phrase. I think urllib3 might drop these on the floor (or has a few cases where that happens by virtue of using the standard library's http client), and that's also surprising. A way to signal to users "Hey, these are ... weird, maybe be careful with them" would probably be valuable.
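
(Just to sketch what "quarantined" might mean in practice; purely hypothetical, nothing like this exists in h11 today: keep spec-valid headers where they are, and expose the dubious ones separately so callers have to go looking for them.)

from dataclasses import dataclass, field

@dataclass
class ParsedHeaders:
    # names that match the RFC 7230 token rule, exposed as usual
    valid: list[tuple[bytes, bytes]] = field(default_factory=list)
    # syntactically dubious headers, kept but clearly set apart
    quarantined: list[tuple[bytes, bytes]] = field(default_factory=list)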

@njsmith
Member

njsmith commented Dec 10, 2020

So I think the issue you're pointing out is the header named get-ban-to-cache-result/portal.php, which is illegal because / is not permitted in header names?

It looks like urllib3 silently allows this through:

In [7]: [h for h in urllib3.PoolManager().request("GET", "https://club.huawei.com").headers if "/" in h]          
Out[7]: ['get-ban-to-cache-result/portal.php']

It would be nice to know what curl does here, but in a quick check curl was giving me 405 Not Allowed (even if I spoof the User-Agent), so idk what's up with that. Go's http code is also a good reference, but I haven't checked how it handles this.

In Firefox's network debugging tab, it seems to show this header as being silently discarded:

[screenshot: Firefox's network panel, where the invalid header does not appear]

If that's what browsers do, then that's a pretty strong argument that we should do it as well. I don't know if there's a WHAT-WG spec for how headers handle this... does anyone else?

@sigmavirus24

package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	resp, err := http.Get("https://club.huawei.com")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Check whether the non-spec-compliant header name survived Go's parsing.
	headerName := "get-ban-to-cache-result/portal.php"
	for key := range resp.Header {
		if key == headerName {
			fmt.Printf("resp.Header.Get(%q) = %q\n", headerName, resp.Header.Get(headerName))
		}
	}
}

When run produces

resp.Header.Get("get-ban-to-cache-result/portal.php") = "userdata not support"

So it looks like Go allows you to receive that header leniently.

@njsmith
Member

njsmith commented Dec 11, 2020

Digging through the net/http source, it looks like Go does check that outgoing header names match the RFCs, but AFAICT it doesn't do any validation whatsoever on incoming header names. In response.go, ReadResponse seems to just call the textproto module's ReadMIMEHeader, and that silently skips over lines with empty header names (!), and calls canonicalMIMEHeaderKey, and the latter explicitly passes through invalid header keys unchanged (!).

I also poked around in the WHATWG fetch spec, and it doesn't seem to have any details at all on header parsing yet. This is the relevant subsection: https://fetch.spec.whatwg.org/#http-network-fetch

But it just says:

Follow the relevant requirements from HTTP.

And

Note: The exact layering between Fetch and HTTP still needs to be sorted through and therefore response represents both a response and an HTTP response here.

@njsmith
Member

njsmith commented Dec 13, 2020

So I guess there are two axes that we have to make a decision on: first, which characters we're going to handle specially. For example, we clearly need to be more tolerant of /, somehow, but maybe / and \x00 should be treated differently. Also, there are some invalid header names that don't involve bad characters, like the empty string or header names with trailing whitespace. (RFC 7230 explicitly says that servers MUST hard-error on headers with trailing whitespace, proxies MUST hard-error or strip the whitespace, and clients can do whatever.)

And second, for each bad header name, we have a menu of choices for how to handle it. The ones I can think of:

  • Hard error (like we do now)
    • Downside: insufficiently compatible with the real world, at least for some characters.
  • Allow the header through unchanged (like urllib3/curl seem to do)
    • Downside: might let through "bad stuff" that user code isn't prepared to handle. E.g., allowing \x00 to appear in header names seems like a very bad idea.
  • Silently discard the header (like browsers seem to do)
    • Downside: given that HTTP libraries like urllib3/curl allow these headers through, maybe there are users who actually rely on getting them?
  • Discard the header by default, but offer some opt-in mode to request access to the "bad" headers
    • Downside: usually the user who hits the problem is like 3 layers of abstraction removed from h11, so passing through opt-in options is not easy.

There are also concerns about "request smuggling". When different HTTP implementations handle edge cases differently, then you can end up in a situation where e.g. your firewall interprets your request as harmless and passes it on to your backend, but then the backend interprets it as something harmful. (This is apparently why RFC 7230 is so worried about trailing whitespace -- I don't know the details, but my guess is that some implementations used to strip the whitespace and other implementations treated it as part of the header name.)

But here's something promising: it looks like nodejs's http parser unconditionally discards invalid header names. (Except that they have a "non-strict" mode that allows embedded spaces in header names, because of something involving how IIS 6 used to work. But hopefully we don't have to care about that any more.) So that's evidence that there might not be much demand for seeing these invalid header names.

One possibility: continue to hard-error on trailing whitespace, but for the other invalid cases silently discard the header.

Another option: only do that in client mode; in server mode continue to hard-error on all invalid headers.
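
To make that concrete, here's a rough standalone sketch of that combined policy (made-up function and parameter names, not h11 code):

import re

# RFC 7230 token: the strict rule h11 currently enforces for field names
TOKEN = re.compile(rb"[-!#$%&'*+.^_`|~0-9A-Za-z]+")

def filter_received_headers(headers, role="client"):
    kept = []
    for name, value in headers:
        if name != name.rstrip(b" \t"):
            # trailing whitespace stays a hard error in every role
            raise ValueError("header name with trailing whitespace: %r" % name)
        if TOKEN.fullmatch(name):
            kept.append((name, value))
        elif role != "client":
            # server mode keeps hard-erroring on any other invalid name
            raise ValueError("illegal header name: %r" % name)
        # client mode: silently drop the header, like browsers appear to do
    return kept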

@njsmith
Member

njsmith commented Dec 13, 2020

BTW if anyone wants to play with nodejs's http parser, there's a python wrapper here: https://github.com/pallas/pyllhttp

@memst
Contributor

memst commented Sep 14, 2021

I would urge making a decision and going down the path of urllib3 and other libraries that pass along parsable headers even if they don't precisely follow RFC 7230. Often users can't control what response headers the server is sending, but they would still like to process the data. The choice to hard-error is currently made on the basis of safety, but people are now using a workaround and directly overwriting h11._readers.header_field_re, which exposes them to more threats because that regex won't be maintained.

I think that discarding invalid headers is worse than passing them through. It still creates the same problem for users who have to access that part of the request. Even an opt-in option is likely to be inaccessible to the end user who is utilising h11 through other libraries, which might not implement the option.

I think the decision should be made soon. It's a bad idea to start bypassing security features by modifying the library's internal variables, but the current state leaves users with no other choice.

@memst
Contributor

memst commented Sep 17, 2021

For reference, Python's http.client parses all incoming headers according to RFC 2822.
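
A quick illustration of how lenient that is (using the semi-private http.client.parse_headers helper, so treat it as a throwaway demo):

import io
import http.client

raw = b"get-ban-to-cache-result/portal.php: userdata not support\r\n\r\n"
headers = http.client.parse_headers(io.BytesIO(raw))
print(list(headers.keys()))
# ['get-ban-to-cache-result/portal.php']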

@Hultner

Hultner commented Nov 24, 2021

This is quite the hurdle for me at the moment. I played around with patching the pattern in h11._readers.header_field_re as @memst "suggested", but it just caused more problems. I'm considering patching h11._abnf.token instead to include "?", which is the character a bad server is sending me in one of the headers.

I feel a bit uneasy about doing this but I can't control what the server is sending to me.

Edit: I tried monkey-patching the abnf token value just before sending a request to the misbehaving server, but that didn't take effect (I guess it's too late by then, since the regexes have presumably already been compiled). It did take effect when I changed the token pattern in the library code itself, but that makes me even more uneasy.

@mnot

mnot commented Jul 31, 2023

HTTP defines the syntax of field names as token because that was a convenient syntax to use way back when; it's somewhat arbitrary.

There are characters which are obviously unsafe to allow, like ':', but the comment from @njsmith above gets to the real issue here -- response smuggling. HTTP is designed to allow messages to be handled by multiple implementations, and when those implementations handle messages differently, it can be exploited by attackers.

So, this is a security issue.

And, while the choice of allowable characters for field names is somewhat arbitrary, it's important: implementations need to align on it. Aligning on the standard is not only the most straightforward thing, it's also the safest, because it's unambiguous, stable, and conservative (being more strict is good here).

So my suggestion would be to follow the RFC but allow explicit loosening, with appropriate warnings about how it can cause security vulnerabilities.
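
Purely as an illustration of that shape (made-up names; neither this flag nor these patterns are h11 API):

import re

STRICT_FIELD_NAME = re.compile(rb"[-!#$%&'*+.^_`|~0-9A-Za-z]+")  # RFC 7230 token
LENIENT_FIELD_NAME = re.compile(rb"[^\x00\s:]+")  # "still parsable": no NUL, whitespace, or ":"

def validate_field_name(name: bytes, lenient: bool = False) -> bytes:
    # the default stays strict; loosening is an explicit, documented opt-in
    pattern = LENIENT_FIELD_NAME if lenient else STRICT_FIELD_NAME
    if pattern.fullmatch(name) is None:
        raise ValueError("illegal header field name: %r" % (name,))
    return name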
