wsgi: Work around CPython bug when parsing non-ASCII headers #574

tipabu · 2019-06-04T18:47:49Z

Under CPython 3, when an out-of-spec client sends non-ascii header names, email.feedparser halts parsing and assumes that the non-ASCII header must be part of a message body. This is despite the fact that http.client's parse_headers (which is called by the server protocol's parse_request) has already determined the boundary between headers and body and has only sent the headers to be parsed. This causes the first such header and all subsequent headers to be silently ignored.

See also: https://bugs.python.org/issue37093

Under CPython 2, httplib would happily parse non-ASCII headers since rfc822 just looks for a colon in the header line. As a result, py2 applications may have been written that not only allowed but even encouraged the use of, say, UTF-8 in user-defined header names and values.

Support such applications in moving to py3 by checking for a payload on the parsed headers; if found, parse it for more headers. A few things worth pointing out about this:

The parsing does not handle line folding, but our code didn't handle this well on py2 either. Abort parsing.
Header lines without a colon will also abort parsing, but this is maybe preferable to py2's behavior where the offending line is interpreted as the separator between headers and body and is silently discarded, and the request is allowed to continue. At least on py3, the body will start after the first blank line rather than part way through the (bad) headers.
Building a WSGI environment normally involves upper-casing the header names, which should be safe due to their case-insensitivity, but it gets more complicated when considering non-ASCII headers:
- While WSGI requires that the header names and values be interpreted as Latin-1 on py3, that isn't necessarily the encoding preferred by the application.
- Even if the application wants Latin-1, upper-casing some Latin-1-encodable code points yields a code point that is not Latin-1-encodable, and so should not be used in a WSGI environment.
So, preserve the existing py2 behavior on py3: Only upper-case a-z

Drive-by: Be more explicit about when we're branching because of py2/3 differences so when we eventually drop support for py2, we can remove the old path with confidence.

Under CPython 3, when an out-of-spec client sends non-ascii header *names*, email.feedparser halts parsing and assumes that the non-ASCII header must be part of a message body. This is despite the fact that http.client's parse_headers (which is called by the server protocol's parse_request) has already determined the boundary between headers and body and has *only* sent the headers to be parsed. This causes the first such header and *all* subsequent headers to be silently ignored. See also: https://bugs.python.org/issue37093 Under CPython 2, httplib would happily parse non-ASCII headers so long as there was a colon in the header line. As a result, py2 applications may have been written that not only allowed but even encouraged the use of UTF-8 in user-defined header names and values. Support such applications in moving to py3 by checking for a payload on the parsed headers; if found, parse it for more headers. A few things worth pointing out about this: - The parsing does not handle line folding, but our code didn't handle this well on py2 either. Abort parsing. - Header lines without a colon will also abort parsing, but this is maybe preferable to py2's behavior where the offending line is interpreted as the separator between headers and body and is silently discarded, and the request is allowed to continue. At least on py3, the body will start after the first blank line rather than part way through the (bad) headers. - Building a WSGI environment normally involves upper-casing the header names, which should be safe due to their case-insensitivity, but it gets more complicated when considering non-ASCII headers: * While WSGI requires that the header names and values be interpreted as Latin-1 on py3, that isn't necessarily the encoding preferred by the application. * Even if the application wants Latin-1, upper-casing some Latin-1-encodable code points yields a code point that is not Latin-1-encodable, and so should not be used in a WSGI environment. So, preserve the existing py2 behavior on py3: Only upper-case 'a'-'z' Drive-by: Be more explicit about when we're branching because of py2/3 differences so when we eventiually drop support for py2, we can remove the old path with confidence.

tipabu · 2019-07-14T05:06:27Z

I'm not entirely sure whether we ought to be the ones doing this. It's not so bad doing it from the application, especially if it's already providing its own HttpProtocol anyway: openstack/swift@76fde892#diff-391ba67a14720da1635b51be427fa134

There's still potential for trouble if headers like Content-Length or Transfer-Encoding get dropped... but maybe we're better off just waiting for python/cpython#13788 or something like it to land upstream?

tipabu · 2019-10-03T20:40:31Z

Hmm. Well,

The issue wasn't as easy to fix in application code as I'd originally hoped, and
poses a security concern because of the request-smuggling opportunities.

😞

I'm not sure what to do with this. In thinking about it some more, I wonder if it'd be tolerable to swap out email.feedparser.headerRE with something more permissive... it'd certainly be a shorter change.

FWIW, it looks like other python http servers (it's been a bit, but i seem to remember checking cherrypy, gunicorn, tornado, twisted...) generally handle their own http parsing rather than leaning on stdlib... but that doesn't seem great, either.

tipabu · 2020-02-12T05:45:34Z

eventlet/wsgi.py

            headers = [h.split(':', 1) for h in headers]
+        else:
+            headers = self.headers._headers
+            payload = self.headers.get_payload()


Oh jeeze -- apparently this may return a str or a list of one or more messages (if Content-Type is message/rfc822).

😞

temoto · 2020-07-30T21:38:13Z

eventlet/wsgi.py

        if ct is None:
+            ct_was_none = True


Suggested change

if ct is None:

ct_was_none = True

ct_was_none = ct is None

if ct_was_none:

temoto · 2020-07-30T21:38:46Z

eventlet/wsgi.py

+        else:
+            ct_was_none = False


Suggested change

else:

ct_was_none = False

temoto · 2020-07-30T23:02:19Z

@tipabu hey, sorry it was left without attention for long.

Current HTTP/1.1 RFC is 7230
RFC7230#3.2 Header Fields
field-name = token
RFC7230#3.2.6 Field Value Components
token = 1*tchar
tchar = (redated symbols like !#$-_) / DIGIT / ALPHA
; any VCHAR, except delimiters
RFC7230#1.2 Syntax Notation

The following core rules are included by reference, as defined in
[RFC5234], Appendix B.1: ALPHA (letters), CR (carriage return), CRLF
(CR LF), CTL (controls), DIGIT (decimal 0-9), DQUOTE (double quote),
HEXDIG (hexadecimal 0-9/A-F/a-f), HTAB (horizontal tab), LF (line
feed), OCTET (any 8-bit sequence of data), SP (space), and VCHAR (any
visible [USASCII] character).
RFC5234#B.1
ALPHA = %x41-5A / %x61-7A ; A-Z / a-z

We have now established that non-ASCII byte in field name makes it invalid header/message. Proper action is to reject such request with HTTP 400 Bad Request.

On questions from your code comments:

Should we support line folding?
No. RFC7230#3.2.4

Historically, HTTP header field values could be extended over
multiple lines by preceding each extra line with at least one space
or horizontal tab (obs-fold). This specification deprecates such
line folding except within the message/http media type (Section 8.3.1).
A sender MUST NOT generate a message that includes
line folding (i.e., that has any field-value that contains a match to
the obs-fold rule) unless the message is intended for packaging
within the message/http media type.
A server that receives an obs-fold in a request message that is not
within a message/http container MUST either reject the message by
sending a 400 (Bad Request), preferably with a representation
explaining that obsolete line folding is unacceptable, or replace
each received obs-fold with one or more SP octets prior to
interpreting the field value or forwarding the message downstream.
Should we 400 a bad header line?
It's the right action, yes. If you want to go extra mile - make EVENTLET_WSGI_NONASCII_HEADER_NAME=strict feature switch with option to ignore invalid field or something else.

IMHO we should not try to reconstruct invalid header from broken parser. email.feedparser was already wrong name for parsing HTTP and now it doesn't work correctly, it's time we go separate roads. :-) Do you know decent HTTP parser in Python?

Love this stuff, keep it coming.

codecov · 2023-12-19T15:04:59Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (5b04426) 46% compared to head (944d50a) 46%.
Report is 117 commits behind head on master.

Additional details and impacted files

@@          Coverage Diff          @@
##           master   #574   +/-   ##
=====================================
- Coverage      46%    46%   -1%     
=====================================
  Files          81     81           
  Lines        7976   7976           
  Branches     1365   1365           
=====================================
- Hits         3724   3723    -1     
- Misses       3996   3997    +1     
  Partials      256    256

Flag	Coverage Δ
ipv6	`15% <ø> (ø)`
py27epolls	`?`
py27poll	`50% <ø> (-1%)`	⬇️
py27selects	`49% <ø> (+<1%)`	⬆️
py34epolls	`42% <ø> (-1%)`	⬇️
py34poll	`42% <ø> (-1%)`	⬇️
py34selects	`42% <ø> (ø)`
py35epolls	`42% <ø> (-1%)`	⬇️
py35poll	`42% <ø> (-1%)`	⬇️
py35selects	`42% <ø> (+<1%)`	⬆️
py36epolls	`42% <ø> (ø)`
py36poll	`42% <ø> (+<1%)`	⬆️
py36selects	`42% <ø> (+<1%)`	⬆️
py37epolls	`42% <ø> (ø)`
py37poll	`42% <ø> (+<1%)`	⬆️
py37selects	`42% <ø> (-1%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

tipabu commented Feb 12, 2020

View reviewed changes

temoto reviewed Jul 30, 2020

View reviewed changes

tipabu mentioned this pull request Dec 13, 2023

Fix security issue related to RFC 9112 #826

Merged

itamarst added important bug labels Dec 19, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

wsgi: Work around CPython bug when parsing non-ASCII headers #574

wsgi: Work around CPython bug when parsing non-ASCII headers #574

tipabu commented Jun 4, 2019

tipabu commented Jul 14, 2019

tipabu commented Oct 3, 2019

tipabu Feb 12, 2020

temoto Jul 30, 2020

temoto Jul 30, 2020

temoto commented Jul 30, 2020

codecov bot commented Dec 19, 2023 •

edited

wsgi: Work around CPython bug when parsing non-ASCII headers #574

Are you sure you want to change the base?

wsgi: Work around CPython bug when parsing non-ASCII headers #574

Conversation

tipabu commented Jun 4, 2019

tipabu commented Jul 14, 2019

tipabu commented Oct 3, 2019

tipabu Feb 12, 2020

Choose a reason for hiding this comment

temoto Jul 30, 2020

Choose a reason for hiding this comment

temoto Jul 30, 2020

Choose a reason for hiding this comment

temoto commented Jul 30, 2020

codecov bot commented Dec 19, 2023 • edited

Codecov Report

codecov bot commented Dec 19, 2023 •

edited