Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wsgi: Work around CPython bug when parsing non-ASCII headers #574

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

tipabu
Copy link
Contributor

@tipabu tipabu commented Jun 4, 2019

Under CPython 3, when an out-of-spec client sends non-ascii header names, email.feedparser halts parsing and assumes that the non-ASCII header must be part of a message body. This is despite the fact that http.client's parse_headers (which is called by the server protocol's parse_request) has already determined the boundary between headers and body and has only sent the headers to be parsed. This causes the first such header and all subsequent headers to be silently ignored.

See also: https://bugs.python.org/issue37093

Under CPython 2, httplib would happily parse non-ASCII headers since rfc822 just looks for a colon in the header line. As a result, py2 applications may have been written that not only allowed but even encouraged the use of, say, UTF-8 in user-defined header names and values.

Support such applications in moving to py3 by checking for a payload on the parsed headers; if found, parse it for more headers. A few things worth pointing out about this:

  • The parsing does not handle line folding, but our code didn't handle this well on py2 either. Abort parsing.

  • Header lines without a colon will also abort parsing, but this is maybe preferable to py2's behavior where the offending line is interpreted as the separator between headers and body and is silently discarded, and the request is allowed to continue. At least on py3, the body will start after the first blank line rather than part way through the (bad) headers.

  • Building a WSGI environment normally involves upper-casing the header names, which should be safe due to their case-insensitivity, but it gets more complicated when considering non-ASCII headers:

    • While WSGI requires that the header names and values be interpreted as Latin-1 on py3, that isn't necessarily the encoding preferred by the application.
    • Even if the application wants Latin-1, upper-casing some Latin-1-encodable code points yields a code point that is not Latin-1-encodable, and so should not be used in a WSGI environment.

    So, preserve the existing py2 behavior on py3: Only upper-case a-z

Drive-by: Be more explicit about when we're branching because of py2/3 differences so when we eventually drop support for py2, we can remove the old path with confidence.

Under CPython 3, when an out-of-spec client sends non-ascii header
*names*, email.feedparser halts parsing and assumes that the non-ASCII
header must be part of a message body. This is despite the fact that
http.client's parse_headers (which is called by the server protocol's
parse_request) has already determined the boundary between headers and
body and has *only* sent the headers to be parsed. This causes the first
such header and *all* subsequent headers to be silently ignored.

See also: https://bugs.python.org/issue37093

Under CPython 2, httplib would happily parse non-ASCII headers so long
as there was a colon in the header line. As a result, py2 applications
may have been written that not only allowed but even encouraged the use
of UTF-8 in user-defined header names and values.

Support such applications in moving to py3 by checking for a payload on
the parsed headers; if found, parse it for more headers. A few things
worth pointing out about this:

- The parsing does not handle line folding, but our code didn't handle
  this well on py2 either. Abort parsing.
- Header lines without a colon will also abort parsing, but this is
  maybe preferable to py2's behavior where the offending line is
  interpreted as the separator between headers and body and is silently
  discarded, and the request is allowed to continue. At least on py3,
  the body will start after the first blank line rather than part way
  through the (bad) headers.
- Building a WSGI environment normally involves upper-casing the header
  names, which should be safe due to their case-insensitivity, but it
  gets more complicated when considering non-ASCII headers:
  * While WSGI requires that the header names and values be interpreted
    as Latin-1 on py3, that isn't necessarily the encoding preferred by
    the application.
  * Even if the application wants Latin-1, upper-casing some
    Latin-1-encodable code points yields a code point that is not
    Latin-1-encodable, and so should not be used in a WSGI environment.
  So, preserve the existing py2 behavior on py3: Only upper-case 'a'-'z'

Drive-by: Be more explicit about when we're branching because of py2/3
differences so when we eventiually drop support for py2, we can remove
the old path with confidence.
@tipabu
Copy link
Contributor Author

tipabu commented Jul 14, 2019

I'm not entirely sure whether we ought to be the ones doing this. It's not so bad doing it from the application, especially if it's already providing its own HttpProtocol anyway: openstack/swift@76fde892#diff-391ba67a14720da1635b51be427fa134

There's still potential for trouble if headers like Content-Length or Transfer-Encoding get dropped... but maybe we're better off just waiting for python/cpython#13788 or something like it to land upstream?

@tipabu
Copy link
Contributor Author

tipabu commented Oct 3, 2019

Hmm. Well,

😞

I'm not sure what to do with this. In thinking about it some more, I wonder if it'd be tolerable to swap out email.feedparser.headerRE with something more permissive... it'd certainly be a shorter change.

FWIW, it looks like other python http servers (it's been a bit, but i seem to remember checking cherrypy, gunicorn, tornado, twisted...) generally handle their own http parsing rather than leaning on stdlib... but that doesn't seem great, either.

headers = [h.split(':', 1) for h in headers]
else:
headers = self.headers._headers
payload = self.headers.get_payload()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh jeeze -- apparently this may return a str or a list of one or more messages (if Content-Type is message/rfc822).

😞

Comment on lines 676 to +677
if ct is None:
ct_was_none = True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if ct is None:
ct_was_none = True
ct_was_none = ct is None
if ct_was_none:

Comment on lines +682 to +683
else:
ct_was_none = False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
else:
ct_was_none = False

@temoto
Copy link
Member

temoto commented Jul 30, 2020

@tipabu hey, sorry it was left without attention for long.

  • Current HTTP/1.1 RFC is 7230
  • RFC7230#3.2 Header Fields
    field-name = token
  • RFC7230#3.2.6 Field Value Components
    token = 1*tchar
    tchar = (redated symbols like !#$-_) / DIGIT / ALPHA
    ; any VCHAR, except delimiters
  • RFC7230#1.2 Syntax Notation

    The following core rules are included by reference, as defined in
    [RFC5234], Appendix B.1: ALPHA (letters), CR (carriage return), CRLF
    (CR LF), CTL (controls), DIGIT (decimal 0-9), DQUOTE (double quote),
    HEXDIG (hexadecimal 0-9/A-F/a-f), HTAB (horizontal tab), LF (line
    feed), OCTET (any 8-bit sequence of data), SP (space), and VCHAR (any
    visible [USASCII] character).

  • RFC5234#B.1
    ALPHA = %x41-5A / %x61-7A ; A-Z / a-z

We have now established that non-ASCII byte in field name makes it invalid header/message. Proper action is to reject such request with HTTP 400 Bad Request.

On questions from your code comments:

  • Should we support line folding?
    No. RFC7230#3.2.4

    Historically, HTTP header field values could be extended over
    multiple lines by preceding each extra line with at least one space
    or horizontal tab (obs-fold). This specification deprecates such
    line folding except within the message/http media type (Section 8.3.1).
    A sender MUST NOT generate a message that includes
    line folding (i.e., that has any field-value that contains a match to
    the obs-fold rule) unless the message is intended for packaging
    within the message/http media type.
    A server that receives an obs-fold in a request message that is not
    within a message/http container MUST either reject the message by
    sending a 400 (Bad Request), preferably with a representation
    explaining that obsolete line folding is unacceptable, or replace
    each received obs-fold with one or more SP octets prior to
    interpreting the field value or forwarding the message downstream.

  • Should we 400 a bad header line?
    It's the right action, yes. If you want to go extra mile - make EVENTLET_WSGI_NONASCII_HEADER_NAME=strict feature switch with option to ignore invalid field or something else.

IMHO we should not try to reconstruct invalid header from broken parser. email.feedparser was already wrong name for parsing HTTP and now it doesn't work correctly, it's time we go separate roads. :-) Do you know decent HTTP parser in Python?

Love this stuff, keep it coming.

Copy link

codecov bot commented Dec 19, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (5b04426) 46% compared to head (944d50a) 46%.
Report is 117 commits behind head on master.

Additional details and impacted files
@@          Coverage Diff          @@
##           master   #574   +/-   ##
=====================================
- Coverage      46%    46%   -1%     
=====================================
  Files          81     81           
  Lines        7976   7976           
  Branches     1365   1365           
=====================================
- Hits         3724   3723    -1     
- Misses       3996   3997    +1     
  Partials      256    256           
Flag Coverage Δ
ipv6 15% <ø> (ø)
py27epolls ?
py27poll 50% <ø> (-1%) ⬇️
py27selects 49% <ø> (+<1%) ⬆️
py34epolls 42% <ø> (-1%) ⬇️
py34poll 42% <ø> (-1%) ⬇️
py34selects 42% <ø> (ø)
py35epolls 42% <ø> (-1%) ⬇️
py35poll 42% <ø> (-1%) ⬇️
py35selects 42% <ø> (+<1%) ⬆️
py36epolls 42% <ø> (ø)
py36poll 42% <ø> (+<1%) ⬆️
py36selects 42% <ø> (+<1%) ⬆️
py37epolls 42% <ø> (ø)
py37poll 42% <ø> (+<1%) ⬆️
py37selects 42% <ø> (-1%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants