wsgi: Work around CPython bug when parsing non-ASCII headers #574
base: master
Under CPython 3, when an out-of-spec client sends non-ASCII header *names*, email.feedparser halts parsing and assumes that the non-ASCII header must be part of a message body. This is despite the fact that http.client's parse_headers (which is called by the server protocol's parse_request) has already determined the boundary between headers and body and has *only* sent the headers to be parsed. This causes the first such header and *all* subsequent headers to be silently ignored.

See also: https://bugs.python.org/issue37093

Under CPython 2, httplib would happily parse non-ASCII headers so long as there was a colon in the header line. As a result, py2 applications may have been written that not only allowed but even encouraged the use of UTF-8 in user-defined header names and values.

Support such applications in moving to py3 by checking for a payload on the parsed headers; if found, parse it for more headers.

A few things worth pointing out about this:

- The parsing does not handle line folding, but our code didn't handle this well on py2 either. A folded line aborts parsing.
- Header lines without a colon will also abort parsing, but this is maybe preferable to py2's behavior where the offending line is interpreted as the separator between headers and body and is silently discarded, and the request is allowed to continue. At least on py3, the body will start after the first blank line rather than part way through the (bad) headers.
- Building a WSGI environment normally involves upper-casing the header names, which should be safe due to their case-insensitivity, but it gets more complicated when considering non-ASCII headers:
  * While WSGI requires that the header names and values be interpreted as Latin-1 on py3, that isn't necessarily the encoding preferred by the application.
  * Even if the application wants Latin-1, upper-casing some Latin-1-encodable code points yields a code point that is not Latin-1-encodable, and so should not be used in a WSGI environment.

  So, preserve the existing py2 behavior on py3: only upper-case 'a'-'z'.

Drive-by: Be more explicit about when we're branching because of py2/3 differences so when we eventually drop support for py2, we can remove the old path with confidence.
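To make the upper-casing point concrete, here is a minimal sketch of ASCII-only upper-casing when building a WSGI environ key. The names `wsgi_environ_key` and `is_latin1` are hypothetical helpers for illustration, not the code in this PR; the sketch only assumes that header names arrive as Latin-1-decoded `str`, as WSGI requires on py3.

```python
import string

# Map only ASCII 'a'-'z' to 'A'-'Z'; every other code point is left alone.
_ASCII_UPPER = str.maketrans(string.ascii_lowercase, string.ascii_uppercase)


def wsgi_environ_key(header_name):
    """Build an HTTP_* environ key, upper-casing ASCII letters only.

    Hypothetical helper: illustrates the idea, not eventlet's implementation.
    """
    return 'HTTP_' + header_name.replace('-', '_').translate(_ASCII_UPPER)


def is_latin1(text):
    """Return True if text can be encoded as Latin-1."""
    try:
        text.encode('latin-1')
    except UnicodeEncodeError:
        return False
    return True


name = 'x-\xff'  # U+00FF is Latin-1-encodable
assert is_latin1(name)
# str.upper() maps U+00FF to U+0178, which Latin-1 cannot encode.
assert not is_latin1(name.upper())
# ASCII-only upper-casing keeps the key Latin-1-encodable.
assert is_latin1(wsgi_environ_key(name))
```

The translation table is used instead of `str.upper()` so that the result never leaves the Latin-1 range for input that started inside it.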
I'm not entirely sure whether we ought to be the ones doing this. It's not so bad doing it from the application, especially if it's already providing its own
There's still potential for trouble if headers like
Hmm. Well,
😞 I'm not sure what to do with this. In thinking about it some more, I wonder if it'd be tolerable to swap out
FWIW, it looks like other Python HTTP servers (it's been a bit, but I seem to remember checking cherrypy, gunicorn, tornado, twisted...) generally handle their own HTTP parsing rather than leaning on stdlib... but that doesn't seem great, either.
        headers = [h.split(':', 1) for h in headers]
    else:
        headers = self.headers._headers
        payload = self.headers.get_payload()
Oh jeeze -- apparently this may return a str or a list of one or more messages (if Content-Type is message/rfc822). 😞
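For reference, a rough sketch of the payload fallback with the type guard this comment calls for. `parse_all_headers` and `RAW_EXAMPLE` are hypothetical names, and the sketch uses the public `items()` rather than the private `_headers` from the diff; it assumes the header block is passed as bytes including the trailing blank line, and it only attempts recovery when the payload is a plain `str`.

```python
import io
import http.client


def parse_all_headers(raw):
    """Parse a raw header block, recovering headers stranded in the payload.

    Hypothetical helper, not the code in this PR. Returns (name, value) pairs.
    """
    msg = http.client.parse_headers(io.BytesIO(raw))
    headers = list(msg.items())

    payload = msg.get_payload()
    # get_payload() may return a list of messages (e.g. for message/rfc822);
    # this sketch only attempts recovery from a plain string payload.
    if not isinstance(payload, str):
        return headers

    # Split on '\n' only: str.splitlines() also breaks on characters such as
    # '\x85', which can appear when UTF-8 bytes are decoded as Latin-1.
    for line in payload.split('\n'):
        line = line.rstrip('\r')
        if not line:
            break  # blank line: end of the header block
        if line[0] in ' \t' or ':' not in line:
            break  # folded line or missing colon: abort parsing
        name, value = line.split(':', 1)
        headers.append((name, value.strip()))
    return headers


# An out-of-spec request: the second header name contains UTF-8 bytes.
RAW_EXAMPLE = b''.join([
    b'Host: example.com\r\n',
    'X-Caf\u00e9: 1\r\n'.encode('utf-8'),
    b'Accept: */*\r\n',
    b'\r\n',
])

recovered = parse_all_headers(RAW_EXAMPLE)
```

Names and values come back as Latin-1-decoded `str`, so an application that expects UTF-8 would still need to re-encode as Latin-1 and decode as UTF-8 itself.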
if ct is None:
    ct_was_none = True
Suggested change:
- if ct is None:
-     ct_was_none = True
+ ct_was_none = ct is None
+ if ct_was_none:
else:
    ct_was_none = False
Suggested change:
- else:
-     ct_was_none = False
@tipabu hey, sorry this was left without attention for so long.
We have now established that a non-ASCII byte in a field name makes the header (and the message) invalid. The proper action is to reject such a request with HTTP 400 Bad Request. On the questions from your code comments:
IMHO we should not try to reconstruct an invalid header from a broken parser. Love this stuff, keep it coming.
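A minimal sketch of what rejecting could look like, assuming the raw header lines are available as bytes before they reach the stdlib parser. `check_header_lines` is a hypothetical helper, not an eventlet API; the character class follows the `token` grammar for field names in RFC 7230, section 3.2.6.

```python
import re

# RFC 7230 "token": one or more tchar (visible ASCII, no delimiters).
_TOKEN = re.compile(rb"[!#$%&'*+\-.^_`|~0-9A-Za-z]+")


def check_header_lines(lines):
    """Return (400, 'Bad Request') for the first invalid line, else None.

    Hypothetical helper: lines are raw header lines without the trailing CRLF.
    """
    for line in lines:
        name, sep, _value = line.partition(b':')
        # A missing colon, an empty name, or any byte outside the token
        # grammar (including every non-ASCII byte) makes the line invalid.
        if not sep or not _TOKEN.fullmatch(name):
            return 400, 'Bad Request'
    return None


ok = check_header_lines([b'Host: example.com', b'Accept: */*'])
bad = check_header_lines([b'Host: example.com',
                          'X-Caf\u00e9: 1'.encode('utf-8')])
```

Folded continuation lines (starting with a space or tab) are also rejected by this sketch, since they contain no valid field name.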
Codecov Report

All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@           Coverage Diff           @@
##           master   #574   +/-   ##
=====================================
- Coverage      46%    46%   -1%
=====================================
  Files          81     81
  Lines        7976   7976
  Branches     1365   1365
=====================================
- Hits         3724   3723    -1
- Misses       3996   3997    +1
  Partials      256    256