Skip to content

Commit

Permalink
parse auth headers using pyparsing instead of regexp
Browse files Browse the repository at this point in the history
Fixes CPU burn DoS by cubic complexity of whitespace matching in
WWW_AUTH_RELAXED (default) regexp.
  • Loading branch information
temoto committed Jan 27, 2021
1 parent 33090ab commit bd9ee25
Show file tree
Hide file tree
Showing 12 changed files with 579 additions and 1,032 deletions.
493 changes: 107 additions & 386 deletions python2/httplib2/__init__.py

Large diffs are not rendered by default.

61 changes: 61 additions & 0 deletions python2/httplib2/auth.py
@@ -0,0 +1,61 @@
import base64
import re

import pyparsing as pp

This comment has been minimized.

Copy link
@ptmcg

ptmcg Mar 27, 2021

Contributor

pyparsing 2.4.7 is Py2 and Py3 compatible. It may be worth splitting the parser out to a separate module that both your P2 and Py3 versions import.

This comment has been minimized.

Copy link
@temoto

temoto Mar 29, 2021

Author Member

Thanks, we're committed to drop Python2 support altogether and then will use something like pyparsing>=3 to put lesser version constraint on projects.


from .error import *

UNQUOTE_PAIRS = re.compile(r"\\(.)")
unquote = lambda s, l, t: UNQUOTE_PAIRS.sub(r"\1", t[0][1:-1])

# https://tools.ietf.org/html/rfc7235#section-1.2
# https://tools.ietf.org/html/rfc7235#appendix-B
tchar = "!#$%&'*+-.^_`|~" + pp.nums + pp.alphas
token = pp.Word(tchar).setName("token")
token68 = pp.Combine(pp.Word("-._~+/" + pp.nums + pp.alphas) + pp.ZeroOrMore("=")).setName("token68")

This comment has been minimized.

Copy link
@ptmcg

ptmcg Mar 27, 2021

Contributor

pyparsing is not exactly fast. I believe you would get better performance using a pp.Word('=') in place of the pp.ZeroOrMore("=") . Note that this will use a regex internally.
Also, pyparsing's whitespace skipping will accept token68's that look like "0 = = =". If that is not permissable, then use leaveWhitespace() on the second term in this expression (either ZeroOrMore or Word).

This comment has been minimized.

Copy link
@temoto

temoto Mar 29, 2021

Author Member

@ptmcg thanks. Does it need to be pp.Optional(pp.Word("=")) to accept string with zero length suffix?

This comment has been minimized.

Copy link
@ptmcg

ptmcg Mar 29, 2021

Contributor

Yes.


quoted_string = pp.dblQuotedString.copy().setName("quoted-string").setParseAction(unquote)
auth_param_name = token.copy().setName("auth-param-name").addParseAction(pp.downcaseTokens)
auth_param = auth_param_name + pp.Suppress("=") + (token ^ quoted_string)

This comment has been minimized.

Copy link
@ptmcg

ptmcg Mar 27, 2021

Contributor

I don't think there is any ambiguity between token and quoted_string such that you need "^" operator. I think you can safely replace with "|", and gain some parse-time performance.

This comment has been minimized.

Copy link
@temoto

temoto Mar 29, 2021

Author Member

This works, improvement is 940us -> 730us (~20%) for params.parseString of a realistic digest auth params. Thank you.

params = pp.Dict(pp.delimitedList(pp.Group(auth_param)))

scheme = token("scheme")
challenge = scheme + (token68("token") ^ params("params"))

This comment has been minimized.

Copy link
@ptmcg

ptmcg Mar 27, 2021

Contributor

'^' operators can also be expensive, since they always check all alternatives, and then select the longest. I think you will get suitable and slightly better behavior if you use (params("params") | token68("token")) instead. (Depends on how often you get a params in the input string).

This comment has been minimized.

Copy link
@temoto

temoto Mar 29, 2021

Author Member

This works only with params first and successfully produces wrong results with token first. Maybe syntax is genuinely ambiguous or my implementation is bad. Improvement is 1220us -> 630us (~50%) for challenge.parseString on synthetic complex line. Thank you.


authentication_info = params.copy()
www_authenticate = pp.delimitedList(pp.Group(challenge))


def _parse_authentication_info(headers, headername="authentication-info"):
"""https://tools.ietf.org/html/rfc7615
"""
header = headers.get(headername, "").strip()
if not header:
return {}
try:
parsed = authentication_info.parseString(header)
except pp.ParseException as ex:
# print(ex.explain(ex))
raise MalformedHeader(headername)

return parsed.asDict()


def _parse_www_authenticate(headers, headername="www-authenticate"):
"""Returns a dictionary of dictionaries, one dict per auth_scheme."""
header = headers.get(headername, "").strip()
if not header:
return {}
try:
parsed = www_authenticate.parseString(header)
except pp.ParseException as ex:
# print(ex.explain(ex))
raise MalformedHeader(headername)

retval = {
challenge["scheme"].lower(): challenge["params"].asDict()
if "params" in challenge
else {"token": challenge.get("token")}
for challenge in parsed
}
return retval
48 changes: 48 additions & 0 deletions python2/httplib2/error.py
@@ -0,0 +1,48 @@
# All exceptions raised here derive from HttpLib2Error
class HttpLib2Error(Exception):
pass


# Some exceptions can be caught and optionally
# be turned back into responses.
class HttpLib2ErrorWithResponse(HttpLib2Error):
def __init__(self, desc, response, content):
self.response = response
self.content = content
HttpLib2Error.__init__(self, desc)


class RedirectMissingLocation(HttpLib2ErrorWithResponse):
pass


class RedirectLimit(HttpLib2ErrorWithResponse):
pass


class FailedToDecompressContent(HttpLib2ErrorWithResponse):
pass


class UnimplementedDigestAuthOptionError(HttpLib2ErrorWithResponse):
pass


class UnimplementedHmacDigestAuthOptionError(HttpLib2ErrorWithResponse):
pass


class MalformedHeader(HttpLib2Error):
pass


class RelativeURIError(HttpLib2Error):
pass


class ServerNotFoundError(HttpLib2Error):
pass


class ProxiesUnavailableError(HttpLib2Error):
pass

2 comments on commit bd9ee25

@ptmcg
Copy link
Contributor

@ptmcg ptmcg commented on bd9ee25 Mar 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm glad pyparsing was useful to you in resolving this. I had to fix similar regex parsing issues (runaway backtracking) in some of pyparsing's internal regexen, so I can definitely empathize!

@temoto
Copy link
Member Author

@temoto temoto commented on bd9ee25 Mar 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mean improvement on realistic and synthetic complex headers is around 50%. Thank you very much.

Please sign in to comment.