
Performance Tips

Paul McGuire edited this page Jan 17, 2019 · 3 revisions

While it is convenient and easy to write parsers with pyparsing, these parsers are not always lightning fast. Even so, there are a few things that can be done in a pyparsing parser to help move things along.

  1. Packrat parsing. For recursive grammars (including those defined with infixNotation), packrat parsing is probably the easiest and most effective speedup you can apply. Packrat parsing is a special form of memoization (function result caching) specifically for parser applications. If a single parsing construct appears multiple times in a grammar such that it can get evaluated at the same position in the incoming text, then the parse results are not re-evaluated, but are retrieved from the packrat cache.

    To activate packrat parsing, insert this line immediately after importing pyparsing:

     import pyparsing
     pyparsing.ParserElement.enablePackrat()
    

    If pyparsing is imported in multiple modules, it is only necessary to enable packrat parsing once.
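As a sketch, here is packrat parsing enabled ahead of a small recursive infixNotation grammar (the arithmetic grammar itself is just a hypothetical example):

```python
import pyparsing as pp

# enable packrat parsing before building the grammar
pp.ParserElement.enablePackrat()

# a small recursive grammar built with infixNotation; sub-expressions
# re-evaluated at the same input position are served from the packrat cache
integer = pp.pyparsing_common.integer
arith_expr = pp.infixNotation(integer, [
    (pp.oneOf("* /"), 2, pp.opAssoc.LEFT),
    (pp.oneOf("+ -"), 2, pp.opAssoc.LEFT),
])

print(arith_expr.parseString("1 + 2 * 3"))
```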

  2. Convert simple Combine() expressions to use Regex() (especially for terminals, like serial numbers or numerics). In early versions of pyparsing examples, there were definitions of floating point numbers that followed the pyparsing idiom of building up a parser from smaller components:

     real_number = pp.Combine(pp.Word(pp.nums) + '.' + pp.Word(pp.nums) 
                              + pp.Optional(pp.Word('eE', exact=1) 
                                            + pp.Optional(pp.oneOf("+ -")) 
                                            + pp.Word(pp.nums)))
    

    This involves a lot of expression evaluation and many function calls (which are expensive in Python), in a sub-expression that is likely to be evaluated many times within a larger parser.

    This same expression can be parsed using a regex, and can be included in your parser with a Regex expression:

     real_number = pp.Regex(r"\d+\.\d+([Ee][+-]?\d+)?")
    

    This is much more efficient in Python terms, as there is just a single expression to evaluate, requiring just a single set of expression-related function calls. Making this conversion in the provided Verilog parser gained a significant performance speedup. (It is not necessary to go overboard on this - there is no gain in converting pp.Word(pp.alphas) to pp.Regex(r"[a-zA-Z]"), since Word uses its own compiled regular expression internally.)
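For example (a sketch; the float conversion via a parse action is an optional extra, not part of the original definition):

```python
import pyparsing as pp

# a single regex replaces the Combine/Word/Optional construction above
real_number = pp.Regex(r"\d+\.\d+([Ee][+-]?\d+)?")
# optional: convert the matched string to a Python float
real_number.setParseAction(lambda t: float(t[0]))

print(real_number.parseString("3.14159")[0])
print(real_number.parseString("6.02E23")[0])
```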

  3. Use the pre-defined expressions in pyparsing_common. This follows naturally from the previous tip. There are several pre-defined expressions in pyparsing_common that have already been optimized for use in parsers, so consider using them instead of writing your own versions. Most of these come with attached parse actions to do conversions as well, so that parsing an integer string will result not in the substring of digits but in a Python int. Here are some of the expressions available in pyparsing_common (a full list is in the online docs):

    • numeric forms (integers, reals, scientific notation)
    • common programming identifiers
    • network addresses (MAC, IPv4, IPv6)
    • ISO 8601 dates and datetimes
    • UUID
    • comma-separated list
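A short sketch of a few of these expressions and their built-in conversions:

```python
import pyparsing as pp
ppc = pp.pyparsing_common

# each expression's parse action converts the match to a useful value
print(ppc.integer.parseString("1729")[0])    # a Python int
print(ppc.real.parseString("3.5")[0])        # a Python float
print(ppc.ipv4_address.parseString("192.168.0.1")[0])
print(ppc.uuid.parseString("12345678-1234-5678-1234-567812345678")[0])
```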
  4. Use MatchFirst instead of Or. Both classes implement the concept of alternation: each works with two or more expressions and returns the parse results from one of those that matches at the current parse position in the input text. The difference is that MatchFirst (created using the '|' operator) tries each expression in turn and immediately returns the result from the first one that matches. Or (created using the '^' operator) evaluates all the given expressions and then returns the result from the longest match (or, in the case of a tie, from the first expression that produced the longest match). This can be time-consuming, especially if the alternatives are not really ambiguous or overlapping. Consider this grammar:

     element ::= alphabetic_group | numeric_digit_group | '.' | ',' | '!'
    

    When parsing input text, no more than one of these alternatives can possibly match, so using an Or here is unnecessary and wasteful. Using the '|' operator to define a pyparsing expression for element is the correct approach.
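A sketch of that element grammar in pyparsing (the Word-based definitions of the two groups are assumptions):

```python
import pyparsing as pp

# hypothetical definitions of the two groups from the grammar above
alphabetic_group = pp.Word(pp.alphas)
numeric_digit_group = pp.Word(pp.nums)

# MatchFirst ('|') is sufficient: the alternatives cannot overlap
element = alphabetic_group | numeric_digit_group | pp.oneOf(". , !")

print(element.parseString("hello")[0])
print(element.parseString("42")[0])
```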

    Here is another example:

     keyword ::= 'if' | 'then' | 'else' | 'elseif' | 'begin' | 'end' | 'ifonly'
    

    This could be an ambiguous expression if constructed using pyparsing Literal classes. Literal parses text without regard to any surrounding text, so Literal('if') | Literal('ifonly') will never match "ifonly", because the MatchFirst will parse the leading 'if' and then return. There are several workarounds for this:

    • order longest-to-shortest, or most-specific to least-specific - You can manually construct your MatchFirst expression putting the longest words first, so that there is no risk of masking a short word with a longer one. This might cause some inefficiencies if one of your keywords is alongkeywordthatishardlyeverused. Since it would be at the front of the line in the MatchFirst's list of expressions, it would be tested in many cases, even though it would rarely match.

    • use oneOf - oneOf is a helper method provided by pyparsing that takes a list of strings and reorders them as necessary to avoid one string masking another, while otherwise keeping them in the provided order. In this way, the parser writer still has some control over the general sequence of evaluation of the alternatives if overlapping is not a problem.
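A sketch of oneOf applied to the keyword list above:

```python
import pyparsing as pp

# oneOf reorders 'if' and 'ifonly' internally so neither masks the other
keyword = pp.oneOf("if then else elseif begin end ifonly")

print(keyword.parseString("ifonly")[0])
print(keyword.parseString("if")[0])
```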

    • use Keyword - Instead of using Literal for your individual keyword items, use the pyparsing Keyword class. Unlike Literal, Keyword does look at the surrounding text, such that Keyword("if") would not match an input word of "ifonly" or "iffy". Since Keyword is not confused in this way, the words in the list can be ordered in any way, and MatchFirst will not choose the wrong one. This has the added benefit of allowing the parser writer to order the keywords in a preferred evaluation order, if the likely frequency of the keywords is known a priori.
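A sketch contrasting Literal and Keyword on the 'ifonly' case:

```python
import pyparsing as pp

literal_kw = pp.Literal("if") | pp.Literal("ifonly")
keyword_kw = pp.Keyword("if") | pp.Keyword("ifonly")

# Literal('if') matches the first two characters of 'ifonly' and returns
print(literal_kw.parseString("ifonly")[0])
# Keyword('if') rejects the partial match, so Keyword('ifonly') gets its turn
print(keyword_kw.parseString("ifonly")[0])
```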

    • use Or - Sometimes it can't be helped. But at least you have some alternatives that can buy some better performance, especially for expressions that are defined at a low level in the grammar (meaning that they are likely to be used frequently).
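When Or really is needed, the '^' operator evaluates every alternative and keeps the longest match, so even unordered Literals resolve correctly (a sketch):

```python
import pyparsing as pp

# Or ('^') tries both alternatives and returns the longest match,
# regardless of the order in which they are listed
keyword = pp.Literal("if") ^ pp.Literal("ifonly")

print(keyword.parseString("ifonly")[0])
```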