Skip to content

Design Notes

Paul McGuire edited this page Dec 12, 2022 · 2 revisions

What is Group for?

When building up a pyparsing parser, the resulting strings are returned as a ParseResults object, with the parsed strings in a flat sequence. This is the case even if you build up the parser from sub-expressions, and add them into a larger expression using '+' and '|' operators.

Imagine we are writing a parser for a simplified form of a Python def statement. Here is a BNF:

def_stmt ::= 'def' identifier '(' arg_spec, ... ')' -> return_type_spec ':'
arg_spec ::= identifier
return_type_spec ::= identifier
identifier ::= <normal Python identifier>

This translates almost directly to pyparsing (working from bottom to top):

import pyparsing as pp
ppc = pp.common

# define punctuation and keywords
LPAR, RPAR, COLON = map(pp.Suppress, "():")
ARROW = pp.Suppress("->")
DEF = pp.Keyword("def").suppress()

identifier = ppc.identifier
arg_spec = identifier()
return_type_spec = identifier()
def_stmt = (DEF + identifier
            + LPAR + pp.Opt(pp.delimitedList(arg_spec)) + RPAR
            + ARROW + return_type_spec + COLON)     

We can test this parser using the run_tests() method:

def_stmt.run_tests("""\
    def sin() -> float:
    def abs(x) -> float:
    def pad(s, min, max) -> str:
    """)

And we get our parsed results:

def sin() -> float:
['sin', 'float']

def abs(x) -> float:
['abs', 'x', 'float']

def pad(s, min, max) -> str:
['pad', 's', 'min', 'max', 'str']

This is okay, we can certainly parcel this out into function name, arguments, and return type. But now imagine we add support for optional types for the arguments as well:

arg_spec ::= identfier [':' type_spec]
type_spec ::= identfier

So that pads definition could be expanded to def pad(s: str, min: int, max:int) -> str:. We accomplish this just be changing the definition of arg_spec to:

type_spec = identifier()
arg_spec = identifier() + pp.Opt(COLON + type_spec)

And parsing the type annotated definition of pad now gives:

def pad(s: str, min: int, max: int) -> str:
['pad', 's', 'str', 'min', 'int', 'max', 'int', 'str']

SUCCESS! Well, sort of. We can see that we've parsed all the individual parts. But the results are still just a sequence of the found strings. We could remove the suppressing of all the punctuation, and get the separating ()'s and ,'s, but then we will have done little more than tokenize the statement, and now we need logic to walk all the tokens.

Also, since the types are optional, we can't reliably walk this list of strings, assuming that after the function name will come the name/type pairs for each argument - some arguments might have omitted the type.

This is where Group comes in. If we enclose arg_spec in a Group, this will define name/type structures for the def statement. Changing arg_spec to:

arg_spec = pp.Group(identifier() + pp.Opt(COLON + type_spec))

With this change, the parser gives us a ParseResults containing:

['pad', ['s', 'str'], ['min', 'int'], ['max', 'int'], 'str']

Now that we can group the results using Group, we can go further and gather all the parsed argument specs into a sub-structure, so we don't have to unpack the results by grabbing leading name and trailing return type, and assuming the rest are the arguments.

So next we change the def statement itself to group the optional arguments:

def_stmt = (DEF + identifier
            + LPAR + pp.Group(pp.Opt(pp.delimitedList(arg_spec))) + RPAR
            + ARROW + return_type_spec + COLON)

Giving us this structured result:

['pad', [['s', 'str'], ['min', 'int'], ['max', 'int']], 'str']

If we run this parser against our original list of function definitions, we will see another benefit: every function always returns exactly 3 elements. Even those that have no defined arguments, such as the one for sin, will include an empty list for the second element.

And lastly, to make it easier to access the individual parts of the results, we'll add results names to the grammar definition:

def_stmt = (DEF + identifier("function_name")
            + LPAR + pp.Group(pp.Opt(pp.delimitedList(arg_spec)))("args") + RPAR
            + ARROW + return_type_spec("returns") + COLON)

And we get:

['pad', [['s', 'str'], ['min', 'int'], ['max', 'int']], 'str']
- args: [['s', 'str'], ['min', 'int'], ['max', 'int']]
  [0]:
    ['s', 'str']
  [1]:
    ['min', 'int']
  [2]:
    ['max', 'int']
- function_name: 'pad'
- returns: 'str'

You can still access the contents of the ParseResults as if they were items in a list, but you can also get to them by name:

result = def_stmt.parse_string("def pad(s: str, min: int, max: int) -> str:")
print(result[0])             # prints 'pad'
print(result.function_name)  # also prints 'pad'

We should do the same thing for the parts of each argument spec:

arg_spec = pp.Group(identifier("name") + pp.Opt(COLON + type_spec("type")))

To get:

['pad', [['s', 'str'], ['min', 'int'], ['max', 'int']], 'str']
- args: [['s', 'str'], ['min', 'int'], ['max', 'int']]
  [0]:
    ['s', 'str']
    - name: 's'
    - type: 'str'
  [1]:
    ['min', 'int']
    - name: 'min'
    - type: 'int'
  [2]:
    ['max', 'int']
    - name: 'max'
    - type: 'int'
- function_name: 'pad'
- returns: 'str'

So we could then write:

for arg in result.args:
    print(arg.name, arg.type)

Group has done one more thing for us. Note that all the grouped argument specs have maintained their separate argument names and types. If we remove the Group() enclosing arg_spec, this is the kind of result we would get:

['pad', ['s', 'str', 'min', 'int', 'max', 'int'], 'str']
- args: ['s', 'str', 'min', 'int', 'max', 'int']
  - name: 'max'
  - type: 'int'
- function_name: 'pad'
- returns: 'str'

Saving results names into a ParseResults is very similar to assigning key-values to a Python dict: if duplicated keys are used, only the last assigned value is stored. When the arg_specs are ungrouped, the name and type only have the values for the final max argument. So by grouping the argument specs, we maintain the separate names and types for each argument in the list of arguments.

In summary, Group serves two purposes:

  • it adds structure
  • it scopes results names

By using Group and results names, your parser will generate results that are much easier for your post-parsing code to process.

Here are some ideas for other enhancements to this parser:

  • add an optional section after the return type to list possible exceptions that could be raised (similar to "throws" clause in Java method definitions)
  • expand arg_type to handle multiple types, such as def abs(x: int|float) -> float:

Here is the complete program for this parser:

import pyparsing as pp
ppc = pp.common

# define punctuation and keywords (they get suppressed because they are only
# needed at parse time, they don't contain useful information for post-parse work)
LPAR, RPAR, COLON = map(pp.Suppress, "():")
ARROW = pp.Suppress("->")
DEF = pp.Keyword("def").suppress()

# define the sub-expressions
identifier = ppc.identifier
type_spec = identifier()
arg_spec = pp.Group(identifier("name") + pp.Opt(COLON + type_spec("type")))
return_type_spec = type_spec()

# define the full def statement expression
def_stmt = (DEF + identifier("function_name")
            + LPAR + pp.Group(pp.Opt(pp.delimitedList(arg_spec)))("args") + RPAR
            + ARROW + return_type_spec("returns") + COLON)

# try it out!
result = def_stmt.parse_string("def pad(s: str, min: int, max: int) -> str:")
print(result[0])             # prints 'pad'
print(result.function_name)  # also prints 'pad'

for arg in result.args:
    print(arg.name, arg.type)

def_stmt.run_tests("""\
    def sin() -> float:
    def abs(x) -> float:
    def pad(s, min, max) -> str:
    def abs(x: float) -> float:
    def pad(s: str, min: int, max: int) -> str:
    """)