
Common Pitfalls When Writing Parsers

Paul McGuire edited this page Jan 23, 2022 · 5 revisions

Here are some of the more common parser development mistakes that creep into even advanced pyparsing parsers (under construction - more details to come):

Word("start") should be Literal("start")

It is easy to misread the Word class this way, but Word("start") treats "start" as a set of allowed characters and accepts any contiguous group of those characters (as in a 'word'). So it will match any of:

start
starts
art
ratatattat

To match the actual string "start", you should use the Literal or Keyword class. (Literal will match "start" even if it is part of a longer word, like "startup"; Keyword will only match "start" as its own word.)
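
As a minimal sketch of the difference between Word and Literal (assuming pyparsing 3.0 or later for the snake_case method names; the `matches` helper is hypothetical and exists only for this illustration):

```python
import pyparsing as pp

word_start = pp.Word("start")        # any run of the characters s, t, a, r
literal_start = pp.Literal("start")  # the exact string "start"


def matches(expr, text):
    """Hypothetical helper: True if expr matches the whole of text."""
    try:
        expr.parse_string(text, parse_all=True)
        return True
    except pp.ParseException:
        return False


# Word("start") accepts every string built from its character set,
# while Literal("start") accepts only the exact string.
for text in ["start", "starts", "art", "ratatattat"]:
    print(text, matches(word_start, text), matches(literal_start, text))
```

Only the first input should satisfy the Literal; all four satisfy the Word.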

Word(printables + " \n") will match everything

Sometimes, in an expression intended to match multiple words with intervening spaces, you will see Word(printables + " "). As previously mentioned, Word uses the input string as the characters to accept when parsing a contiguous character group. By including a space, this expression will accept:

    <-  a blank space
aaa
a word
more than one word
multiple words 1000

If the "1000" value in the last example is actually intended to be a separate parsed expression, it will not get matched, because it will already have been consumed as part of the catch-all Word(printables + " ") expression. To fix this, replace the catch-all with OneOrMore(Word(alphas)), which matches each word separately and stops at the digits.
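
The following sketch illustrates the swallowing behavior, assuming pyparsing 3.0 or later. The `number` expression and the variable names are hypothetical, chosen only for this example:

```python
import pyparsing as pp

catch_all = pp.Word(pp.printables + " ")
number = pp.Word(pp.nums)  # hypothetical follow-on expression
text = "multiple words 1000"

# The catch-all Word consumes the whole line, so nothing is left for `number`.
try:
    (catch_all + number).parse_string(text)
    greedy_matched = True
except pp.ParseException:
    greedy_matched = False
print(greedy_matched)

# Alphabetic words stop before the digits, so `number` can match them.
words = pp.OneOrMore(pp.Word(pp.alphas))
result = (words + number).parse_string(text).as_list()
print(result)
```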

OneOrMore(Word(alphas)) + "end" fails to match the terminating "end"

This is a common problem:

body_word = Word(alphas)
word_section = "start" + OneOrMore(body_word) + "end"

The problem here is that the terminating "end" also matches body_word, so it is consumed as one of the tokens parsed by OneOrMore(body_word). The parse then fails because no "end" is left to match afterward.

The solution is to use the stopOn argument to OneOrMore, which tells it that "end" is a terminating sentinel value, and once seen, the OneOrMore repetition parsing should stop:

word_section = "start" + OneOrMore(body_word, stopOn=Keyword("end")) + "end"
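
A short sketch comparing the two forms, assuming pyparsing 3.0 or later (where the argument is also available under the snake_case name `stop_on`). The sample text is made up for illustration:

```python
import pyparsing as pp

body_word = pp.Word(pp.alphas)
text = "start the quick brown fox end"  # made-up sample input

# Without stop_on, OneOrMore consumes "end" as just another body word.
greedy_section = "start" + pp.OneOrMore(body_word) + "end"
try:
    greedy_section.parse_string(text)
    greedy_ok = True
except pp.ParseException:
    greedy_ok = False
print(greedy_ok)

# With stop_on, the repetition stops as soon as the sentinel is next in line.
word_section = "start" + pp.OneOrMore(body_word, stop_on=pp.Keyword("end")) + "end"
tokens = word_section.parse_string(text).as_list()
print(tokens)
```

Using Keyword rather than Literal for the sentinel keeps a body word such as "ending" from stopping the repetition early.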

identifier = Word(alphanums + "_") should be Word(alphas + '_', alphanums + '_')

Another variation on the Word class pitfall shows up when defining an identifier expression. I often see people write:

identifier = Word(alphanums+"_")

which works, but also matches any of the following:

57
321_
7_8_999
456abc

Instead, use the two-argument form of the Word class, in which the first argument gives the allowable leading characters, and the second argument gives the allowable body characters:

identifier = Word(alphas + "_", alphanums + "_").setName("identifier")

This enforces that the character group only allows leading alpha or "_" characters, while the body may also include numeric digits.
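
As a quick check of both forms (a sketch assuming pyparsing 3.0 or later; `loose_identifier` and the `accepts` helper are hypothetical names used only here):

```python
import pyparsing as pp

loose_identifier = pp.Word(pp.alphanums + "_")
identifier = pp.Word(pp.alphas + "_", pp.alphanums + "_").set_name("identifier")


def accepts(expr, text):
    """Hypothetical helper: True if expr matches the whole of text."""
    try:
        expr.parse_string(text, parse_all=True)
        return True
    except pp.ParseException:
        return False


# Inputs with a leading digit pass the one-argument form but not the two-argument form.
for text in ["57", "321_", "7_8_999", "456abc"]:
    print(text, accepts(loose_identifier, text), accepts(identifier, text))

# Ordinary identifiers are still accepted by the two-argument form.
for text in ["x", "_private", "item_2"]:
    print(text, accepts(identifier, text))
```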

identifier = Word(alphanums + "._") should be delimitedList(Word(alphas, alphanums + '_'), '.', combine=True)

I recommend delimitedList with combine=True for dot-qualified variable names such as "namespace.var.attribute". It is tempting to modify the identifier expression from before by just adding a "." to the allowable body characters:

identifier = Word(alphas + "_", alphanums + "_.")

While this still enforces that the identifier starts with an alpha or "_", it also accepts:

x..y......z....

Using this:

identifier = Word(alphas + "_", alphanums + "_").setName("identifier")
qualified_identifier = delimitedList(identifier, ".", combine=True).setName("qualified_identifier")

is the preferred form.
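
A sketch of how the two definitions behave, assuming pyparsing 3.0 or later, where `delimited_list` is the snake_case name for delimitedList. The `dotted_word` name and the `accepts` helper are hypothetical:

```python
import pyparsing as pp

dotted_word = pp.Word(pp.alphas + "_", pp.alphanums + "_.")
identifier = pp.Word(pp.alphas + "_", pp.alphanums + "_").set_name("identifier")
qualified_identifier = pp.delimited_list(identifier, ".", combine=True).set_name(
    "qualified_identifier"
)


def accepts(expr, text):
    """Hypothetical helper: True if expr matches the whole of text."""
    try:
        expr.parse_string(text, parse_all=True)
        return True
    except pp.ParseException:
        return False


# The single Word accepts runs of dots; the delimited list does not.
print(accepts(dotted_word, "x..y......z...."))
print(accepts(qualified_identifier, "x..y......z...."))

# combine=True returns the dotted name as one token.
qualified = qualified_identifier.parse_string("namespace.var.attribute").as_list()
print(qualified)
```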

Defining a real number expression as real_number = Word(nums + ".") or real_number = Word(nums + "." + nums)

It is an easy step to go from parsing an integer using Word(nums) to parsing a real number using Word(nums + "."). But as we saw in the previous item, the Word class does not restrict the presence or quantity of any of the characters found in its initializing string. Word(nums + ".") will match:

.0.
.
......

as valid numbers. Instead, use pyparsing_common.real or pyparsing_common.number.

(Word(nums + "." + nums) looks like it will enforce the presence of only a single ".", with digits before and after the decimal point. But this is just string concatenation, equivalent to Word("0123456789.0123456789"), and so it behaves just the same as (and has the same problems as) Word(nums + ".").)
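
A small sketch contrasting the naive Word with pyparsing_common.real, assuming pyparsing 3.0 or later. The `naive_real` name and the `accepts` helper are hypothetical:

```python
import pyparsing as pp
from pyparsing import pyparsing_common as ppc

naive_real = pp.Word(pp.nums + ".")


def accepts(expr, text):
    """Hypothetical helper: True if expr matches the whole of text."""
    try:
        expr.parse_string(text, parse_all=True)
        return True
    except pp.ParseException:
        return False


# The naive Word accepts strings that are clearly not numbers.
for text in [".0.", ".", "......"]:
    print(repr(text), accepts(naive_real, text), accepts(ppc.real, text))

# pyparsing_common.real also converts the matched text to a float.
value = ppc.real.parse_string("3.14")[0]
print(type(value).__name__, value)
```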

Defining a real number expression as real_number = Word(nums) + "." + Word(nums)

This expression correctly parses an input string of a real number, forcing digits before and after a single decimal point, such as:

10.000
3.14
0.00000001

However, because these were defined as separate expressions, pyparsing will return the parsed tokens as:

['10', '.', '000']
['3', '.', '14']
['0', '.', '00000001']

Because pyparsing skips whitespace between expressions, this definition will also accept:

10.     000
3    .14
0
   .000001

You can suppress this whitespace skipping using the Combine class:

real_number = Combine(Word(nums) + "." + Word(nums))

This was in fact how early versions of pyparsing included real number parsing in their examples. However, at parse time, this expression is very slow, and to make things worse, it is usually at a terminal level in a parser, and so would be parsed (or attempted to parse) many, many times. I now recommend using the pre-defined expressions in pyparsing_common, real or number (which also include parse actions to do the conversion from str to float), but if you must define your own, use the Regex class:

real_number = Regex(r"\d+\.\d+").setName("real number")
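
The three definitions can be compared side by side. This is a sketch assuming pyparsing 3.0 or later. The `separate` and `combined` names are hypothetical, and the float-converting parse action is an illustrative addition rather than part of the expression above:

```python
import pyparsing as pp

separate = pp.Word(pp.nums) + "." + pp.Word(pp.nums)
combined = pp.Combine(pp.Word(pp.nums) + "." + pp.Word(pp.nums))
real_number = pp.Regex(r"\d+\.\d+").set_name("real number")
# Illustrative addition: convert the matched string to a float at parse time.
real_number.add_parse_action(lambda tokens: float(tokens[0]))

separate_tokens = separate.parse_string("10.     000").as_list()
combined_tokens = combined.parse_string("3.14").as_list()
regex_tokens = real_number.parse_string("3.14").as_list()
print(separate_tokens, combined_tokens, regex_tokens)

# Both Combine and Regex reject whitespace inside the number.
rejected = []
for expr in (combined, real_number):
    try:
        expr.parse_string("10.     000")
    except pp.ParseException:
        rejected.append(True)
print(rejected)
```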

Literal("if") should be Keyword("if") if "if" must be distinguished from "ifactor"

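Literal("if") matches the characters "if" wherever they occur, so it also matches the leading "if" of a longer word such as "ifactor". Keyword("if") matches only when "if" is not immediately followed by another identifier character, so it rejects "ifactor" and leaves it to be parsed as an identifier.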
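
A minimal sketch of the situation named in this heading, assuming pyparsing 3.0 or later. The `identifier`, `with_literal`, and `with_keyword` expressions are hypothetical stand-ins for a real grammar:

```python
import pyparsing as pp

identifier = pp.Word(pp.alphas, pp.alphanums + "_")

# Hypothetical grammar fragments: try the reserved word first, then an identifier.
with_literal = pp.Literal("if") | identifier
with_keyword = pp.Keyword("if") | identifier

# Literal splits "ifactor" by matching only its first two characters.
literal_result = with_literal.parse_string("ifactor").as_list()
# Keyword refuses the partial match, so the identifier alternative wins.
keyword_result = with_keyword.parse_string("ifactor").as_list()
# A standalone "if" is still matched by the Keyword.
standalone_result = with_keyword.parse_string("if").as_list()

print(literal_result, keyword_result, standalone_result)
```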

Literal("if not") should be Literal("if") + Literal("not")

Defining a Literal or Keyword with a string that has an embedded space will defeat pyparsing's default whitespace skipping behavior.

Literal("if not") will match:

if not

but will not match:

if    not
if /* embedded C style comment */ not
if
 not

By using Literal("if") + Literal("not") instead, pyparsing can skip whitespace between the two words (and comments too, if the parser has been told to ignore them). Note that the Literal form will also accept "ifnot" as input. If the two must be parsed as separate words, define the expression as Keyword("if") + Keyword("not").
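
The variants can be tried against the inputs above. This is a sketch assuming pyparsing 3.0 or later. The expression names and the `parse_or_none` helper are hypothetical, and the comment case assumes the expression has been set up with ignore(c_style_comment):

```python
import pyparsing as pp

single = pp.Literal("if not")
split_literals = pp.Literal("if") + pp.Literal("not")
split_keywords = pp.Keyword("if") + pp.Keyword("not")
# Comments are skipped only when the expression is told to ignore them.
split_ignoring_comments = (pp.Keyword("if") + pp.Keyword("not")).ignore(
    pp.c_style_comment
)


def parse_or_none(expr, text):
    """Hypothetical helper: parsed tokens as a list, or None on failure."""
    try:
        return expr.parse_string(text, parse_all=True).as_list()
    except pp.ParseException:
        return None


print(parse_or_none(single, "if    not"))
print(parse_or_none(split_literals, "if    not"))
print(parse_or_none(split_literals, "if\n not"))
print(parse_or_none(split_literals, "ifnot"))
print(parse_or_none(split_keywords, "ifnot"))
print(parse_or_none(split_ignoring_comments, "if /* embedded C style comment */ not"))
```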

Using forward_expr << term1 | term2 does not parse term2

forward_expr << term1 | term2 will not define forward_expr as a MatchFirst of term1 and term2. Because << has higher operator precedence than |, Python evaluates the statement as (forward_expr << term1) | term2, so forward_expr only matches term1.

Instead, use:

forward_expr <<= term1 | term2

or:

forward_expr << (term1 | term2)
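
A sketch showing the effect of the precedence, assuming pyparsing 3.0 or later. The `term1`, `term2`, `broken`, and `fixed` definitions are hypothetical placeholders:

```python
import pyparsing as pp

term1 = pp.Literal("a")
term2 = pp.Literal("b")

broken = pp.Forward()
# Python evaluates this as (broken << term1) | term2 and discards the result.
broken << term1 | term2

fixed = pp.Forward()
fixed <<= term1 | term2

# The broken Forward only knows about term1, so "b" fails.
try:
    broken.parse_string("b")
    broken_matches_b = True
except pp.ParseException:
    broken_matches_b = False

fixed_result = fixed.parse_string("b").as_list()
print(broken_matches_b, fixed_result)
```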

Using scan_string with an expression using negative lookahead (~ operator)

The negative lookahead works in pyparsing for normal calls to parse_string, but is inherently limited when used in scan_string or its two related methods transform_string and search_string (which are just thin wrappers around scan_string).

Say you have a grammar that parses identifiers but must not treat keywords as identifiers, and input and print are the keywords.

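from pyparsing import Group, Word, WordStart, ZeroOrMore, alphanums, alphas, one_of, pyparsing_common

text = "input x y print x y"

keyword = one_of("print input", as_keyword=True)
identifier = ~keyword + Word(alphas, alphanums)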

parse_string does what we want:

result = (keyword + Group(ZeroOrMore(identifier))).parse_string(text)
print(result)

['input', ['x', 'y']]

But not so with scan_string. After 'input' is rejected at position 0, the scan advances one character and tries again, so it never matches 'input' but does match 'nput':

print(identifier.search_string(text))

[['nput'], ['x'], ['y'], ['rint'], ['x'], ['y']]

Upcasing identifiers with transform_string changes more than we want:

identifier.add_parse_action(pyparsing_common.upcase_tokens)
print(identifier.transform_string(text))

iNPUTXY pRINTXY

One solution is to make the transformer recognize keywords but not transform them:

identifier = Word(alphas, alphanums)
identifier.add_parse_action(pyparsing_common.upcase_tokens)
transformer = keyword | identifier
print(transformer.transform_string(text))

input X Y print X Y

Another is to add WordStart() before the negative lookahead:

identifier = WordStart() + ~keyword + Word(alphas, alphanums)
print(identifier.search_string(text))
identifier.add_parse_action(pyparsing_common.upcase_tokens)
print(identifier.transform_string(text))

[['x'], ['y'], ['x'], ['y']]
input X Y print X Y

Adding WordStart() before the negative lookahead is probably the cleaner solution.