Common Pitfalls When Writing Parsers
Here are some of the more common parser development mistakes that creep into even advanced pyparsing parsers (under construction - more details to come):
- Word("start") should be Literal("start")
- Word(printables + " \n") will match everything
- OneOrMore(Word(alphas)) + "end" fails to match the terminating "end"
- identifier = Word(alphanums + "_") should be Word(alphas + '_', alphanums + '_')
- identifier = Word(alphanums + "._") should be delimitedList(Word(alphas, alphanums + '_'), '.', combine=True)
- Defining a real number expression as real_number = Word(nums + ".") or real_number = Word(nums + "." + nums)
- Defining a real number expression as real_number = Word(nums) + "." + Word(nums)
- Literal("if") should be Keyword("if") if "if" must be distinguished from "ifactor"
- Literal("if not") should be Literal("if") + Literal("not")
- Using forward_expr << term1 | term2 does not parse term2
- Using scan_string with an expression using negative lookahead (~ operator)
It is easy to misinterpret the meaning of the Word class this way, but Word("start") takes the "start" argument and accepts any contiguous group of characters (as in a 'word') that are found in the string "start". So it will match any of:
start
starts
art
ratatattat
To match the actual string "start", you should use the Literal or Keyword class. (Literal will match "start" even if it is part of a longer word, like "startup"; Keyword will only match "start" as its own word.)
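The difference between the three classes can be checked directly. This is a minimal sketch, assuming pyparsing 3.x is installed; the `matches` helper is a hypothetical convenience for this example and is not part of pyparsing.

```python
import pyparsing as pp


def matches(expr, text):
    """Hypothetical helper: True if expr parses the whole of text."""
    try:
        expr.parse_string(text, parse_all=True)
        return True
    except pp.ParseException:
        return False


word_start = pp.Word("start")        # any run of the characters s, t, a, r
literal_start = pp.Literal("start")  # the exact string, even inside a longer word
keyword_start = pp.Keyword("start")  # the exact string, only as a standalone word

# Every sample consists solely of characters from "start", so Word accepts them all.
word_hits = [s for s in ("start", "starts", "art", "ratatattat") if matches(word_start, s)]

# Literal matches the leading "start" of "startup"; Keyword refuses it.
literal_prefix = literal_start.parse_string("startup").as_list()
keyword_accepts_start = matches(keyword_start, "start")
keyword_rejects_startup = not matches(keyword_start, "startup")
```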
Sometimes, in an expression intended to match multiple words with intervening spaces, you will see Word(printables + " "). As previously mentioned, Word uses the input string as the set of characters to accept when parsing a contiguous character group. By including a space, this expression will accept:
<- a blank space
aaa
a word
more than one word
multiple words 1000
If the "1000" value in the last example is actually intended to be a separate parsed expression, it will not get matched, because it will already have been consumed as part of the catch-all Word(printables + " ") expression. To resolve this particular problem, replace the expression with OneOrMore(Word(alphas)).
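A small sketch of the difference, assuming pyparsing 3.x. The trailing Word(nums) expression stands in for whatever separate expression is meant to parse the "1000" value; it is an assumption made for illustration.

```python
import pyparsing as pp

text = "multiple words 1000"

# The catch-all expression swallows the whole line as a single token.
greedy = pp.Word(pp.printables + " ")
greedy_tokens = greedy.parse_string(text).as_list()

# Matching words individually leaves the number for its own expression.
words_then_number = pp.OneOrMore(pp.Word(pp.alphas)) + pp.Word(pp.nums)
separate_tokens = words_then_number.parse_string(text).as_list()
```

With the second form, the number is available as its own token and can be given a name or a parse action.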
This is a common problem:
body_word = Word(alphas)
word_section = "start" + OneOrMore(body_word) + "end"
The problem here is that the terminating "end" also matches body_word, so it will be included in the tokens parsed in OneOrMore(body_word), and the parser will then fail when it does not find "end" afterward.
The solution is to use the stopOn argument to OneOrMore, which tells it that "end" is a terminating sentinel value, and once seen, the OneOrMore repetition parsing should stop:
word_section = "start" + OneOrMore(body_word, stopOn=Keyword("end")) + "end"
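Both versions can be compared side by side. This sketch assumes pyparsing 3.x, which still accepts the stopOn spelling used here; the sample input string is made up for the example.

```python
import pyparsing as pp

body_word = pp.Word(pp.alphas)
sample = "start one two end"

# Without a sentinel, OneOrMore consumes "end" as just another body word.
greedy_section = "start" + pp.OneOrMore(body_word) + "end"
try:
    greedy_section.parse_string(sample)
    greedy_failed = False
except pp.ParseException:
    greedy_failed = True

# With stopOn, the repetition stops as soon as the keyword "end" is next.
word_section = "start" + pp.OneOrMore(body_word, stopOn=pp.Keyword("end")) + "end"
section_tokens = word_section.parse_string(sample).as_list()
```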
Another variation on the Word class pitfall is defining an identifier expression from a single set of characters. So often I see people write:
identifier = Word(alphanums+"_")
which works, but also matches any of the following:
57
321_
7_8_999
456abc
Instead, use the two-argument form of the Word class, in which the first argument gives the allowable leading characters, and the second argument gives the allowable body characters:
identifier = Word(alphas + "_", alphanums + "_").setName("identifier")
This enforces that the character group only allows leading alpha or "_" characters, while the body may also include numeric digits.
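The two definitions can be compared on the same inputs. A minimal sketch, assuming pyparsing 3.x; the `matches` helper and the sample names "_tmp1" and "x9" are hypothetical.

```python
import pyparsing as pp


def matches(expr, text):
    """Hypothetical helper: True if expr parses the whole of text."""
    try:
        expr.parse_string(text, parse_all=True)
        return True
    except pp.ParseException:
        return False


samples = ["57", "321_", "7_8_999", "456abc", "_tmp1", "x9"]

# One-argument form: any mix of the characters, including a leading digit.
loose_identifier = pp.Word(pp.alphanums + "_")
loose_hits = [s for s in samples if matches(loose_identifier, s)]

# Two-argument form: leading characters first, body characters second.
identifier = pp.Word(pp.alphas + "_", pp.alphanums + "_")
strict_hits = [s for s in samples if matches(identifier, s)]
```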
identifier = Word(alphanums + "._") should be delimitedList(Word(alphas, alphanums + '_'), '.', combine=True)
I recommend delimitedList with combine=True for dot-qualified variable names such as "namespace.var.attribute". It is tempting to modify the identifier expression from before by just adding a "." to the allowable body characters:
identifier = Word(alphas + "_", alphanums + "_.")
While this still enforces that the identifier starts with an alpha or "_", it also accepts:
x..y......z....
Using this:
identifier = Word(alphas + "_", alphanums + "_").setName("identifier")
qualified_identifier = delimitedList(identifier, ".", combine=True).setName("qualified_identifier")
is the preferred form.
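A short check of both forms, assuming pyparsing 3.x, where delimitedList remains available under that name (newer releases also offer a DelimitedList class). The `matches` helper is hypothetical.

```python
import pyparsing as pp


def matches(expr, text):
    """Hypothetical helper: True if expr parses the whole of text."""
    try:
        expr.parse_string(text, parse_all=True)
        return True
    except pp.ParseException:
        return False


# Tempting but too permissive: "." is just another body character.
dotted_word = pp.Word(pp.alphas + "_", pp.alphanums + "_.")
dotted_word_accepts_junk = matches(dotted_word, "x..y......z....")

# Preferred: identifiers separated by single dots, returned as one token.
identifier = pp.Word(pp.alphas + "_", pp.alphanums + "_")
qualified_identifier = pp.delimitedList(identifier, ".", combine=True)

qualified_tokens = qualified_identifier.parse_string("namespace.var.attribute").as_list()
qualified_rejects_junk = not matches(qualified_identifier, "x..y......z....")
```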
Defining a real number expression as real_number = Word(nums + ".") or real_number = Word(nums + "." + nums)
It is an easy step to go from parsing an integer using Word(nums) to parsing a real number using Word(nums + "."). But as we saw in the previous item, the Word class does not restrict the presence or quantity of any of the characters found in its initializing string. Word(nums + ".") will match:
.0.
.
......
as valid numbers. Instead, use pyparsing_common.real or pyparsing_common.number.
(Word(nums + "." + nums) looks like it will enforce the presence of only a single "." with digits before and after it. But this is just string concatenation, equivalent to Word("0123456789.0123456789"), and so it will behave just the same as (and have the same problems as) Word(nums + ".").)
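The following sketch, assuming pyparsing 3.x, shows the over-permissive Word next to pyparsing_common.real. The `matches` helper is hypothetical.

```python
import pyparsing as pp


def matches(expr, text):
    """Hypothetical helper: True if expr parses the whole of text."""
    try:
        expr.parse_string(text, parse_all=True)
        return True
    except pp.ParseException:
        return False


junk = [".0.", ".", "......"]

# Word only checks that each character belongs to the set.
word_real = pp.Word(pp.nums + ".")
word_hits = [s for s in junk if matches(word_real, s)]

# The concatenated argument is an ordinary string with repeated characters.
concatenated = pp.nums + "." + pp.nums

# The predefined expression rejects the junk and converts to float.
real = pp.pyparsing_common.real
real_hits = [s for s in junk if matches(real, s)]
real_value = real.parse_string("3.14")[0]
```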
Defining a real number expression as real_number = Word(nums) + "." + Word(nums) correctly parses an input string of a real number, forcing digits before and after a single decimal point, such as:
10.000
3.14
0.00000001
However, because these were defined as separate expressions, pyparsing will return the parsed tokens as:
['10', '.', '000']
['3', '.', '14']
['0', '.', '00000001']
Also, because pyparsing will accept whitespace between expressions, it will also accept:
10. 000
3 .14
0
.000001
You can suppress this whitespace skipping using the Combine class:
real_number = Combine(Word(nums) + "." + Word(nums))
This was in fact how early versions of pyparsing included real number parsing in their examples. However, at parse time, this expression is very slow, and to make things worse, it is usually at a terminal level in a parser, and so would be parsed (or attempted to be parsed) many, many times. I now recommend using the pre-defined expressions in pyparsing_common, real or number (which also include parse actions to do the conversion from str to float), but if you must define your own, use the Regex class:
real_number = Regex(r"\d+\.\d+").setName("real number")
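The three definitions can be compared on the same inputs. A minimal sketch, assuming pyparsing 3.x; the `matches` helper and the `split_real` factory are hypothetical, and the factory exists only so that each definition gets a fresh expression.

```python
import pyparsing as pp


def matches(expr, text):
    """Hypothetical helper: True if expr parses the whole of text."""
    try:
        expr.parse_string(text, parse_all=True)
        return True
    except pp.ParseException:
        return False


def split_real():
    """Hypothetical factory for the three-part definition."""
    return pp.Word(pp.nums) + "." + pp.Word(pp.nums)


# Three separate expressions: three tokens, and embedded whitespace is skipped.
split_tokens = split_real().parse_string("3 .14").as_list()

# Combine joins the tokens and requires the parts to be adjacent.
combined = pp.Combine(split_real())
combined_tokens = combined.parse_string("3.14").as_list()
combined_rejects_space = not matches(combined, "3 .14")

# Regex gives the same single token from one terminal expression.
regex_real = pp.Regex(r"\d+\.\d+")
regex_tokens = regex_real.parse_string("3.14").as_list()
regex_rejects_space = not matches(regex_real, "3 .14")
```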
Literal("if") should be Keyword("if") if "if" must be distinguished from "ifactor"
TBD
Defining a Literal or Keyword with a string that has an embedded space will defeat pyparsing's default whitespace skipping behavior. Literal("if not") will match:
if not
but will not match:
if    not
if /* embedded C style comment */ not
if
not
By using Literal("if") + Literal("not") instead, whitespace and comments (when a comment expression is registered with ignore()) will be parseable.
(Note that using the Literal class will also accept "ifnot" as input. If these must be parsed as two separate words, then define using Keyword("if") + Keyword("not").)
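These cases can be exercised as follows. This is a sketch assuming pyparsing 3.x; the `matches` helper is hypothetical, and registering c_style_comment through ignore() is one way to make the comment case parse, chosen here for illustration.

```python
import pyparsing as pp


def matches(expr, text):
    """Hypothetical helper: True if expr parses the whole of text."""
    try:
        expr.parse_string(text, parse_all=True)
        return True
    except pp.ParseException:
        return False


samples = ["if not", "if    not", "if\nnot", "if /* embedded C style comment */ not"]

# The embedded space must appear literally, exactly once.
single_literal = pp.Literal("if not")
single_hits = [s for s in samples if matches(single_literal, s)]

# Separate literals allow whitespace skipping and ignorable comments between them.
two_literals = (pp.Literal("if") + pp.Literal("not")).ignore(pp.c_style_comment)
two_literal_hits = [s for s in samples if matches(two_literals, s)]

# Literal also accepts the run-together form; Keyword does not.
literals_accept_ifnot = matches(pp.Literal("if") + pp.Literal("not"), "ifnot")
keywords_reject_ifnot = not matches(pp.Keyword("if") + pp.Keyword("not"), "ifnot")
```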
forward_expr << term1 | term2 will not define forward_expr as a MatchFirst of term1 and term2 (due to << having higher operator precedence than |), and only matches term1.
Instead, use:
forward_expr <<= term1 | term2
or:
forward_expr << (term1 | term2)
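The precedence problem is easy to reproduce. A minimal sketch, assuming pyparsing 3.x with its default diagnostic settings; term1, term2, and the `matches` helper are stand-ins chosen for the example.

```python
import pyparsing as pp


def matches(expr, text):
    """Hypothetical helper: True if expr parses the whole of text."""
    try:
        expr.parse_string(text, parse_all=True)
        return True
    except pp.ParseException:
        return False


term1 = pp.Word(pp.nums)
term2 = pp.Word(pp.alphas)

# Python evaluates this as (broken_expr << term1) | term2 and discards the result,
# so broken_expr ends up containing term1 only.
broken_expr = pp.Forward()
broken_expr << term1 | term2

# The in-place operator binds after |, so the whole MatchFirst is stored.
forward_expr = pp.Forward()
forward_expr <<= term1 | term2

broken_hits = [s for s in ("123", "abc") if matches(broken_expr, s)]
fixed_hits = [s for s in ("123", "abc") if matches(forward_expr, s)]
```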
The negative lookahead works in pyparsing for normal calls to parse_string, but is inherently limited when used in scan_string or its two related methods transform_string and search_string (which are just thin wrappers around scan_string).
Say you have a grammar to parse identifiers that don't start with keywords, where input and print are keywords.
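from pyparsing import Group, Word, WordStart, ZeroOrMore, alphanums, alphas, one_of, pyparsing_common
text = "input x y print x y"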
keyword = one_of("print input", as_keyword=True)
identifier = ~keyword + Word(alphas, alphanums)
parse_string does what we want
result = (keyword + Group(ZeroOrMore(identifier))).parse_string(text)
print(result)
['input', ['x', 'y']]
But not so with scan_string. Scanning a character at a time will not match 'input', but will match 'nput'
print(identifier.search_string(text))
[['nput'], ['x'], ['y'], ['rint'], ['x'], ['y']]
Upcase identifiers will upcase more than we want:
identifier.add_parse_action(pyparsing_common.upcase_tokens)
print(identifier.transform_string(text))
iNPUTXY pRINTXY
Making our transformer recognize keywords, but not transform them, is one solution.
identifier = Word(alphas, alphanums)
identifier.add_parse_action(pyparsing_common.upcase_tokens)
transformer = keyword | identifier
print(transformer.transform_string(text))
input X Y print X Y
Adding WordStart before the negative lookahead is another.
identifier = WordStart() + ~keyword + Word(alphas, alphanums)
print(identifier.search_string(text))
[['x'], ['y'], ['x'], ['y']]
identifier.add_parse_action(pyparsing_common.upcase_tokens)
print(identifier.transform_string(text))
input X Y print X Y
Adding WordStart() before the negative lookahead is probably the cleaner solution.