Regression in 3.0.2, 3.0.3, and 3.0.4 #323

nijel · 2021-10-28T08:13:24Z

In translate-toolkit we use pyparsing for parsing Windows RC files.

I did port the code to 3.0.1 and everything worked fine. On 3.0.2 and 3.0.3 we get testsuite failures (see https://github.com/translate/translate/runs/4028529279?check_suite_focus=true).

The guilty commit is 4ab17bb, reverting it makes it work again.

The text was updated successfully, but these errors were encountered:

ptmcg · 2021-10-29T15:33:11Z

I've downloaded translate-toolkit and tried to create a minimal extract of your parser to test it in isolation, but am unable to reproduce any parsing error on 3.0.2. Without a repro example, I can't really tell which parts of pyparsing need to change, or know when I have fixed the problem.

Here is the code I'm using to test your parser:

# https://github.com/pyparsing/pyparsing/issues/323
# repo: https://github.com/translate/translate/
# parser: https://github.com/translate/translate/blob/master/translate/storage/rc.py

from pyparsing import *


RC_SOURCE = r"""
#include "other_file.h" // This must be ignored
LANGUAGE LANG_ENGLISH, SUBLANG_DEFAULT
/////////////////////////////////////////////////////////////////////////////
//
// Dialog
//
IDD_REGGHC_DIALOG DIALOGEX 0, 0, 211, 191
STYLE DS_SETFONT | DS_MODALFRAME | DS_FIXEDSYS | WS_POPUP | WS_VISIBLE | WS_CAPTION | WS_SYSMENU
EXSTYLE WS_EX_APPWINDOW
CAPTION "License dialog"
FONT 8, "MS Shell Dlg", 0, 0, 0x1
BEGIN
    PUSHBUTTON      "Help",ID_HELP,99,162,48,15
    PUSHBUTTON      "Close",IDCANCEL,151,162,48,15
    PUSHBUTTON      "Activate instalation",IDC_BUTTON1,74,76,76,18
    CTEXT           "My very good program",IDC_STATIC1,56,21,109,19,SS_SUNKEN
    CTEXT           "You can use it without registering it",IDC_STATIC,35,131,128,19,SS_SUNKEN
    PUSHBUTTON      "Offline",IDC_OFFLINE,149,108,42,13
    PUSHBUTTON      "See license",IDC_LICENCIA,10,162,85,15
    RTEXT           "If you don't have internet, please use magic.",IDC_STATIC,23,105,120,18
    ICON            IDR_MAINFRAME,IDC_STATIC,44,74,20,20
    CTEXT           "Use your finger to activate the program.",IDC_ACTIVADA,17,50,175,17
    ICON            IDR_MAINFRAME1,IDC_STATIC6,18,19,20,20
END
MainMenu MENU
{
    POPUP "&Debug"
    {
        MENUITEM "&Memory usage", ID_MEMORY
        POPUP
        {
            MENUITEM SEPARATOR
            MENUITEM "&Walk data heap", ID_WALK_HEAP
        }
    }
}
STRINGTABLE
BEGIN
ID_T_1 "Hello"
END
"""

POFILE = """
#: DIALOGEX.IDD_REGGHC_DIALOG.CAPTION
msgid "License dialog"
msgstr "Licenční dialog"
"""


def rc_statement():
    """
    Generate a RC statement parser that can be used to parse a RC file
    :rtype: pyparsing.ParserElement
    """

    one_line_comment = "//" + rest_of_line

    comments = c_style_comment ^ one_line_comment

    precompiler = Word("#", alphanums) + rest_of_line

    language_definition = (
        "LANGUAGE"
        + Word(alphas + "_").set_results_name("language")
        + Optional("," + Word(alphas + "_").set_results_name("sublanguage"))
    )

    block_start = (Keyword("{") | Keyword("BEGIN")).set_name("block_start")
    block_end = (Keyword("}") | Keyword("END")).set_name("block_end")

    reserved_words = block_start | block_end

    name_id = ~reserved_words + Word(alphas, alphanums + "_").set_name("name_id")

    numbers = Word(nums)

    integerconstant = numbers ^ Combine("0x" + numbers)

    constant = Combine(
        Optional(Keyword("NOT")) + (name_id | integerconstant),
        adjacent=False,
        join_string=" ",
    )
    autoname_elements()

    combined_constants = delimited_list(constant, "|")

    concatenated_string = OneOrMore(quoted_string)

    block_options = Optional(
        SkipTo(Keyword("CAPTION"), fail_on=block_start)("pre_caption")
        + Keyword("CAPTION")
        + quoted_string("caption")
    ) + SkipTo(block_start)("post_caption")

    undefined_control = (
        Group(
            name_id.set_results_name("id_control")
            + delimited_list(
                concatenated_string ^ constant ^ numbers ^ Group(combined_constants)
            ).set_results_name("values_")
        )
        | comments
    )

    block = block_start + ZeroOrMore(undefined_control)("controls") + block_end

    dialog = (
        name_id("block_id")
        + (Keyword("DIALOGEX") | Keyword("DIALOG"))("block_type")
        + block_options
        + block
    )

    string_table = Keyword("STRINGTABLE")("block_type") + block_options + block

    menu_item = Keyword("MENUITEM")("block_type") + (
        pyparsing_common.comma_separated_list("values_") | Keyword("SEPARATOR")
    )

    popup_block = Forward()

    popup_block <<= Group(
        Keyword("POPUP")("block_type")
        + Optional(quoted_string("caption"))
        + block_start
        + ZeroOrMore(Group(menu_item | popup_block))("elements")
        + block_end
    )("popups*")

    menu = (
        name_id("block_id")
        + Keyword("MENU")("block_type")
        + block_options
        + block_start
        + ZeroOrMore(popup_block)
        + block_end
    )
    autoname_elements()
    return comments ^ precompiler ^ language_definition ^ dialog ^ string_table ^ menu

import pyparsing
print(pyparsing.__version_info__)

parser = rc_statement()
parser.create_diagram("translate_toolkit_rc_parser.html")

parser.run_tests([
    """\r\nSTRINGTABLE\r\nBEGIN\r\n// Comment\r\nIDS_1 "Copied"\r\nEND\r\n""",
    """\r\nSTRINGTABLE\r\nBEGIN\r\n    // Comment\r\n    IDS_1                   "Zkopirovano"\r\nEND\r\n""",
    ])
print()

result = Group(parser)[...].parse_string(RC_SOURCE, parse_all=True)
result.pprint()

nijel · 2021-10-31T12:07:03Z

I was also not able to extract smaller testcase so far, but here is my observation so far:

The difference is in transform_string result - with 3.0.1 it includes leading \r\n in 3.0.2-3.0.4 it is ommitted
The actual problematic code are following two lines, commenting them out fixes our tests:

pyparsing/pyparsing/core.py

Line 3850 in e269b0a

self.skipWhitespace = all(e.skipWhitespace for e in self.exprs)

pyparsing/pyparsing/core.py

Line 3859 in e269b0a

self.skipWhitespace = all(e.skipWhitespace for e in self.exprs)

ptmcg · 2021-11-01T03:38:54Z

Without being able to reproduce the problem or test case, we are stuck with trial and error. If I remove those lines, other pyparsing tests fail.

Can you restore those lines, but instead delete these lines in Or.parseImpl, and then rerun your tests?

    if all(e.callPreparse for e in self.exprs):
        loc = self.preParse(instring, loc)

https://github.com/pyparsing/pyparsing/blob/master/pyparsing/core.py#L3869-L3870

nijel · 2021-11-01T07:51:52Z

Okay, here is the reproducer:

from pyparsing import Keyword, Word, alphanums, alphas

block_start = (Keyword("{") | Keyword("BEGIN")).set_name("block_start")
block_end = (Keyword("}") | Keyword("END")).set_name("block_end")

reserved_words = block_start | block_end

name_id = ~reserved_words + Word(alphas, alphanums + "_").set_name("name_id")

dialog = name_id("block_id") + (Keyword("DIALOGEX") | Keyword("DIALOG"))("block_type")

string_table = Keyword("STRINGTABLE")("block_type")

parser = dialog ^ string_table

parser.addParseAction(lambda x: x)
result = parser.transformString(
    b"""\r\nSTRINGTABLE\r\nBEGIN\r\n// Comment\r\nIDS_1 "Copied"\r\nEND\r\n""".decode()
)
print(repr(result))
assert result[:2] == "\r\n"

The change you have suggested in Or.parseImpl has no effect here.

ptmcg · 2021-11-01T13:45:26Z

This is a huge help, and reproduces the problem nicely, thanks!

nijel · 2021-11-01T14:51:27Z

It can be probably simplified even more, I just didn't have time to play with it further. Maybe following parser will be enough to trigger this as well:

parser = ~Keyword("BEGIN") ^ Keyword("STRINGTABLE")

ptmcg · 2021-11-01T14:59:06Z

Yep. Even this triggers it:

~Empty() | Empty()

The '~' (generates a NotAny expression), as part of an Or or MatchFirst, triggers the problem. So I have a pretty minimal repro test case to work with now, but I'll also validate against your more complete example once I have a possible fix.

ptmcg · 2021-11-02T09:21:52Z

Ok, I've come up with a fix, BUT it is a change in NotAny that makes me a little anxious. The change passes all my unit tests, but as we have seen in the past week or so, there are many more parser cases out there than are covered by my test suite. So I am a little reluctant to stir things up again. This is the kind of API change that should really go through the pre-release phases.

However, I think I have found a workaround in your parser, that works with the current released version (and should work with all the versions, I think). Replace:

name_id = ~reserved_words + Word(alphas, alphanums + "_").set_name("name_id")

with

name_id = Combine(~reserved_words + Word(alphas, alphanums + "_")).set_name("name_id")

Please give this a try.

Otherwise, I would ask you to pin your pyparsing version to an existing version that works for you, until I can introduce this other change in a 3.1.0 pre-release.

nijel · 2021-11-02T13:44:09Z

The workaround with Combine doesn't work (or at least not with other changes) - it then seems to parse only first letter instead of whole word. It might be a bug in our parser as well, TBH my knowledge of this code is quite limited...

ptmcg · 2021-11-02T15:52:00Z

ok. I won't be able to look at this again for a few days now, so for now you should probably pin to 2.4.7. I would not recommend pinning to 3.0.1, it is broken in some other areas. I'll leave this open so I can revisit - hopefully this weekend.

nijel · 2021-11-03T08:24:43Z

I've already ported the code to 3 API, so pinning to 2.4.7 would need reverting these changes. We will stick with 3.0.1 for now as it works fine in the test suite and I believe it covers most of the use cases.

ptmcg · 2021-11-08T04:09:11Z

I have a modified version of rc.py and po2rc.py that work with pyparsing 3.0.5 (in the attached zip file). I think that there are some issues with the way the negative lookahead created with the ~ operator (generating a NotAny) has changed when used in transform_string. I've reworked the rc grammar to move the negative lookahead into the corresponding ZeroOrMore constructs, where it is behaving correctly now.
modified_rc_files.zip

See pyparsing/pyparsing#323

nijel · 2021-11-08T07:43:13Z

Thanks, that makes sense. I've turned it into translate/translate#4468

See pyparsing/pyparsing#323

This is needed to address bug in the translate-toolkit RC parser, see pyparsing/pyparsing#323

ptmcg · 2021-11-13T09:57:06Z

Can this issue be closed now?

nijel · 2021-11-13T11:49:10Z

For translate-toolkit it is fixed. It's up to you if you consider this big in pyparsing or in our parser

ptmcg · 2021-11-13T16:14:08Z

The negative lookahead works in pyparsing for normal calls to parse_string, but is inherently limited when used in scan_string or its two related methods transform_string and search_string (which are just thin wrappers around scan_string). Say you have a grammar to parse identifiers, that don't start with keywords, and input and print are keywords.

text = "input x y print x y"

keyword = one_of("print input", as_keyword=True)
identifier = ~keyword + Word(alphas, alphanums)

parse_string does what we want

result = (keyword + Group(ZeroOrMore(identifier))).parse_string(text)
print(result)

['input', ['x', 'y']]

But not so with scan_string. Scanning a character at a time will not match 'input', but will match 'nput'

print(identifier.search_string(text))

[['nput'], ['x'], ['y'], ['rint'], ['x'], ['y']]

Upcase identifiers will upcase more than we want:

identifier.add_parse_action(pyparsing_common.upcase_tokens)
print(identifier.transform_string(text))

iNPUTXY pRINTXY

Making our transformer recognize keywords, but not transform them is one solution.

identifier = Word(alphas, alphanums)
identifier.add_parse_action(pyparsing_common.upcase_tokens)
transformer = keyword | identifier
print(transformer.transform_string(text))

input X Y print X Y

Adding WordStart before the negative lookahead is another.

identifier = WordStart() + ~keyword + Word(alphas, alphanums)
print(identifier.search_string(text))
identifier.add_parse_action(pyparsing_common.upcase_tokens)
print(identifier.transform_string(text))

[['x'], ['y'], ['x'], ['y']]
input X Y print X Y

With this simple parser, I couldn't come up with an easy example using the stop_on approach that I sent you earlier. After making this write-up, I think adding WordStart() before the negative lookahead might actually be the cleaner solution.

And lastly, why did this work in 2.4.7 but break in 3.0? I think it goes back to how whitespace skipping was changed for the ParseExpression subclasses Or and MatchFirst, so that these expressions would not implicitly override leaveWhitespace() that might be defined in one of their alternative sub-expressions.

So for pyparsing, I'd say it's not so much a bug as a limitation - if anything, the bug is that it is not documented better. I'll probably lift much of this comment into a wiki page or blog post to try to address that.

nijel · 2021-11-14T12:08:51Z

Documenting this is certainly a good approach.

ptmcg · 2022-01-23T09:52:41Z

Documented in the Common Pitfalls wiki page:
https://github.com/pyparsing/pyparsing/wiki/Common-Pitfalls-When-Writing-Parsers#using-scan_string-with-an-expression-using-negative-lookahead--operator

This was referenced Oct 28, 2021

pyparsing upgrade WeblateOrg/weblate#6727

Merged

build(deps): bump pyparsing from 3.0.1 to 3.0.3 translate/translate#4457

Closed

nijel mentioned this issue Nov 2, 2021

build(deps): bump pyparsing from 3.0.1 to 3.0.4 translate/translate#4463

Closed

nijel changed the title ~~Regression in 3.0.2 and 3.0.3~~ Regression in 3.0.2, 3.0.3, and 3.0.4 Nov 2, 2021

nijel pushed a commit to nijel/translate that referenced this issue Nov 8, 2021

rc: Parser fixes for negative lookup

1439a02

See pyparsing/pyparsing#323

nijel mentioned this issue Nov 8, 2021

rc: Parser fixes for negative lookup translate/translate#4468

Merged

nijel pushed a commit to translate/translate that referenced this issue Nov 8, 2021

rc: Parser fixes for negative lookup

5f132ad

See pyparsing/pyparsing#323

nijel added a commit to WeblateOrg/weblate that referenced this issue Nov 8, 2021

deps: Update pyparsing and translate-toolkit

563bbc7

This is needed to address bug in the translate-toolkit RC parser, see pyparsing/pyparsing#323

ptmcg added a commit that referenced this issue Nov 12, 2021

Add tests written when working on #323

85fc927

rossbar mentioned this issue Nov 18, 2021

Regression in v3.0.2 onward #338

Closed

ptmcg closed this as completed Jan 23, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Regression in 3.0.2, 3.0.3, and 3.0.4 #323

Regression in 3.0.2, 3.0.3, and 3.0.4 #323

nijel commented Oct 28, 2021

ptmcg commented Oct 29, 2021

nijel commented Oct 31, 2021

ptmcg commented Nov 1, 2021

nijel commented Nov 1, 2021

ptmcg commented Nov 1, 2021

nijel commented Nov 1, 2021

ptmcg commented Nov 1, 2021

ptmcg commented Nov 2, 2021

nijel commented Nov 2, 2021

ptmcg commented Nov 2, 2021

nijel commented Nov 3, 2021

ptmcg commented Nov 8, 2021

nijel commented Nov 8, 2021 •

edited

ptmcg commented Nov 13, 2021

nijel commented Nov 13, 2021

ptmcg commented Nov 13, 2021 •

edited

nijel commented Nov 14, 2021

ptmcg commented Jan 23, 2022

Regression in 3.0.2, 3.0.3, and 3.0.4 #323

Regression in 3.0.2, 3.0.3, and 3.0.4 #323

Comments

nijel commented Oct 28, 2021

ptmcg commented Oct 29, 2021

nijel commented Oct 31, 2021

ptmcg commented Nov 1, 2021

nijel commented Nov 1, 2021

ptmcg commented Nov 1, 2021

nijel commented Nov 1, 2021

ptmcg commented Nov 1, 2021

ptmcg commented Nov 2, 2021

nijel commented Nov 2, 2021

ptmcg commented Nov 2, 2021

nijel commented Nov 3, 2021

ptmcg commented Nov 8, 2021

nijel commented Nov 8, 2021 • edited

ptmcg commented Nov 13, 2021

nijel commented Nov 13, 2021

ptmcg commented Nov 13, 2021 • edited

nijel commented Nov 14, 2021

ptmcg commented Jan 23, 2022

nijel commented Nov 8, 2021 •

edited

ptmcg commented Nov 13, 2021 •

edited