Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression in 3.0.2, 3.0.3, and 3.0.4 #323

Closed
nijel opened this issue Oct 28, 2021 · 18 comments
Closed

Regression in 3.0.2, 3.0.3, and 3.0.4 #323

nijel opened this issue Oct 28, 2021 · 18 comments

Comments

@nijel
Copy link
Contributor

nijel commented Oct 28, 2021

In translate-toolkit we use pyparsing for parsing Windows RC files.

I did port the code to 3.0.1 and everything worked fine. On 3.0.2 and 3.0.3 we get testsuite failures (see https://github.com/translate/translate/runs/4028529279?check_suite_focus=true).

The guilty commit is 4ab17bb, reverting it makes it work again.

@ptmcg
Copy link
Member

ptmcg commented Oct 29, 2021

I've downloaded translate-toolkit and tried to create a minimal extract of your parser to test it in isolation, but am unable to reproduce any parsing error on 3.0.2. Without a repro example, I can't really tell which parts of pyparsing need to change, or know when I have fixed the problem.

Here is the code I'm using to test your parser:

# https://github.com/pyparsing/pyparsing/issues/323
# repo: https://github.com/translate/translate/
# parser: https://github.com/translate/translate/blob/master/translate/storage/rc.py

from pyparsing import *


RC_SOURCE = r"""
#include "other_file.h" // This must be ignored
LANGUAGE LANG_ENGLISH, SUBLANG_DEFAULT
/////////////////////////////////////////////////////////////////////////////
//
// Dialog
//
IDD_REGGHC_DIALOG DIALOGEX 0, 0, 211, 191
STYLE DS_SETFONT | DS_MODALFRAME | DS_FIXEDSYS | WS_POPUP | WS_VISIBLE | WS_CAPTION | WS_SYSMENU
EXSTYLE WS_EX_APPWINDOW
CAPTION "License dialog"
FONT 8, "MS Shell Dlg", 0, 0, 0x1
BEGIN
    PUSHBUTTON      "Help",ID_HELP,99,162,48,15
    PUSHBUTTON      "Close",IDCANCEL,151,162,48,15
    PUSHBUTTON      "Activate instalation",IDC_BUTTON1,74,76,76,18
    CTEXT           "My very good program",IDC_STATIC1,56,21,109,19,SS_SUNKEN
    CTEXT           "You can use it without registering it",IDC_STATIC,35,131,128,19,SS_SUNKEN
    PUSHBUTTON      "Offline",IDC_OFFLINE,149,108,42,13
    PUSHBUTTON      "See license",IDC_LICENCIA,10,162,85,15
    RTEXT           "If you don't have internet, please use magic.",IDC_STATIC,23,105,120,18
    ICON            IDR_MAINFRAME,IDC_STATIC,44,74,20,20
    CTEXT           "Use your finger to activate the program.",IDC_ACTIVADA,17,50,175,17
    ICON            IDR_MAINFRAME1,IDC_STATIC6,18,19,20,20
END
MainMenu MENU
{
    POPUP "&Debug"
    {
        MENUITEM "&Memory usage", ID_MEMORY
        POPUP
        {
            MENUITEM SEPARATOR
            MENUITEM "&Walk data heap", ID_WALK_HEAP
        }
    }
}
STRINGTABLE
BEGIN
ID_T_1 "Hello"
END
"""

POFILE = """
#: DIALOGEX.IDD_REGGHC_DIALOG.CAPTION
msgid "License dialog"
msgstr "Licenční dialog"
"""


def rc_statement():
    """
    Generate a RC statement parser that can be used to parse a RC file
    :rtype: pyparsing.ParserElement
    """

    one_line_comment = "//" + rest_of_line

    comments = c_style_comment ^ one_line_comment

    precompiler = Word("#", alphanums) + rest_of_line

    language_definition = (
        "LANGUAGE"
        + Word(alphas + "_").set_results_name("language")
        + Optional("," + Word(alphas + "_").set_results_name("sublanguage"))
    )

    block_start = (Keyword("{") | Keyword("BEGIN")).set_name("block_start")
    block_end = (Keyword("}") | Keyword("END")).set_name("block_end")

    reserved_words = block_start | block_end

    name_id = ~reserved_words + Word(alphas, alphanums + "_").set_name("name_id")

    numbers = Word(nums)

    integerconstant = numbers ^ Combine("0x" + numbers)

    constant = Combine(
        Optional(Keyword("NOT")) + (name_id | integerconstant),
        adjacent=False,
        join_string=" ",
    )
    autoname_elements()

    combined_constants = delimited_list(constant, "|")

    concatenated_string = OneOrMore(quoted_string)

    block_options = Optional(
        SkipTo(Keyword("CAPTION"), fail_on=block_start)("pre_caption")
        + Keyword("CAPTION")
        + quoted_string("caption")
    ) + SkipTo(block_start)("post_caption")

    undefined_control = (
        Group(
            name_id.set_results_name("id_control")
            + delimited_list(
                concatenated_string ^ constant ^ numbers ^ Group(combined_constants)
            ).set_results_name("values_")
        )
        | comments
    )

    block = block_start + ZeroOrMore(undefined_control)("controls") + block_end

    dialog = (
        name_id("block_id")
        + (Keyword("DIALOGEX") | Keyword("DIALOG"))("block_type")
        + block_options
        + block
    )

    string_table = Keyword("STRINGTABLE")("block_type") + block_options + block

    menu_item = Keyword("MENUITEM")("block_type") + (
        pyparsing_common.comma_separated_list("values_") | Keyword("SEPARATOR")
    )

    popup_block = Forward()

    popup_block <<= Group(
        Keyword("POPUP")("block_type")
        + Optional(quoted_string("caption"))
        + block_start
        + ZeroOrMore(Group(menu_item | popup_block))("elements")
        + block_end
    )("popups*")

    menu = (
        name_id("block_id")
        + Keyword("MENU")("block_type")
        + block_options
        + block_start
        + ZeroOrMore(popup_block)
        + block_end
    )
    autoname_elements()
    return comments ^ precompiler ^ language_definition ^ dialog ^ string_table ^ menu

import pyparsing
print(pyparsing.__version_info__)

parser = rc_statement()
parser.create_diagram("translate_toolkit_rc_parser.html")

parser.run_tests([
    """\r\nSTRINGTABLE\r\nBEGIN\r\n// Comment\r\nIDS_1 "Copied"\r\nEND\r\n""",
    """\r\nSTRINGTABLE\r\nBEGIN\r\n    // Comment\r\n    IDS_1                   "Zkopirovano"\r\nEND\r\n""",
    ])
print()

result = Group(parser)[...].parse_string(RC_SOURCE, parse_all=True)
result.pprint()

@nijel
Copy link
Contributor Author

nijel commented Oct 31, 2021

I was also not able to extract smaller testcase so far, but here is my observation so far:

  • The difference is in transform_string result - with 3.0.1 it includes leading \r\n in 3.0.2-3.0.4 it is ommitted
  • The actual problematic code are following two lines, commenting them out fixes our tests:

self.skipWhitespace = all(e.skipWhitespace for e in self.exprs)

self.skipWhitespace = all(e.skipWhitespace for e in self.exprs)

@ptmcg
Copy link
Member

ptmcg commented Nov 1, 2021

Without being able to reproduce the problem or test case, we are stuck with trial and error. If I remove those lines, other pyparsing tests fail.

Can you restore those lines, but instead delete these lines in Or.parseImpl, and then rerun your tests?

    if all(e.callPreparse for e in self.exprs):
        loc = self.preParse(instring, loc)

https://github.com/pyparsing/pyparsing/blob/master/pyparsing/core.py#L3869-L3870

@nijel
Copy link
Contributor Author

nijel commented Nov 1, 2021

Okay, here is the reproducer:

from pyparsing import Keyword, Word, alphanums, alphas

block_start = (Keyword("{") | Keyword("BEGIN")).set_name("block_start")
block_end = (Keyword("}") | Keyword("END")).set_name("block_end")

reserved_words = block_start | block_end

name_id = ~reserved_words + Word(alphas, alphanums + "_").set_name("name_id")

dialog = name_id("block_id") + (Keyword("DIALOGEX") | Keyword("DIALOG"))("block_type")

string_table = Keyword("STRINGTABLE")("block_type")

parser = dialog ^ string_table

parser.addParseAction(lambda x: x)
result = parser.transformString(
    b"""\r\nSTRINGTABLE\r\nBEGIN\r\n// Comment\r\nIDS_1 "Copied"\r\nEND\r\n""".decode()
)
print(repr(result))
assert result[:2] == "\r\n"

The change you have suggested in Or.parseImpl has no effect here.

@ptmcg
Copy link
Member

ptmcg commented Nov 1, 2021

This is a huge help, and reproduces the problem nicely, thanks!

@nijel
Copy link
Contributor Author

nijel commented Nov 1, 2021

It can be probably simplified even more, I just didn't have time to play with it further. Maybe following parser will be enough to trigger this as well:

parser = ~Keyword("BEGIN") ^ Keyword("STRINGTABLE")

@ptmcg
Copy link
Member

ptmcg commented Nov 1, 2021

Yep. Even this triggers it:

~Empty() | Empty()

The '~' (generates a NotAny expression), as part of an Or or MatchFirst, triggers the problem. So I have a pretty minimal repro test case to work with now, but I'll also validate against your more complete example once I have a possible fix.

@ptmcg
Copy link
Member

ptmcg commented Nov 2, 2021

Ok, I've come up with a fix, BUT it is a change in NotAny that makes me a little anxious. The change passes all my unit tests, but as we have seen in the past week or so, there are many more parser cases out there than are covered by my test suite. So I am a little reluctant to stir things up again. This is the kind of API change that should really go through the pre-release phases.

However, I think I have found a workaround in your parser, that works with the current released version (and should work with all the versions, I think). Replace:

name_id = ~reserved_words + Word(alphas, alphanums + "_").set_name("name_id")

with

name_id = Combine(~reserved_words + Word(alphas, alphanums + "_")).set_name("name_id")

Please give this a try.

Otherwise, I would ask you to pin your pyparsing version to an existing version that works for you, until I can introduce this other change in a 3.1.0 pre-release.

@nijel
Copy link
Contributor Author

nijel commented Nov 2, 2021

The workaround with Combine doesn't work (or at least not with other changes) - it then seems to parse only first letter instead of whole word. It might be a bug in our parser as well, TBH my knowledge of this code is quite limited...

@nijel nijel changed the title Regression in 3.0.2 and 3.0.3 Regression in 3.0.2, 3.0.3, and 3.0.4 Nov 2, 2021
@ptmcg
Copy link
Member

ptmcg commented Nov 2, 2021

ok. I won't be able to look at this again for a few days now, so for now you should probably pin to 2.4.7. I would not recommend pinning to 3.0.1, it is broken in some other areas. I'll leave this open so I can revisit - hopefully this weekend.

@nijel
Copy link
Contributor Author

nijel commented Nov 3, 2021

I've already ported the code to 3 API, so pinning to 2.4.7 would need reverting these changes. We will stick with 3.0.1 for now as it works fine in the test suite and I believe it covers most of the use cases.

@ptmcg
Copy link
Member

ptmcg commented Nov 8, 2021

I have a modified version of rc.py and po2rc.py that work with pyparsing 3.0.5 (in the attached zip file). I think that there are some issues with the way the negative lookahead created with the ~ operator (generating a NotAny) has changed when used in transform_string. I've reworked the rc grammar to move the negative lookahead into the corresponding ZeroOrMore constructs, where it is behaving correctly now.
modified_rc_files.zip

@nijel
Copy link
Contributor Author

nijel commented Nov 8, 2021

Thanks, that makes sense. I've turned it into translate/translate#4468

nijel pushed a commit to translate/translate that referenced this issue Nov 8, 2021
nijel added a commit to WeblateOrg/weblate that referenced this issue Nov 8, 2021
This is needed to address bug in the translate-toolkit RC parser, see
pyparsing/pyparsing#323
ptmcg added a commit that referenced this issue Nov 12, 2021
@ptmcg
Copy link
Member

ptmcg commented Nov 13, 2021

Can this issue be closed now?

@nijel
Copy link
Contributor Author

nijel commented Nov 13, 2021

For translate-toolkit it is fixed. It's up to you if you consider this big in pyparsing or in our parser

@ptmcg
Copy link
Member

ptmcg commented Nov 13, 2021

The negative lookahead works in pyparsing for normal calls to parse_string, but is inherently limited when used in scan_string or its two related methods transform_string and search_string (which are just thin wrappers around scan_string). Say you have a grammar to parse identifiers, that don't start with keywords, and input and print are keywords.

text = "input x y print x y"

keyword = one_of("print input", as_keyword=True)
identifier = ~keyword + Word(alphas, alphanums)

parse_string does what we want

result = (keyword + Group(ZeroOrMore(identifier))).parse_string(text)
print(result)

['input', ['x', 'y']]

But not so with scan_string. Scanning a character at a time will not match 'input', but will match 'nput'

print(identifier.search_string(text))

[['nput'], ['x'], ['y'], ['rint'], ['x'], ['y']]

Upcase identifiers will upcase more than we want:

identifier.add_parse_action(pyparsing_common.upcase_tokens)
print(identifier.transform_string(text))

iNPUTXY pRINTXY

Making our transformer recognize keywords, but not transform them is one solution.

identifier = Word(alphas, alphanums)
identifier.add_parse_action(pyparsing_common.upcase_tokens)
transformer = keyword | identifier
print(transformer.transform_string(text))

input X Y print X Y

Adding WordStart before the negative lookahead is another.

identifier = WordStart() + ~keyword + Word(alphas, alphanums)
print(identifier.search_string(text))
identifier.add_parse_action(pyparsing_common.upcase_tokens)
print(identifier.transform_string(text))

[['x'], ['y'], ['x'], ['y']]
input X Y print X Y

With this simple parser, I couldn't come up with an easy example using the stop_on approach that I sent you earlier. After making this write-up, I think adding WordStart() before the negative lookahead might actually be the cleaner solution.

And lastly, why did this work in 2.4.7 but break in 3.0? I think it goes back to how whitespace skipping was changed for the ParseExpression subclasses Or and MatchFirst, so that these expressions would not implicitly override leaveWhitespace() that might be defined in one of their alternative sub-expressions.

So for pyparsing, I'd say it's not so much a bug as a limitation - if anything, the bug is that it is not documented better. I'll probably lift much of this comment into a wiki page or blog post to try to address that.

@nijel
Copy link
Contributor Author

nijel commented Nov 14, 2021

Documenting this is certainly a good approach.

@ptmcg
Copy link
Member

ptmcg commented Jan 23, 2022

@ptmcg ptmcg closed this as completed Jan 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants