
Improve Requirement/Marker parser with context-sensitive tokenisation #624

Merged
merged 23 commits into pypa:main from requirement-parser-rewrite on Dec 7, 2022

Conversation


@pradyunsg pradyunsg commented Dec 3, 2022

Closes #336
Closes #432
Closes #529
Closes #592
Closes #593
Closes #618
Closes #619
Closes #621
Closes #622
Closes #623

This builds upon #484.

The main change here is to couple the tokenisation with the parsing, with the regex matches being done based on what the parser is expecting. This should make the tokenisation parts much faster. The parser now explicitly handles whitespace as a token (explicitly "consuming" it during parsing), and parses markers as part of parsing a requirement string (instead of delegating to Marker) to provide better syntax error messages.
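To illustrate the idea of "regex matches being done based on what the parser is expecting", here is a minimal, hypothetical sketch of context-sensitive tokenisation. The names (`Tokenizer`, `consume`, `expect`, the token rules, `parse_extras`) are illustrative only, not packaging's actual internals:

```python
import re

# Instead of pre-tokenising the whole string with one master regex, the
# parser asks the tokenizer for a specific token kind at the current
# position, so only the pattern the grammar expects is tried there.
RULES = {
    "IDENTIFIER": re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*"),
    "LEFT_BRACKET": re.compile(r"\["),
    "RIGHT_BRACKET": re.compile(r"\]"),
    "COMMA": re.compile(r","),
    "WS": re.compile(r"\s*"),  # whitespace is an explicit, consumable token
}


class Tokenizer:
    def __init__(self, source: str) -> None:
        self.source = source
        self.position = 0

    def consume(self, name: str):
        """Try to match one token kind at the current position."""
        match = RULES[name].match(self.source, self.position)
        if match is None:
            return None
        self.position = match.end()
        return match.group()

    def expect(self, name: str) -> str:
        """Match one token kind, or fail with a targeted error."""
        token = self.consume(name)
        if token is None:
            raise SyntaxError(f"Expected {name} at position {self.position}")
        return token


def parse_extras(tokenizer: Tokenizer):
    """Parse '[extra1, extra2]' by requesting only the expected tokens."""
    tokenizer.expect("LEFT_BRACKET")
    extras = []
    while True:
        tokenizer.consume("WS")
        extras.append(tokenizer.expect("IDENTIFIER"))
        tokenizer.consume("WS")
        if tokenizer.consume("COMMA") is None:
            break
    tokenizer.expect("RIGHT_BRACKET")
    return extras
```

Because the tokenizer only tries the pattern the grammar calls for, a stray character fails immediately with a message naming the token that was expected (e.g. `Expected RIGHT_BRACKET`), rather than producing a generic "invalid requirement" error.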

I also took the opportunity to improve the error messages presented by this parser, to more clearly indicate what went wrong and to provide contextual information about the syntax error to the caller (in an exception).

Demo of the error messages:

packaging.requirements.InvalidRequirement: Expected closing RIGHT_BRACKET
    package[a ; python_version <= '3.2'
           ~~~^
packaging.requirements.InvalidRequirement: Expected end or semicolon (after URL and whitespace)
    package @ https://example.com/; python_version <= '3.2'
              ~~~~~~~~~~~~~~~~~~~~~~^
packaging.requirements.InvalidRequirement: Expected end or semicolon (after name and no valid version specifier)
    package#
           ^
packaging.requirements.InvalidRequirement: Expected version after operator
    package==random
             ^
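The underlined spans above can be rendered from the position information the parser carries. A minimal sketch of how such a message could be assembled (`format_error` is a hypothetical helper, not packaging's actual formatting code):

```python
def format_error(message: str, source: str, start: int, end: int) -> str:
    """Render a parser error with the offending span underlined.

    `start`..`end` is the half-open span the parser was examining when it
    failed; tildes cover that span and the caret marks the failure point.
    """
    marker = " " * start + "~" * (end - start) + "^"
    return f"{message}\n    {source}\n    {marker}"


print(format_error(
    "Expected closing RIGHT_BRACKET",
    "package[a ; python_version <= '3.2'",
    7, 10,
))
```

Carrying the span in the exception (rather than only a prose message) is what lets callers reproduce this kind of highlighting themselves.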

/cc @hrnciar for inputs and thoughts!

@pradyunsg pradyunsg marked this pull request as draft December 3, 2022 19:03
@pradyunsg pradyunsg force-pushed the requirement-parser-rewrite branch 5 times, most recently from 7f6ff35 to 5cbd07f on December 3, 2022 20:15
@pradyunsg

Ahaha, I covered an error case that the existing tests don't exercise, triggering coverage-related CI errors.

@pradyunsg pradyunsg changed the title Rewrite the requirement and marker parser Improve the requirement and marker parser with context-sensitive tokenisation Dec 5, 2022
@pradyunsg pradyunsg changed the title Improve the requirement and marker parser with context-sensitive tokenisation Improve the requirement parser with context-sensitive tokenisation Dec 5, 2022
@pradyunsg pradyunsg changed the title Improve the requirement parser with context-sensitive tokenisation Improve Requirement/Marker parser with context-sensitive tokenisation Dec 5, 2022
- This makes it a fully-fleshed-out class for holding data.
- This also pulls out the error message formatting logic into the error itself.
- This helps pyright better understand what's happening.
- These provide a consistent call signature into the parser. This also decouples the tokenizer from the `Marker` class.
- This makes it easier to read through the function, with a clearer name.
- This draws a clear distinction between this and the user-visible `Requirement` object.
- This reduces how many regex patterns would be matched against the input while also enabling the parser to resolve ambiguity in place.
- This allows for nicer error messages, which show the entire requirement string and highlight the marker in particular.
- This eliminates a point of duplication and ensures that the error messaging is consistent.
- This makes it easier to identify what position the parser was checking, and presents the relevant context to the reader.
- This is more permissive and better handles tabs used as whitespace.
- This follows what PEP 508's grammar says is a valid identifier.
- This makes it possible for arbitrary matches to be used within requirement specifiers without special constraints.
- This makes it clearer in the docstring grammars that a name without any specifier is valid.
- This makes the control flow slightly easier to understand.
- This now exercises more edge cases and validates that the error messages are well-formed.
@pradyunsg pradyunsg marked this pull request as ready for review December 5, 2022 23:43

pradyunsg commented Dec 5, 2022

"How do you relax after dealing with a tricky Python packaging thing at work?"
"I spend 3 hours polishing up a parser I worked on and then break up a giant WIP commit into something legible."

This PR is now ready for review, with rewritten tests for the requirements parser. I like them because they exercise a lot of test strings (the earlier test suite only checked the parser against ~45 strings in the entire run), and many of these strings helped identify logical mistakes in the parser's reworking. Notably, the parser on main fails various `-k basic` tests added in this PR, which both 21.3 and this PR pass.

I reckon the easiest way to review the changes here would be to look at the test suite in its final state, and then to go commit-by-commit to see how things evolve. There's still a giant commit with the main reworking -- I couldn't figure out how to break that up in the time I had at hand. :)

@pradyunsg

@uranusjr @encukou @hroncok @hrnciar: Y'all reviewed #484, so... would one or more of you be willing to review this PR (which is basically a follow-up to that)? Sorry for the spammy mention. 😅

- This makes it easier to understand what the state of the parser is and what is expected at that point.

hrnciar commented Dec 6, 2022

I went through all commits and nothing else caught my eye. Looks good to me. Thank you for the improvements :)

- This ensures that these error tracebacks correctly describe the causality between the two errors.
- This ensures that a marker with whitespace around it is parsed correctly.
- This is better aligned with the naming from PEP 508.

pradyunsg commented Dec 7, 2022

I've gone through the issue tracker to check which other issues this PR fixes, and... it's quite a few. Given that, I've stretched this PR a little bit to cover #432 as well.

- This ensures that these are only parsed when they're independent words.
- The listed operators were incorrect.
@pradyunsg

ff75da7 (#624) removes scope for any regressions along the lines of #618 in the future.

- This is more consistent with the rest of the format, which is largely whitespace-agnostic.

@brettcannon brettcannon left a comment


Quick perusal didn't turn anything up, but I'm not a parsing expert. 😅

@pradyunsg pradyunsg merged commit b997a48 into pypa:main Dec 7, 2022
@pradyunsg pradyunsg deleted the requirement-parser-rewrite branch December 7, 2022 21:50
@pradyunsg
Copy link
Member Author

Thanks @brettcannon and @hrnciar for the reviews!

@hrnciar hrnciar left a comment


Apparently, my comment wasn't posted and was hanging in the pending state. I'll file a PR to remove it.

PR: #630

(Resolved review thread on packaging/_tokenizer.py)