
Improve Requirement/Marker parser with context-sensitive tokenisation #624

Merged
merged 23 commits into pypa:main from requirement-parser-rewrite on Dec 7, 2022

Conversation


@pradyunsg pradyunsg commented Dec 3, 2022

Closes #336
Closes #432
Closes #529
Closes #592
Closes #593
Closes #618
Closes #619
Closes #621
Closes #622
Closes #623

This builds upon #484.

The main change here is to couple the tokenisation with the parsing, with the regex matches being done based on what the parser is expecting. This should make the tokenisation parts much faster. The parser now explicitly handles whitespace as a token (explicitly "consuming" it during parsing), and parses markers as part of parsing a requirement string (instead of delegating to Marker) to provide better syntax error messages.
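To illustrate the idea of "regex matches being done based on what the parser is expecting", here is a minimal, hypothetical sketch of context-sensitive tokenisation. The names (`Tokenizer`, `consume`, `expect`, the token rules, `parse_extras`) are illustrative only, not packaging's actual internals:

```python
import re

# Instead of pre-tokenising the whole string with one master regex, the
# parser asks the tokenizer for a specific token kind at the current
# position, so only the pattern the grammar expects is tried there.
RULES = {
    "IDENTIFIER": re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*"),
    "LEFT_BRACKET": re.compile(r"\["),
    "RIGHT_BRACKET": re.compile(r"\]"),
    "COMMA": re.compile(r","),
    "WS": re.compile(r"\s*"),  # whitespace is an explicit, consumable token
}


class Tokenizer:
    def __init__(self, source: str) -> None:
        self.source = source
        self.position = 0

    def consume(self, name: str):
        """Try to match one token kind at the current position."""
        match = RULES[name].match(self.source, self.position)
        if match is None:
            return None
        self.position = match.end()
        return match.group()

    def expect(self, name: str) -> str:
        """Match one token kind, or fail with a targeted error."""
        token = self.consume(name)
        if token is None:
            raise SyntaxError(f"Expected {name} at position {self.position}")
        return token


def parse_extras(tokenizer: Tokenizer):
    """Parse '[extra1, extra2]' by requesting only the expected tokens."""
    tokenizer.expect("LEFT_BRACKET")
    extras = []
    while True:
        tokenizer.consume("WS")
        extras.append(tokenizer.expect("IDENTIFIER"))
        tokenizer.consume("WS")
        if tokenizer.consume("COMMA") is None:
            break
    tokenizer.expect("RIGHT_BRACKET")
    return extras
```

Because the tokenizer only tries the pattern the grammar calls for, a stray character fails immediately with a message naming the token that was expected (e.g. `Expected RIGHT_BRACKET`), rather than producing a generic "invalid requirement" error.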

I also took the opportunity to improve the error messages presented by this parser, to more clearly indicate what went wrong and to provide contextual information about the syntax error to the caller (in an exception).

Demo of the error messages:

packaging.requirements.InvalidRequirement: Expected closing RIGHT_BRACKET
    package[a ; python_version <= '3.2'
           ~~~^
packaging.requirements.InvalidRequirement: Expected end or semicolon (after URL and whitespace)
    package @ https://example.com/; python_version <= '3.2'
              ~~~~~~~~~~~~~~~~~~~~~~^
packaging.requirements.InvalidRequirement: Expected end or semicolon (after name and no valid version specifier)
    package#
           ^
packaging.requirements.InvalidRequirement: Expected version after operator
    package==random
             ^
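The underlined spans above can be rendered from the position information the parser carries. A minimal sketch of how such a message could be assembled (`format_error` is a hypothetical helper, not packaging's actual formatting code):

```python
def format_error(message: str, source: str, start: int, end: int) -> str:
    """Render a parser error with the offending span underlined.

    `start`..`end` is the half-open span the parser was examining when it
    failed; tildes cover that span and the caret marks the failure point.
    """
    marker = " " * start + "~" * (end - start) + "^"
    return f"{message}\n    {source}\n    {marker}"


print(format_error(
    "Expected closing RIGHT_BRACKET",
    "package[a ; python_version <= '3.2'",
    7, 10,
))
```

Carrying the span in the exception (rather than only a prose message) is what lets callers reproduce this kind of highlighting themselves.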

/cc @hrnciar for inputs and thoughts!

@pradyunsg pradyunsg marked this pull request as draft December 3, 2022 19:03
@pradyunsg pradyunsg force-pushed the requirement-parser-rewrite branch 5 times, most recently from 7f6ff35 to 5cbd07f on December 3, 2022 20:15
@pradyunsg

Ahaha, I covered an error case that the existing tests don't exercise, triggering coverage-related CI errors.

@pradyunsg pradyunsg changed the title Rewrite the requirement and marker parser Improve the requirement and marker parser with context-sensitive tokenisation Dec 5, 2022
@pradyunsg pradyunsg changed the title Improve the requirement and marker parser with context-sensitive tokenisation Improve the requirement parser with context-sensitive tokenisation Dec 5, 2022
@pradyunsg pradyunsg changed the title Improve the requirement parser with context-sensitive tokenisation Improve Requirement/Marker parser with context-sensitive tokenisation Dec 5, 2022
- This makes it a fully-fleshed-out class for holding data.
- This also pulls out the error message formatting logic into the error itself.
- This helps pyright better understand what's happening.
- These provide a consistent call signature into the parser. This also decouples the tokenizer from the `Marker` class.
- This makes it easier to read through the function, with a clearer name.
- This draws a clear distinction between this and the user-visible `Requirement` object.
- This reduces how many regex patterns would be matched against the input while also enabling the parser to resolve ambiguity in place.
- This allows for nicer error messages, which show the entire requirement string and highlight the marker in particular.
- This eliminates a point of duplication and ensures that the error messaging is consistent.
- This makes it easier to identify what position the parser was checking, and presents the relevant context to the reader.
- This is more permissive and better handles tabs used as whitespace.
- This follows what PEP 508's grammar says is a valid identifier.
- This makes it possible for arbitrary matches to be used within requirement specifiers without special constraints.
- This makes it clearer in the docstring grammars that a name without any specifier is valid.
- This makes the control flow slightly easier to understand.
- This now exercises more edge cases and validates that the error messages are well-formed.
@pradyunsg pradyunsg marked this pull request as ready for review December 5, 2022 23:43

pradyunsg commented Dec 5, 2022

"How do you relax after dealing with a tricky Python packaging thing at work?"
"I spend 3 hours polishing up a parser I worked on and then break up a giant WIP commit into something legible."

This PR is now ready for review, with rewritten tests for the requirements parser. I like them because they exercise a lot of test strings (the earlier test suite only checked the parser against ~45 strings in the entire run), and many of these strings helped identify logical mistakes in the parser's reworking. Notably, the parser on main fails various `-k basic` tests added in this PR, which both 21.3 and this PR pass.

I reckon the easiest way to review the changes here would be to look at the test suite in its final state, and then to go commit-by-commit to see how things evolve. There's still a giant commit with the main reworking -- I couldn't figure out how to break that up in the time I had at hand. :)

@pradyunsg

@uranusjr @encukou @hroncok @hrnciar: Y'all reviewed #484, so... would one or more of you be willing to review this PR (which is basically a follow-up to that)? Sorry for the spammy mention. 😅

- This makes it easier to understand what the state of the parser is and what is expected at that point.

hrnciar commented Dec 6, 2022

I went through all commits and nothing else caught my eye. Looks good to me. Thank you for the improvements :)

- This ensures that these error tracebacks correctly describe the causality between the two errors.
- This ensures that a marker with whitespace around it is parsed correctly.
- This is better aligned with the naming from PEP 508.

pradyunsg commented Dec 7, 2022

I've gone through the issue tracker to check which other issues this PR fixes, and... it's quite a few. Given that, I've stretched this PR a little bit to cover #432 as well.

- This ensures that these are only parsed when they're independent words.
- The listed operators were incorrect.
@pradyunsg

ff75da7 (#624) removes scope for any regressions along the lines of #618 in the future.

- This is more consistent with the rest of the format, which is largely whitespace-agnostic.

@brettcannon brettcannon left a comment


Quick perusal didn't turn anything up, but I'm not a parsing expert. 😅

@pradyunsg pradyunsg merged commit b997a48 into pypa:main Dec 7, 2022
@pradyunsg pradyunsg deleted the requirement-parser-rewrite branch December 7, 2022 21:50
@pradyunsg
Copy link
Member Author

Thanks @brettcannon and @hrnciar for the reviews!

@hrnciar hrnciar left a comment


Apparently, my comment wasn't posted and was hanging in the pending state. I'll file a PR to remove it.

PR: #630

(Resolved review thread on packaging/_tokenizer.py)