Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce usage of regex #2644

Merged
merged 11 commits into from Dec 1, 2021
Merged

Reduce usage of regex #2644

merged 11 commits into from Dec 1, 2021

Conversation

JelleZijlstra
Copy link
Collaborator

This removes all but one usage of the regex dependency. Tricky bits included:

  • A bug in test_black.py where we were incorrectly using a character range. Fix also submitted separately in fix regex in test_black.py #2643.
  • tokenize.py was the original use case for regex (#455 Fix bug with tricky unicode symbols #1047). The important bit is that we rely on \w to match anything valid in an identifier, and re fails to match a few characters as part of identifiers. My solution is to instead match all characters except those we know to mean something else in Python: whitespace and ASCII punctuation. This will make Black able to parse some invalid Python programs, like those that contain non-ASCII punctuation in the place of an identifier, but that seems fine to me.
  • One import of regex remains, in trans.py. We use a recursive regex to parse f-strings, and only regex supports that. I haven't thought of a better fix there (except maybe writing a manual parser), so I'm leaving that for now.

My goal is to remove the regex dependency to reduce the risk of breakage due to dependencies and make life easier for users on platforms without wheels.

Copy link
Collaborator

@ichard26 ichard26 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it and so does diff-shades - reporting zero changes. Probably worth putting some more thoughts into the tokenize hack before merging this though.

@@ -6,6 +6,7 @@

- Fixed Python 3.10 support on platforms without ProcessPoolExecutor (#2631)
- Fixed `match` statements with open sequence subjects, like `match a, b:` (#2639)
- Reduce usage of the `regex` dependency (#2644)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this worth mentioning in the changelog? This doesn't really have an impact on end users since we still depend on regex unconditionally so all of the problem involved in that will persist. Not a big deal but I wanted to flag this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking was that there's some chance it does have an impact on end users, so it's worth mentioning.

@@ -86,7 +86,7 @@ def _combinations(*l):
Comment = r"#[^\r\n]*"
Ignore = Whitespace + any(r"\\\r?\n" + Whitespace) + maybe(Comment)
Name = ( # this is invalid but it's fine because Name comes after Number in all groups
r"\w+"
r"[^\s#\(\)\[\]\{\}+\-*/!@$%^&=|;:'\",\.<>/?`~\\]+"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point we might need to add a FAQ entry describing why Black is incredibly inconsistent detecting invalid syntax. We don't promise that Black will fail on all invalid code but people do reasonably assume consistency. We don't need to get into the nitty gritty but simply explaining how it requires less work while achieving a high degree compatibility.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I can add that separately.

@@ -70,7 +70,7 @@
R = TypeVar("R")

# Match the time output in a diff, but nothing else
DIFF_TIME = re.compile(r"\t[\d-:+\. ]+")
DIFF_TIME = re.compile(r"\t[\d\-:+\. ]+")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch!

@JelleZijlstra
Copy link
Collaborator Author

The fuzz failure is real but happens on main too. Reported #2651.

JelleZijlstra added a commit that referenced this pull request Nov 30, 2021
@JelleZijlstra JelleZijlstra merged commit 5e2bb52 into main Dec 1, 2021
@JelleZijlstra JelleZijlstra deleted the noregex branch December 1, 2021 02:01
ichard26 added a commit that referenced this pull request Dec 2, 2021
We were no longer using it since GH-2644 and GH-2654. This should hopefully
make using Black easier to use as there's one less compiled dependency.
The core team also doesn't have to deal with the surprisingly frequent fires
the regex packaging setup goes through.

Co-authored-by: Richard Si <63936253+ichard26@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants