Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tokenization: improve EOF and multi-line statement interaction #1961

Closed
wants to merge 10 commits into from
Closed

Tokenization: improve EOF and multi-line statement interaction #1961

wants to merge 10 commits into from

Conversation

jayaddison
Copy link
Contributor

This is the kind of change that can cause subtle problems in future, so it's one that should be reviewed and considered carefully.

Currently during fuzz testing, hypothesmith is able to find some example cases where apparently-valid Python code that contains a multi-line indicator (\ character) immediately prior to the EOF indicator (without a newline) fails during black's tokenization stage.

This has been reported in #1012 and more recently #1960.

As per the notes in #1012, using code of this kind as input will successfully compile and run in Python.

This changeset permits the black tokenizer to continue when an EOF token is encountered following a multi-line indicator.

Resolves #1012.

@jayaddison
Copy link
Contributor Author

Maybe worth a little more pause here; looking at the current upstream blib2to3, this TokenError exception is still raised when an EOF is encountered during a continued statement (ref).

@jayaddison jayaddison marked this pull request as draft February 11, 2021 21:57
@jayaddison
Copy link
Contributor Author

Eh, ok - I'm going to mark this as ready for review again, having read up a bit more on the status of blib2to3 via bugs.python.org and understanding that black's vendored version has already diverged a little bit. Still feels maybe non-ideal, but I don't have any other changes planned here.

@jayaddison jayaddison marked this pull request as ready for review February 11, 2021 22:05
@ichard26
Copy link
Collaborator

Eh, ok - I'm going to mark this as ready for review again, having read up a bit more on the status of blib2to3 via bugs.python.org and understanding that black's vendored version has already diverged a little bit. Still feels maybe non-ideal, but I don't have any other changes planned here.

We've modified our version of lib2to3 a lot already so this isn't a big deal.

Copy link
Collaborator

@ichard26 ichard26 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually within a minute I created something that seems to hang Black?

test = (1
        +
        2

This is definitely a no-no even if it's invalid code (EOF in multi-line continuation) IMO.

@ichard26
Copy link
Collaborator

ichard26 commented Feb 14, 2021

Yeah, this seems to be hanging up tokenize.py generate_tokens... cProfile-ing Black on that file and killing it after 30ish seconds of nothing leads to this (partial) call graph:
out

edit: you can get something similar via

$ python -m cProfile -o out.pstats -m black test.py
$ gprof2dot out.pstats -z __init__:366:main -n0.01 -e0.01 | dot -Tsvg -o out.svg  # `gprof2dot` is from the `yelp-gprof2dot` PyPI package

@jayaddison jayaddison marked this pull request as draft February 15, 2021 10:58
@jayaddison
Copy link
Contributor Author

Thanks @ichard26 - I'll take a look into that.

… after a multi-line indicator"

This reverts commit b2c6670.
This works in 'fast' mode but appears to introduce some code output equivalence issues
@jayaddison
Copy link
Contributor Author

The latest pushed changes may appear to work based on the test coverage, but in fact there are code output equivalence issues. The following input file (with no closing end-of-line) produces a different abstract syntax tree during the second formatter pass:

def foo(a):
    pass\

(it works with the --fast CLI flag provided, since that skips the second pass)

@jayaddison
Copy link
Contributor Author

I think this is closable as a wont-fix; I'll add some more explanation soon in the linked issue.

@jayaddison jayaddison closed this Feb 15, 2021
@jayaddison jayaddison deleted the tokenization/backslash-eof branch February 15, 2021 18:38
@jayaddison
Copy link
Contributor Author

(in hindsight, assuming a wont-fix resolution wasn't sensible of me; this could still be addressed if there's appetite to)

@jayaddison jayaddison restored the tokenization/backslash-eof branch February 18, 2021 22:26
@jayaddison jayaddison reopened this Feb 18, 2021
@jayaddison jayaddison changed the title Tokenization: allow formatting of files that contain a multi-line indicator prior to EOF Tokenization: improve EOF and multi-line statement interaction Feb 18, 2021
@jayaddison jayaddison closed this Mar 13, 2021
@jayaddison jayaddison deleted the tokenization/backslash-eof branch March 13, 2021 19:05
@jayaddison
Copy link
Contributor Author

(just cleaning up and closing some old / stale pull requests that I've been working on. this turned out to be a more-involved fix than it looked like originally; some adjustments were made to the hyptothesmith config in #1998 to mitigate the test failures that were occurring as a result)

@ichard26
Copy link
Collaborator

@jayaddison and we really appreciate it btw! We want to refactor the single-file Black package into a multi-module package to improve maintainability (see GH-1746). Unfortunately such a change would break basically every PR, so we're working on reducing the PR backlog.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Black fails to tokenise files ending with a backslash
3 participants