Fails to parse backslash on line by itself #4261

Open
JelleZijlstra opened this issue Feb 29, 2024 · 7 comments · May be fixed by #4343
Labels
C: parser How we parse code. Or fail to parse it. T: bug Something isn't working

Comments

@JelleZijlstra
Collaborator

% cat parse.py 
class Plotter:
\
    pass
% black parse.py 
error: cannot format parse.py: Cannot parse: 3:4:     pass

Oh no! 💥 💔 💥
1 file failed to reformat.
% black --version
black, 24.2.0 (compiled: yes)
Python (CPython) 3.12.0
% python parse.py
% 

Saw this in astral-sh/ruff#10099; the example is minified from the repro case in that issue.

JelleZijlstra added the "T: bug" and "C: parser" labels on Feb 29, 2024
@sumezulike
Contributor

So I looked into it and the issue seems to lie here:

The tokenizer tracks backslash-escaped newlines by setting the continued flag on the line ending with \, then skips checking for indentation on the next line.

elif parenlev == 0 and not continued:  # new statement
    if not line:
        break
    column = 0
    while pos < max:  # measure leading whitespace
        if line[pos] == " ":
            column += 1
        elif line[pos] == "\t":
            column = (column // tabsize + 1) * tabsize
        elif line[pos] == "\f":
            column = 0
        else:
            break
        pos += 1
    if pos == max:
        break
    if stashed:
        yield stashed
        stashed = None
    if line[pos] in "\r\n":  # skip blank lines
        yield (NL, line[pos:], (lnum, pos), (lnum, len(line)), line)
        continue
    if line[pos] == "#":  # skip comments
        comment_token = line[pos:].rstrip("\r\n")
        nl_pos = pos + len(comment_token)
        yield (
            COMMENT,
            comment_token,
            (lnum, pos),
            (lnum, nl_pos),
            line,
        )
        yield (NL, line[nl_pos:], (lnum, nl_pos), (lnum, len(line)), line)
        continue
    if column > indents[-1]:  # count indents
        indents.append(column)
        yield (INDENT, line[:pos], (lnum, 0), (lnum, pos), line)
    while column < indents[-1]:  # count dedents
        if column not in indents:
            raise IndentationError(
                "unindent does not match any outer indentation level",
                ("<tokenize>", lnum, pos, line),
            )
        indents = indents[:-1]
        if async_def and async_def_indent >= indents[-1]:
            async_def = False
            async_def_nl = False
            async_def_indent = 0
        yield (DEDENT, "", (lnum, pos), (lnum, pos), line)
    if async_def and async_def_nl and async_def_indent >= indents[-1]:
        async_def = False
        async_def_nl = False
        async_def_indent = 0
else:  # continued statement
    if not line:
        raise TokenError("EOF in multi-line statement", (lnum, 0))
    continued = 0

That means that the input

class Plotter:
\
    pass

generates these tokens:

NAME "class"
NAME "Plotter"
OP ":"
NEWLINE "\n"
NL "\\\n"
NAME "pass"
NEWLINE "\n"

whereas

class Plotter:
    pass

generates these:

NAME "class"
NAME "Plotter"
OP ":"
NEWLINE "\n"
INDENT
NAME "pass"
NEWLINE "\n"
DEDENT
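The difference between the two token streams can be reproduced with a toy model of the indentation logic. This is only a sketch: `indent_events` is a hypothetical helper, not part of Black or blib2to3, and it assumes space-only indentation and treats each physical line as one statement. It keeps just the `continued` flag and the `indents` stack from the excerpt.

```python
def indent_events(source: str) -> list[str]:
    """Toy model: emit INDENT/DEDENT markers and statement text per line."""
    events: list[str] = []
    indents = [0]      # stack of active indentation columns
    continued = False  # True when the previous line ended with a backslash

    for line in source.splitlines():
        text = line.strip()
        if not continued:
            if not text:
                continue  # blank lines never affect indentation
            column = len(line) - len(line.lstrip(" "))
            if column > indents[-1]:
                indents.append(column)
                events.append("INDENT")
            while column < indents[-1]:
                indents.pop()
                events.append("DEDENT")
        # A continuation line falls through without measuring indentation.
        continued = text.endswith("\\")
        body = text.rstrip("\\").strip()
        if body:
            events.append(body)

    events.extend("DEDENT" for _ in indents[1:])
    return events


print(indent_events("class Plotter:\n    pass\n"))
# ['class Plotter:', 'INDENT', 'pass', 'DEDENT']
print(indent_events("class Plotter:\n\\\n    pass\n"))
# ['class Plotter:', 'pass']
```

In the second call the lone backslash sets `continued`, so the indentation of the `pass` line is never measured and no INDENT is produced.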

But the grammar always expects an indent after the newline:

suite: simple_stmt | NEWLINE INDENT stmt+ DEDENT

So addtoken finds no valid transition for the NAME token that follows the NEWLINE and raises an error.

# No success finding a transition
raise ParseError("bad input", type, value, context)
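To make the failure concrete, here is a minimal sketch of a matcher for the block alternative of the suite rule, `NEWLINE INDENT stmt+ DEDENT`. The function `matches_block_suite` is hypothetical, `stmt` is simplified to `NAME NEWLINE`, and it assumes that NL tokens are dropped before the parser sees them.

```python
def matches_block_suite(tokens: list[str]) -> bool:
    """Check token type names against: NEWLINE INDENT stmt+ DEDENT."""
    # Assumption: NL tokens are not significant to the grammar.
    significant = [t for t in tokens if t != "NL"]
    if significant[:2] != ["NEWLINE", "INDENT"]:
        return False
    pos = 2
    statements = 0
    # stmt+ : one or more simplified statements, each "NAME NEWLINE"
    while significant[pos:pos + 2] == ["NAME", "NEWLINE"]:
        statements += 1
        pos += 2
    return statements >= 1 and significant[pos:] == ["DEDENT"]


# Tokens that follow `class Plotter :` in each of the two inputs.
normal = ["NEWLINE", "INDENT", "NAME", "NEWLINE", "DEDENT"]
with_backslash = ["NEWLINE", "NL", "NAME", "NEWLINE"]

print(matches_block_suite(normal))          # True
print(matches_block_suite(with_backslash))  # False
```

The second stream fails at the very first step: after NEWLINE the rule needs INDENT, but the next significant token is NAME.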

I feel like it would make more sense to add special handling in the tokenizer than to change the grammar.

Anyone have a suggestion?
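One possible shape for such special handling, sketched on a toy model rather than on the real tokenizer: treat a line that consists only of a backslash at column 0 as joining nothing, and measure indentation on the line after it. Everything here is an assumption for illustration. `indent_events_fixed` is a hypothetical helper, it handles only the exact case from the report, and it is not the change proposed in the linked pull request.

```python
def indent_events_fixed(source: str) -> list[str]:
    """Toy model with a special case for a lone backslash at column 0."""
    events: list[str] = []
    indents = [0]      # stack of active indentation columns
    continued = False  # True when the previous line ended with a backslash

    for line in source.splitlines():
        text = line.strip()
        if not continued:
            if not text:
                continue  # blank lines never affect indentation
            if line.rstrip() == "\\":
                # Hypothetical special case: nothing precedes the backslash,
                # so skip the line and measure indentation on the next one.
                continue
            column = len(line) - len(line.lstrip(" "))
            if column > indents[-1]:
                indents.append(column)
                events.append("INDENT")
            while column < indents[-1]:
                indents.pop()
                events.append("DEDENT")
        # Ordinary continuation lines still skip the indentation check.
        continued = text.endswith("\\")
        body = text.rstrip("\\").strip()
        if body:
            events.append(body)

    events.extend("DEDENT" for _ in indents[1:])
    return events


print(indent_events_fixed("class Plotter:\n\\\n    pass\n"))
# ['class Plotter:', 'INDENT', 'pass', 'DEDENT']
print(indent_events_fixed("x = 1 + \\\n    2\n"))
# ['x = 1 +', '2']
```

A real fix would also have to decide what to do when whitespace precedes the backslash, which this sketch leaves on the old code path.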

@JelleZijlstra
Collaborator Author

I agree this should probably be fixed in the tokenizer. I don't have strong feelings on what the fix should look like, but presumably we should try to match what the CPython tokenizer does.
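A quick way to compare against Python itself is to dump the token types from the standard library `tokenize` module. This is a sketch with a hypothetical helper name, `token_names`. It is also an assumption that `tokenize` mirrors the tokenizer the compiler uses on a given Python version, so the result for the backslash input may vary between versions and no expected output is shown for it.

```python
import io
import tokenize


def token_names(source: str) -> list[str]:
    """Return the stdlib token type names for source, skipping ENDMARKER."""
    readline = io.StringIO(source).readline
    return [
        tokenize.tok_name[tok.type]
        for tok in tokenize.generate_tokens(readline)
        if tok.type != tokenize.ENDMARKER
    ]


print(token_names("class Plotter:\n    pass\n"))
# ['NAME', 'NAME', 'OP', 'NEWLINE', 'INDENT', 'NAME', 'NEWLINE', 'DEDENT']

# Output for the input from this issue depends on the Python version.
print(token_names("class Plotter:\n\\\n    pass\n"))
```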

@JelleZijlstra
Collaborator Author

Fun fact, this was illegal in earlier versions of Python. I guess the new parser must have somehow made this allowed:

% ~/.pyenv/versions/3.7.16/bin/python -c '''class Plotter:
\
    pass
'''
  File "<string>", line 3
    pass
       ^
IndentationError: expected an indented block
% ~/.pyenv/versions/3.9.16/bin/python -c '''class Plotter:
\
    pass
'''
  File "<string>", line 3
        pass
      ^
IndentationError: expected an indented block
% ~/.pyenv/versions/3.10.9/bin/python -c '''class Plotter:
\
    pass
'''
%
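The same check can be run from inside any interpreter with the built-in `compile`, which is handy when several Python versions are installed. A small sketch; `cpython_accepts` is a hypothetical helper name, and the result for the snippet from this issue depends on the interpreter running it.

```python
import sys

SOURCE = "class Plotter:\n\\\n    pass\n"


def cpython_accepts(source: str) -> bool:
    """Return True if the running interpreter can compile source."""
    try:
        compile(source, "<string>", "exec")
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False
    return True


# Prints the interpreter version next to the verdict for the issue's input.
print(sys.version_info[:2], cpython_accepts(SOURCE))
```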

@sumezulike
Contributor

Oh, haha! My first thought was even "That doesn't look allowed" but then it just ran.

I'll take a look at how this is handled in CPython and try to write a fix!

@Frenchcoder294

Can't we modify the _addtoken function along these lines?

def _addtoken(self, ilabel: int, type: int, value: str, context: Context) -> bool:
    # Detect backslash-escaped newlines
    if value == '\\\n':
        # Special handling for backslash-escaped newlines:
        # perform token shifting or processing specific to this case
        # (e.g., continue the line without shifting tokens).
        # We may need to adjust the control flow or add checks here.
        pass
    else:
        # Original token processing logic goes here (elided in this proposal)
        ...

@JelleZijlstra
Collaborator Author

Feel free to submit a PR with test cases if your proposed change fixes this issue.

tusharsadhwani linked a pull request on Apr 30, 2024 that will close this issue
@tusharsadhwani
Contributor

@JelleZijlstra I have proposed a fix, but I feel it will require extensive testing as well. I've tried running it on a few open source projects and haven't seen any unexpected crashes etc.
