Fails to parse backslash on line by itself #4261

JelleZijlstra · 2024-02-29T06:25:53Z

% cat parse.py 
class Plotter:
\
    pass
% black parse.py 
error: cannot format parse.py: Cannot parse: 3:4:     pass

Oh no! 💥 💔 💥
1 file failed to reformat.
% black --version
black, 24.2.0 (compiled: yes)
Python (CPython) 3.12.0
% python parse.py
%

Saw this in astral-sh/ruff#10099; the example is minified from the repro case in that issue.


sumezulike · 2024-03-03T03:41:45Z

So I looked into it and the issue seems to lie here:

The tokenizer tracks backslash-escaped newlines by setting the continued flag on the line ending with \, then skips checking for indentation on the next line.

black/src/blib2to3/pgen2/tokenize.py

Lines 494 to 559 in e4bfedb

    
    elif parenlev == 0 and not continued:  # new statement
        if not line:
            break
        column = 0
        while pos < max:  # measure leading whitespace
            if line[pos] == " ":
                column += 1
            elif line[pos] == "\t":
                column = (column // tabsize + 1) * tabsize
            elif line[pos] == "\f":
                column = 0
            else:
                break
            pos += 1
        if pos == max:
            break
        if stashed:
            yield stashed
            stashed = None
        if line[pos] in "\r\n":  # skip blank lines
            yield (NL, line[pos:], (lnum, pos), (lnum, len(line)), line)
            continue
        if line[pos] == "#":  # skip comments
            comment_token = line[pos:].rstrip("\r\n")
            nl_pos = pos + len(comment_token)
            yield (
                COMMENT,
                comment_token,
                (lnum, pos),
                (lnum, nl_pos),
                line,
            )
            yield (NL, line[nl_pos:], (lnum, nl_pos), (lnum, len(line)), line)
            continue
        if column > indents[-1]:  # count indents
            indents.append(column)
            yield (INDENT, line[:pos], (lnum, 0), (lnum, pos), line)
        while column < indents[-1]:  # count dedents
            if column not in indents:
                raise IndentationError(
                    "unindent does not match any outer indentation level",
                    ("<tokenize>", lnum, pos, line),
                )
            indents = indents[:-1]
            if async_def and async_def_indent >= indents[-1]:
                async_def = False
                async_def_nl = False
                async_def_indent = 0
            yield (DEDENT, "", (lnum, pos), (lnum, pos), line)
        if async_def and async_def_nl and async_def_indent >= indents[-1]:
            async_def = False
            async_def_nl = False
            async_def_indent = 0
    else:  # continued statement
        if not line:
            raise TokenError("EOF in multi-line statement", (lnum, 0))
        continued = 0

That means that the input

class Plotter:
\
    pass

generates these tokens:

NAME "class"
NAME "Plotter"
OP ":"
NEWLINE "\n"
NL "\\\n"
NAME "pass"
NEWLINE "\n"

whereas

class Plotter:
    pass

generates these:

NAME "class"
NAME "Plotter"
OP ":"
NEWLINE "\n"
INDENT
NAME "pass"
NEWLINE "\n"
DEDENT
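
To make the difference between the two token streams easier to reproduce, here is a minimal toy model of the indentation logic. It is only a sketch and not blib2to3's tokenizer: the function name toy_tokens is hypothetical, every statement is collapsed into a single "STMT" token, and only spaces count as indentation. The one thing it keeps from the excerpt above is that the indentation check runs only when the previous line was not continued.

```python
def toy_tokens(source: str) -> list[str]:
    """Toy model of the `continued` flag; not blib2to3's real tokenizer."""
    tokens: list[str] = []
    indents = [0]
    continued = False
    for line in source.splitlines():
        text = line.strip()
        if not continued:
            if not text:
                continue  # blank lines carry no indentation information
            # New statement: measure leading spaces and emit INDENT/DEDENT.
            column = len(line) - len(line.lstrip(" "))
            if column > indents[-1]:
                indents.append(column)
                tokens.append("INDENT")
            while column < indents[-1]:
                indents.pop()
                tokens.append("DEDENT")
        # After a backslash the next line skips the check above entirely.
        continued = text.endswith("\\")
        if text.rstrip("\\").strip():
            tokens.append("STMT")
        tokens.append("NL" if continued else "NEWLINE")
    # Close any blocks that are still open at the end of the input.
    tokens.extend("DEDENT" for _ in indents[1:])
    return tokens


print(toy_tokens("class Plotter:\n    pass\n"))
print(toy_tokens("class Plotter:\n\\\n    pass\n"))
```

In this model the second input never produces an INDENT, because the line holding pass is treated as the continuation of the backslash line.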

But the grammar always expects an indent after the newline:

black/src/blib2to3/Grammar.txt

Line 127 in e4bfedb

suite: simple_stmt | NEWLINE INDENT stmt+ DEDENT

So addtoken finds no valid transition for the token that follows the NEWLINE (it gets NAME "pass" where the grammar requires INDENT) and raises an error.

black/src/blib2to3/pgen2/parse.py

Lines 333 to 334 in f03ee11

    
    # No success finding a transition
    raise ParseError("bad input", type, value, context)
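
As a rough illustration of why the parser gives up, the block form of the suite rule can be checked against a simplified token stream. This is a sketch under stated assumptions, not pgen2's parser: the function name is_block_suite and the "STMT" placeholder token are hypothetical, each statement is modeled as a STMT NEWLINE pair, and NL tokens are assumed to be filtered out before the grammar sees them.

```python
def is_block_suite(tokens: list[str]) -> bool:
    """Check a token list against: NEWLINE INDENT stmt+ DEDENT (toy version)."""
    # Assumption: NL tokens are not significant to the grammar.
    significant = [tok for tok in tokens if tok != "NL"]
    if significant[:2] != ["NEWLINE", "INDENT"]:
        return False
    if significant[-1] != "DEDENT":
        return False
    body = significant[2:-1]
    # stmt+ means at least one statement; each is modeled as STMT NEWLINE.
    if not body or len(body) % 2 != 0:
        return False
    pairs = zip(body[0::2], body[1::2])
    return all(pair == ("STMT", "NEWLINE") for pair in pairs)


# Tokens that follow the colon in the two inputs from this issue.
print(is_block_suite(["NEWLINE", "INDENT", "STMT", "NEWLINE", "DEDENT"]))
print(is_block_suite(["NEWLINE", "NL", "STMT", "NEWLINE"]))
```

The second stream fails at the very first check, since no INDENT follows the NEWLINE.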

I feel like it would make more sense to add special handling in the tokenizer than to change the grammar.

Anyone have a suggestion?

JelleZijlstra · 2024-03-03T03:56:09Z

I agree this should probably be fixed in the tokenizer. I don't have strong feelings on what the fix should look like, but presumably we should try to match what the CPython tokenizer does.
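
One way to look at a reference token stream is the standard library's tokenize module. The helper below is a small sketch (the name token_names is made up for illustration). One caveat: before Python 3.12 the tokenize module is a separate pure-Python implementation, so its output for the backslash case may differ from what the interpreter's own tokenizer does. No expected output is shown for that reason.

```python
import io
import tokenize


def token_names(source: str) -> list[str]:
    """Return the token type names that the stdlib tokenizer yields."""
    readline = io.StringIO(source).readline
    return [tokenize.tok_name[tok.type] for tok in tokenize.generate_tokens(readline)]


NORMAL = "class Plotter:\n    pass\n"
BACKSLASH = "class Plotter:\n\\\n    pass\n"

# Compare the two streams on whichever Python version runs this.
print(token_names(NORMAL))
print(token_names(BACKSLASH))
```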

JelleZijlstra · 2024-03-03T03:58:31Z

Fun fact, this was illegal in earlier versions of Python. I guess the new parser must have somehow made this allowed:

% ~/.pyenv/versions/3.7.16/bin/python -c '''class Plotter:
\
    pass
'''
  File "<string>", line 3
    pass
       ^
IndentationError: expected an indented block
% ~/.pyenv/versions/3.9.16/bin/python -c '''class Plotter:
\
    pass
'''
  File "<string>", line 3
        pass
      ^
IndentationError: expected an indented block
% ~/.pyenv/versions/3.10.9/bin/python -c '''class Plotter:
\
    pass
'''
%

sumezulike · 2024-03-03T04:09:56Z

Oh, haha! My first thought was even "That doesn't look allowed" but then it just ran.

I'll take a look at how this is handled in CPython and try to write a fix!

Frenchcoder294 · 2024-04-06T16:02:57Z

Can't we modify the _addtoken function to handle this, something like:

def _addtoken(self, ilabel: int, type: int, value: str, context: Context) -> bool:
    # Detect backslash-escaped newlines
    if value == "\\\n":
        # Special handling for backslash-escaped newlines, e.g. continue
        # the line without shifting tokens. The control flow may need
        # adjusting, or additional checks may be needed here.
        pass
    else:
        # Original token processing logic
        ...

JelleZijlstra · 2024-04-06T16:11:26Z

Feel free to submit a PR with test cases if your proposed change fixes this issue.

tusharsadhwani · 2024-05-03T14:53:47Z

@JelleZijlstra I have proposed a fix, but I feel it will require extensive testing as well. I've tried running it on a few open source projects and haven't seen any unexpected crashes etc.
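
For readers who want to see the general idea of ignoring such lines, here is a source-level sketch. It is not the implementation in the linked pull request, which works inside the tokenizer. The helper name drop_backslash_only_lines is hypothetical, and it deliberately handles only the exact case from this report: a backslash in column 0 followed directly by the newline. It knows nothing about multi-line strings, so it would also alter a backslash-only line inside a triple-quoted string. Treat it as an illustration rather than a safe preprocessing step.

```python
import re

# A line consisting of a single backslash in column 0 and the line ending.
_BACKSLASH_ONLY_LINE = re.compile(r"^\\\r?\n", flags=re.MULTILINE)


def drop_backslash_only_lines(source: str) -> str:
    """Remove lines that contain nothing but a line-continuation backslash."""
    return _BACKSLASH_ONLY_LINE.sub("", source)


cleaned = drop_backslash_only_lines("class Plotter:\n\\\n    pass\n")
print(repr(cleaned))

# The cleaned source is an ordinary class definition.
compile(cleaned, "<example>", "exec")
```

A fix inside the tokenizer would additionally need to keep the backslash somewhere so the formatted output can preserve it. This sketch simply discards it.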

JelleZijlstra added the "T: bug" (Something isn't working) and "C: parser" (How we parse code. Or fail to parse it.) labels on Feb 29, 2024

tusharsadhwani linked a pull request on Apr 30, 2024 that will close this issue:

tokenizer: skip lines that are just slash and whitespace #4343 (Open)