
Error parsing Python raw f-string with backslash #1681

Closed
jmd-dk opened this issue Jan 13, 2021 · 19 comments
Labels
A-lexing (area: changes to individual lexers), good first issue (Good for newcomers), S-minor (severity: minor), T-bug (type: a bug)

Comments

@jmd-dk
Contributor

jmd-dk commented Jan 13, 2021

Python 3 code using raw f-strings containing a backslash within escaped braces (i.e. double braces) cannot be parsed. A minimal example is

fr'{{\S'

Some pointers:

  • Works: f'{{\S'
  • Works: r'{{\S'
  • Works: rf'{{'
  • Works: rf'\S'
  • Also does not work: rf'{{\S'
  • The S is not important, though some character must be there in order for the syntax to be valid.

This might seem like a pretty esoteric case. For me it comes up a lot when writing LaTeX in Matplotlib.
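For reference, here is a minimal way to observe the failure from the Pygments API. This is only a sketch: the exact token stream depends on the Pygments version, but on affected versions the backslash comes out as an Error token instead of being lexed as part of the string.

from pygments.lexers import PythonLexer
from pygments.token import Error

code = r"fr'{{\S'"
tokens = list(PythonLexer().get_tokens(code))
# On affected Pygments versions the backslash is emitted as an Error token.
print([tok for tok in tokens if tok[0] is Error])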

@Anteru Anteru added T-bug type: a bug A-lexing area: changes to individual lexers S-minor severity: minor labels Jan 14, 2021
@Anteru Anteru changed the title [Bug] Error parsing Python raw f-string with backslash Error parsing Python raw f-string with backslash Jan 14, 2021
@Anteru Anteru added the good first issue Good for newcomers label Jan 14, 2021
@jmd-dk
Contributor Author

jmd-dk commented Jan 14, 2021

I think I now understand the problem and have the solution.

In lexers/python.py under # raw f-strings, only the quote states ('tdqf', 'tsqf', 'dqf', 'sqf') are given as the third element of each tuple. Looking below at "non-raw f-strings" (normal f-strings) and "non-raw strings" (normal strings), their third elements are combined('fstringescape', <quote>) and combined('stringescape', <quote>), respectively. Further down we find the definition of fstringescape:

'fstringescape': [
    (r'\{\{', String.Escape),
    (r'\}\}', String.Escape),
    include('stringescape'),
],

i.e. fstringescape is just stringescape with the addition of {{ and }}.

We can try adding the missing third elements, resulting in the new code:

# raw f-strings
('(?i)(rf|fr)(""")',
 bygroups(String.Affix, String.Double),
 combined('fstringescape', 'tdqf')),
("(?i)(rf|fr)(''')",
 bygroups(String.Affix, String.Single),
 combined('fstringescape', 'tsqf')),
('(?i)(rf|fr)(")',
 bygroups(String.Affix, String.Double),
 combined('fstringescape', 'dqf')),
("(?i)(rf|fr)(')",
 bygroups(String.Affix, String.Single),
 combined('fstringescape', 'sqf')),

The rf'{{...}}' strings with backslashes will now lex without error. However, we've lost the "raw-ness": e.g. \n in rf'{{\n}}' will be highlighted as an escape sequence, even though raw strings have no escape sequences. The culprit is the normal stringescape included by fstringescape. We thus have to make a new such "escape" state containing only {{ and }}; call it rfstringescape:

'rfstringescape': [
    (r'\{\{', String.Escape),
    (r'\}\}', String.Escape),
],

Now using

# raw f-strings
('(?i)(rf|fr)(""")',
 bygroups(String.Affix, String.Double),
 combined('rfstringescape', 'tdqf')),
("(?i)(rf|fr)(''')",
 bygroups(String.Affix, String.Single),
 combined('rfstringescape', 'tsqf')),
('(?i)(rf|fr)(")',
 bygroups(String.Affix, String.Double),
 combined('rfstringescape', 'dqf')),
("(?i)(rf|fr)(')",
 bygroups(String.Affix, String.Single),
 combined('rfstringescape', 'sqf')),

my raw f-strings containing backslashes within {{...}} seem to be lexed properly.
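As a quick sanity check against such a patched lexer (a sketch, assuming the rfstringescape change above is applied locally), the raw f-string should produce no Error tokens, and the backslash sequence should no longer be reported as an escape:

from pygments.lexers import PythonLexer
from pygments.token import Error, String

code = r"rf'{{\n}}'"
tokens = list(PythonLexer().get_tokens(code))

# No Error tokens should be emitted for the raw f-string ...
assert not any(ttype is Error for ttype, _ in tokens)
# ... and \n should stay literal text rather than become String.Escape,
# since raw strings have no escape sequences.
assert not any(ttype is String.Escape and value == r'\n' for ttype, value in tokens)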

@opsocket

Thank you for leading the way.

[screenshot]

I have a version that more or less solves all these problems for all bytes and string literals. I still need to clean things up a bit, but the results are great so far.

Updated examples from https://docs.python.org/3/reference/lexical_analysis.html#formatted-string-literals.

[screenshot]

@Anteru
Collaborator

Anteru commented Jan 21, 2021

That looks quite promising -- looking forward to the PR!

@opsocket

I also have nested lambdas in function definitions (IMHO the trickiest token to catch)

[screenshot]

As you can see, I've used some non-standard token types (about 5 or so), such as Name.Function.Parameter (the fp class here) or Text.LineBreak (the br class). That's what I need to clean up.

I'll send one when it's ready 🖖

@jmd-dk
Contributor Author

jmd-dk commented Jan 21, 2021

Question

I wonder why Pygments (in the case of Python code) doesn't just use the built-in machinery of Python, where all of the tokenization/lexing etc. is fully exposed? Why reimplement all of this from scratch, which is obviously hard to get 100% right?

@opsocket

opsocket commented Jan 21, 2021

@jmd-dk I dunno, I was looking at the tokenize module, to be honest.
Comparatively, retrieving tokens using that module and dispatching them to Pygments' internal tokens would greatly reduce the number of operations required for the same result.

[screenshot]
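For illustration, here is a rough, hypothetical sketch of that idea (not the implementation shown in the screenshot): retrieve tokens with the stdlib tokenize module and dispatch them to Pygments token types through a small mapping.

import io
import keyword
import tokenize

from pygments.token import Comment, Keyword, Name, Number, Operator, String, Text

# Incomplete, hypothetical mapping from stdlib token types to Pygments types.
TOKEN_MAP = {
    tokenize.NAME: Name,
    tokenize.NUMBER: Number,
    tokenize.STRING: String,
    tokenize.OP: Operator,
    tokenize.COMMENT: Comment,
    tokenize.NEWLINE: Text,
    tokenize.NL: Text,
    tokenize.INDENT: Text,
    tokenize.DEDENT: Text,
}

def tokenize_to_pygments(source):
    """Yield (pygments_token_type, text) pairs for Python source code."""
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        ttype = TOKEN_MAP.get(tok.type, Text)
        # Keywords come out of tokenize as plain NAME tokens.
        if tok.type == tokenize.NAME and keyword.iskeyword(tok.string):
            ttype = Keyword
        yield ttype, tok.string
    # Note: a real lexer would also have to re-insert the whitespace that
    # tokenize drops between tokens, using the token start/end positions.

# e.g. list(tokenize_to_pygments("a = f'a{b}c'"))

Even this toy version shows the trade-off discussed below: on the Python versions being discussed here, the f-string comes back as a single STRING token (as in the tokenize output quoted further down), so any finer-grained highlighting would need a second pass.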

@birkenfeld
Member

birkenfeld commented Jan 21, 2021

@jmd-dk a highlighter is not the same as a language tokenizer.

tokenize's output (as you can see in TrippyHank's screenshot) is much less detailed than Pygments'. That's because many distinctions (e.g. between raw and non-raw strings) are handled later in the interpreter's pipeline.

On the other hand, Pygments needs to be more lenient, since it doesn't only support the one version of the language it happens to run on: using tokenize, new syntax in 3.11 would fail to highlight if Pygments runs on 3.10.

@birkenfeld
Member

I also have nested lambdas in function definitions (IMHO the trickiest token to catch)

As you can see, I've used some non-standard token types (about 5 or so), such as Name.Function.Parameter (the fp class here) or Text.LineBreak (the br class). That's what I need to clean up.

I'll send one when it's ready 🖖

Please see #1654 for why this is not a good idea.

@opsocket

opsocket commented Jan 21, 2021

@birkenfeld Hmmm ..

a highlighter is not the same as a language tokenizer.

Pygments is literally tokenizing input text in order to output it in other formats (HTML in the screenshots above).

We're just talking about replacing the tokenizing process for Python (it could be an opt-in), since there's a built-in module for that, which would:

  1. Simplify the code base
  2. Receive updates with Python version changes
  3. Eliminate all the burden of tokenizing the whole Python language by "hand" with regexes

Even so, I have a pretty complete version that I'm very happy with right now. I know there might be some special edge cases I've graciously missed, because hey, I've got only two hands and I'm living on a blue planet after all.

And I definitely don't care whether this is a good or a bad idea; if it works, it works 😌

@Anteru
Collaborator

Anteru commented Jan 21, 2021

I do care if it's a good idea or a bad idea, though. As @birkenfeld correctly points out, someone running Python 3.5 wouldn't get any Python 3.9-specific features highlighted, and that is in fact a blocker for us, because we want the same experience across all supported platforms. It also dramatically increases the testing matrix (it simply doubles it). I'm not supportive of turning Pygments into a proper Python parser (no matter what the underlying mechanism), and I'd rather live with some inaccuracies but have consistent behavior across all supported versions of Python.

@birkenfeld
Member

a highlighter is not the same as a language tokenizer.
Pygments is literally tokenizing input text in order to output to other formats (HTML up here).

I was a bit ambiguous: for "language tokenizer" read "tokenizer used by a language interpreter". Of course both are tokenizing/lexing; that's in the name.

We're just talking about replacing the tokenizing process for Python (it could be an opt-in), since there's a built-in module for that, which would:

1. Simplify the code base

In general, using a completely different strategy for one of the lexers compared to 99% of the others raises the bar for contributing. I know what I'm talking about: we have a few of those, and I'd rather they get rewritten.

2. Receive updates with Python version changes

As discussed before, this is nice, but the other direction breaks.

3. Eliminate all the burden of tokenizing the whole Python language by "hand" with regexes

If you have a look at tokenize.py you'll get a small surprise regarding that...

Even so, I have a pretty complete version that I'm very happy with right now. I know there might be some special edge cases I've graciously missed, because hey, I've got only two hands and I'm living on a blue planet after all.

And I definitely don't care whether this is a good or a bad idea; if it works, it works

No problem with that, if you're happy with it by all means use it. But it's not going to fly in a PR here.

In any case, this is unrelated to the issue at hand - which has been fixed by #1683.

@opsocket

opsocket commented Jan 21, 2021

I disagree that it's incorrect; tokenize has existed since at least 1992 for CPython. The output is "less expressive" (although there's more information included in a single token) precisely because of this backward-compatibility support.

@birkenfeld
Member

birkenfeld commented Jan 21, 2021

I disagree that it's incorrect; tokenize has existed since at least 1992 for CPython. The output is less expressive because of this backward-compatibility support.

Ok, I'll try once more: A tokenize-based Pygments running on Python 3.5 would never have highlighted f-strings correctly.

@opsocket

opsocket commented Jan 21, 2021

You just keep repeating that, but it's plainly false. An f-string is tokenized as a string by tokenize.
The f-string content parsing (f-expressions, etc.) is another (really simple) step (since you have the complete string token).

Really, I'm only trying to explain how it could work and you don't care, kinda blindly, without any documentation at hand on why it wouldn't work. I get that if you evaluate the code, it's not going to work in 3.5, obviously.
But this is only about highlighting text, not evaluating code, so Python 3.5 could highlight valid Python 3.9 code. The lexer would only have to map general tokenize tokens to more specific Pygments tokens for further processing.

It doesn't have to interpret it.
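For what it's worth, here is a deliberately simplified, hypothetical sketch of that second step (the real f-string grammar also has conversions, format specs, nested quoting, and so on): given the body of a STRING token, the doubled braces and the {...} replacement fields can be picked out in a separate pass.

import re

# Hypothetical, simplified second pass over an f-string STRING token body:
# doubled braces are literal text, single braces delimit replacement fields.
FSTRING_PART = re.compile(r"\{\{|\}\}|\{[^{}]*\}")

def split_fstring_body(body):
    """Split an f-string body into ('literal' | 'escape' | 'expr', text) pieces."""
    pos = 0
    for m in FSTRING_PART.finditer(body):
        if m.start() > pos:
            yield 'literal', body[pos:m.start()]
        text = m.group()
        kind = 'escape' if text in ('{{', '}}') else 'expr'
        yield kind, text
        pos = m.end()
    if pos < len(body):
        yield 'literal', body[pos:]

# e.g. list(split_fstring_body(r'{{\S}} and {b}'))
# -> [('escape', '{{'), ('literal', '\\S'), ('escape', '}}'),
#     ('literal', ' and '), ('expr', '{b}')]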

@jmd-dk
Contributor Author

jmd-dk commented Jan 21, 2021

... though I'm now slightly reconsidering whether I should have asked the question, I appreciate the answers and discussion 😄

@birkenfeld
Member

birkenfeld commented Jan 21, 2021

You just keep repeating that, but it's plainly false. An f-string is tokenized as a string by tokenize. The f-string content parsing (f-expressions, etc.) is another step.

~/devel/cpython [tags/3.5]> ./python --version
Python 3.5.10+
~/devel/cpython [tags/3.5]> ./python -m tokenize
a = f"a{b}c"
1,0-1,1:            NAME           'a'            
1,2-1,3:            OP             '='            
1,4-1,5:            NAME           'f'            
1,5-1,12:           STRING         '"a{b}c"'      
1,12-1,13:          NEWLINE        '\n'           
2,0-2,0:            ENDMARKER      ''             

The only ways out of this are a) shipping tokenize.py with Pygments or b) adding version-specific hacks. Neither makes the codebase simpler, and both still require version-specific updates in Pygments when syntax changes.

Besides, some common updates needed in the Python lexer are completely independent of what tokenize outputs, e.g. new builtins or exceptions.

Really, I'm only trying to explain how it could work and you don't care, kinda blindly, without any documentation at hand on why it wouldn't work. I get that if you evaluate the code, it's not going to work in 3.5, obviously.

Nobody is talking about evaluation. It's not that I don't care, it's that I know what I'm talking about and I know a blocker when I see one.

@birkenfeld
Member

birkenfeld commented Jan 21, 2021

@jmd-dk I'm sorry, I know this is off-topic here but we don't have a better venue for now...

@opsocket

opsocket commented Jan 21, 2021

I mean, with regexes you at least double what you need to type for the same result, because you have to create a specific state when "entering" a string on the first quote (since we need the same closing one) if we want to catch specific sub-tokens inside of it (f-expressions, etc.).

If it's not evaluation you're talking about, I don't get you, because throwing Python version incompatibility around as a reason not to use tokenize is plain wrong.

A tokenize-based Pygments running on Python 3.5 would never have highlighted f-strings correctly

It's a childishly simple mapping problem after tokenize has stripped out all the things you don't give a damn about (i.e. whitespace in sub-structures like function definitions, or enclosures not matched by a global rule, since we need exclusivity inside them to match things differently).

@Anteru
Collaborator

Anteru commented Jan 21, 2021

I don't think this discussion is leading anywhere. The arguments have been brought up, and our stance on this issue is clear. Let's move along here. If this is really important for you, you can always extend Pygments with a plugin that uses tokenize and use that exclusively for your Python lexing needs, but it's not something we will merge into Pygments upstream.

@pygments pygments locked as resolved and limited conversation to collaborators Jan 22, 2021