
Error parsing Python raw f-string with backslash #1681

Closed
jmd-dk opened this issue Jan 13, 2021 · 19 comments
Labels
A-lexing (area: changes to individual lexers), good first issue (Good for newcomers), S-minor (severity: minor), T-bug (type: a bug)

Comments

@jmd-dk
Contributor

jmd-dk commented Jan 13, 2021

Python 3 code using raw f-strings containing a backslash within escaped braces (i.e. double braces) cannot be parsed. A minimal example is

fr'{{\S'

Some pointers:

  • Works: f'{{\S'
  • Works: r'{{\S'
  • Works: rf'{{'
  • Works: rf'\S'
  • Also does not work: rf'{{\S'
  • The S is not important, though some character must be there in order for the syntax to be valid.

This might seem like a pretty esoteric case. For me it comes up a lot when writing LaTeX in Matplotlib.
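For reference, here is a minimal way to observe the failure from the Pygments API. This is only a sketch: the exact token stream depends on the Pygments version, but on affected versions the backslash comes out as an Error token instead of being lexed as part of the string.

from pygments.lexers import PythonLexer
from pygments.token import Error

code = r"fr'{{\S'"
tokens = list(PythonLexer().get_tokens(code))
# On affected Pygments versions the backslash is emitted as an Error token.
print([tok for tok in tokens if tok[0] is Error])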

@Anteru Anteru added T-bug type: a bug A-lexing area: changes to individual lexers S-minor severity: minor labels Jan 14, 2021
@Anteru Anteru changed the title [Bug] Error parsing Python raw f-string with backslash Error parsing Python raw f-string with backslash Jan 14, 2021
@Anteru Anteru added the good first issue Good for newcomers label Jan 14, 2021
@jmd-dk
Contributor Author

jmd-dk commented Jan 14, 2021

I think I now understand the problem and have the solution.

In lexers/python.py under # raw f-strings, only the quote states ('tdqf', 'tsqf', 'dqf', 'sqf') are given as the third element of each tuple. Looking below at "non-raw f-strings" (normal f-strings) and "non-raw strings" (normal strings), their third elements are combined('fstringescape', <quote>) and combined('stringescape', <quote>), respectively. Further down we find the definition of fstringescape:

'fstringescape': [
    (r'\{\{', String.Escape),
    (r'\}\}', String.Escape),
    include('stringescape'),
],

i.e. fstringescape is just stringescape with the addition of {{ and }}.

We can try adding the missing third elements, resulting in the new code:

# raw f-strings
('(?i)(rf|fr)(""")',
 bygroups(String.Affix, String.Double),
 combined('fstringescape', 'tdqf')),
("(?i)(rf|fr)(''')",
 bygroups(String.Affix, String.Single),
 combined('fstringescape', 'tsqf')),
('(?i)(rf|fr)(")',
 bygroups(String.Affix, String.Double),
 combined('fstringescape', 'dqf')),
("(?i)(rf|fr)(')",
 bygroups(String.Affix, String.Single),
 combined('fstringescape', 'sqf')),

The rf'{{...}}' strings with backslashes will now lex without error. However, we've lost the "raw-ness": e.g. \n in rf'{{\n}}' will be highlighted as an escape sequence, even though raw strings have no escape sequences. The culprit is the normal stringescape included by fstringescape. We thus have to make a new such "escape" state containing only {{ and }}; call it rfstringescape:

'rfstringescape': [
    (r'\{\{', String.Escape),
    (r'\}\}', String.Escape),
],

Now using

# raw f-strings
('(?i)(rf|fr)(""")',
 bygroups(String.Affix, String.Double),
 combined('rfstringescape', 'tdqf')),
("(?i)(rf|fr)(''')",
 bygroups(String.Affix, String.Single),
 combined('rfstringescape', 'tsqf')),
('(?i)(rf|fr)(")',
 bygroups(String.Affix, String.Double),
 combined('rfstringescape', 'dqf')),
("(?i)(rf|fr)(')",
 bygroups(String.Affix, String.Single),
 combined('rfstringescape', 'sqf')),

my raw f-strings containing backslashes within {{...}} seem to be lexed properly.
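As a quick sanity check against such a patched lexer (a sketch, assuming the rfstringescape change above is applied locally), the raw f-string should produce no Error tokens, and the backslash sequence should no longer be reported as an escape:

from pygments.lexers import PythonLexer
from pygments.token import Error, String

code = r"rf'{{\n}}'"
tokens = list(PythonLexer().get_tokens(code))

# No Error tokens should be emitted for the raw f-string ...
assert not any(ttype is Error for ttype, _ in tokens)
# ... and \n should stay literal text rather than become String.Escape,
# since raw strings have no escape sequences.
assert not any(ttype is String.Escape and value == r'\n' for ttype, value in tokens)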

@opsocket

Thank you for leading the way.

[screenshot]

I have a version that more or less solves all these problems for all bytes and string literals. I still need to clean things up a bit, but the results are great so far.

Updated examples from https://docs.python.org/3/reference/lexical_analysis.html#formatted-string-literals.

[screenshot]

@Anteru
Collaborator

Anteru commented Jan 21, 2021

That looks quite promising -- looking forward to the PR!

@opsocket

I also have nested lambdas in function definitions (IMHO the trickiest token to catch)

[screenshot]

As you can see, I've used some non-standard token types (about 5 or so), such as Name.Function.Parameter (the fp class here) or Text.LineBreak (the br class). That's what I need to clean up.

I'll send one when it's ready 🖖

@jmd-dk
Contributor Author

jmd-dk commented Jan 21, 2021

Question

I wonder why Pygments (in the case of Python code) doesn't just use the built-in machinery of Python, where all of the tokenization/lexing etc. is fully exposed? Why reimplement all of this from scratch, which is obviously hard to get 100% right?

@opsocket

opsocket commented Jan 21, 2021

@jmd-dk I dunno, I was looking at the tokenize module, to be honest.
Comparatively, retrieving tokens using that module and dispatching them to Pygments' internal tokens would greatly reduce the number of operations required for the same result.

[screenshot]
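For illustration, here is a rough, hypothetical sketch of that idea (not the implementation shown in the screenshot): retrieve tokens with the stdlib tokenize module and dispatch them to Pygments token types through a small mapping.

import io
import keyword
import tokenize

from pygments.token import Comment, Keyword, Name, Number, Operator, String, Text

# Incomplete, hypothetical mapping from stdlib token types to Pygments types.
TOKEN_MAP = {
    tokenize.NAME: Name,
    tokenize.NUMBER: Number,
    tokenize.STRING: String,
    tokenize.OP: Operator,
    tokenize.COMMENT: Comment,
    tokenize.NEWLINE: Text,
    tokenize.NL: Text,
    tokenize.INDENT: Text,
    tokenize.DEDENT: Text,
}

def tokenize_to_pygments(source):
    """Yield (pygments_token_type, text) pairs for Python source code."""
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        ttype = TOKEN_MAP.get(tok.type, Text)
        # Keywords come out of tokenize as plain NAME tokens.
        if tok.type == tokenize.NAME and keyword.iskeyword(tok.string):
            ttype = Keyword
        yield ttype, tok.string
    # Note: a real lexer would also have to re-insert the whitespace that
    # tokenize drops between tokens, using the token start/end positions.

# e.g. list(tokenize_to_pygments("a = f'a{b}c'"))

Even this toy version shows the trade-off discussed below: on the Python versions being discussed here, the f-string comes back as a single STRING token (as in the tokenize output quoted further down), so any finer-grained highlighting would need a second pass.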

@birkenfeld
Member

birkenfeld commented Jan 21, 2021

@jmd-dk a highlighter is not the same as a language tokenizer.

tokenize's output (as you can see in TrippyHank's screenshot) is much less detailed than Pygments'. That's because many distinctions (e.g. between raw and non-raw strings) are handled later in the interpreter's pipeline.

On the other hand, Pygments needs to be more lenient, since it doesn't only support the one version of the language it happens to run on: using tokenize, new syntax in 3.11 would fail to highlight if Pygments runs on 3.10.

@birkenfeld
Member

I also have nested lambdas in function definitions (IMHO the trickiest token to catch)

As you can see, I've used some non-standard token types (about 5 or so), such as Name.Function.Parameter (the fp class here) or Text.LineBreak (the br class). That's what I need to clean up.

I'll send one when it's ready 🖖

Please see #1654 for why this is not a good idea.

@opsocket

opsocket commented Jan 21, 2021

@birkenfeld Hmmm ..

a highlighter is not the same as a language tokenizer.

Pygments is literally tokenizing input text in order to output it in other formats (HTML in the screenshots above).

We're just talking about replacing the tokenizing process for Python (it could be an opt-in), since there's a built-in module for that, which would:

  1. Simplify the code base
  2. Receive updates with Python version changes
  3. Eliminate all the burden of tokenizing the whole Python language by "hand" with regexes

Even so, I have a pretty complete version that I'm very happy with right now. I know there might be some special edge cases I've graciously missed, because hey, I've got only two hands and I'm living on a blue planet after all.

And I definitely don't care whether this is a good or a bad idea; if it works, it works 😌

@Anteru
Collaborator

Anteru commented Jan 21, 2021

I do care if it's a good idea or a bad idea, though. As @birkenfeld correctly points out, someone running Python 3.5 wouldn't get any Python 3.9-specific features highlighted, and that is in fact a blocker for us, because we want the same experience across all supported platforms. It also dramatically increases the testing matrix (it simply doubles it). I'm not supportive of turning Pygments into a proper Python parser (no matter what the underlying mechanism), and I'd rather live with some inaccuracies but have consistent behavior across all supported versions of Python.

@birkenfeld
Member

a highlighter is not the same as a language tokenizer.
Pygments is literally tokenizing input text in order to output to other formats (HTML up here).

I was a bit ambiguous: for "language tokenizer" read "tokenizer used by a language interpreter". Of course both are tokenizing/lexing; that's in the name.

We're just talking about replacing the tokenizing process for Python (it could be an opt-in), since there's a built-in module for that, which would:

1. Simplify the code base

In general, using a completely different strategy for one of the lexers compared to 99% of the others raises the bar for contributing. I know what I'm talking about: we have a few of those, and I'd rather they get rewritten.

2. Receive updates with Python version changes

As discussed before, this is nice, but the other direction breaks.

3. Eliminate all the burden of tokenizing the whole Python language by "hand" with regexes

If you have a look at tokenize.py you'll get a small surprise regarding that...

Even so, I have a pretty complete version that I'm very happy with right now. I know there might be some special edge cases I've graciously missed, because hey, I've got only two hands and I'm living on a blue planet after all.

And I definitely don't care whether this is a good or a bad idea; if it works, it works

No problem with that, if you're happy with it by all means use it. But it's not going to fly in a PR here.

In any case, this is unrelated to the issue at hand - which has been fixed by #1683.

@opsocket

opsocket commented Jan 21, 2021

I disagree that it's incorrect; tokenize has existed since at least 1992 for CPython. The output is "less expressive" (although there's more information included in a single token) precisely because of this backward-compatibility support.

@birkenfeld
Member

birkenfeld commented Jan 21, 2021

I disagree that it's incorrect; tokenize has existed since at least 1992 for CPython. The output is less expressive because of this backward-compatibility support.

Ok, I'll try once more: A tokenize-based Pygments running on Python 3.5 would never have highlighted f-strings correctly.

@opsocket

opsocket commented Jan 21, 2021

You just keep repeating that, but it's plainly false. An f-string is tokenized as a string by tokenize.
The f-string content parsing (f-expressions, etc.) is another (really simple) step (since you have the complete string token).

Really, I'm only trying to explain how it could work and you don't care, kinda blindly, without any documentation at hand on why it wouldn't work. I get that if you evaluate the code, it's not going to work in 3.5, obviously.
But this is only about highlighting text, not evaluating code, so Python 3.5 could highlight valid Python 3.9 code. The lexer would only have to map general tokenize tokens to more specific Pygments tokens for further processing.

It doesn't have to interpret it.
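For what it's worth, here is a deliberately simplified, hypothetical sketch of that second step (the real f-string grammar also has conversions, format specs, nested quoting, and so on): given the body of a STRING token, the doubled braces and the {...} replacement fields can be picked out in a separate pass.

import re

# Hypothetical, simplified second pass over an f-string STRING token body:
# doubled braces are literal text, single braces delimit replacement fields.
FSTRING_PART = re.compile(r"\{\{|\}\}|\{[^{}]*\}")

def split_fstring_body(body):
    """Split an f-string body into ('literal' | 'escape' | 'expr', text) pieces."""
    pos = 0
    for m in FSTRING_PART.finditer(body):
        if m.start() > pos:
            yield 'literal', body[pos:m.start()]
        text = m.group()
        kind = 'escape' if text in ('{{', '}}') else 'expr'
        yield kind, text
        pos = m.end()
    if pos < len(body):
        yield 'literal', body[pos:]

# e.g. list(split_fstring_body(r'{{\S}} and {b}'))
# -> [('escape', '{{'), ('literal', '\\S'), ('escape', '}}'),
#     ('literal', ' and '), ('expr', '{b}')]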

@jmd-dk
Contributor Author

jmd-dk commented Jan 21, 2021

... though I'm now slightly reconsidering whether I should have asked the question, I appreciate the answers and discussion 😄

@birkenfeld
Member

birkenfeld commented Jan 21, 2021

You just keep repeating that, but it's plainly false. An f-string is tokenized as a string by tokenize. The f-string content parsing (f-expressions, etc.) is another step.

~/devel/cpython [tags/3.5]> ./python --version
Python 3.5.10+
~/devel/cpython [tags/3.5]> ./python -m tokenize
a = f"a{b}c"
1,0-1,1:            NAME           'a'            
1,2-1,3:            OP             '='            
1,4-1,5:            NAME           'f'            
1,5-1,12:           STRING         '"a{b}c"'      
1,12-1,13:          NEWLINE        '\n'           
2,0-2,0:            ENDMARKER      ''             

The only ways out of this are a) shipping tokenize.py with Pygments or b) adding version-specific hacks. Neither makes the codebase simpler, and both still require version-specific updates in Pygments when syntax changes.

Besides, some common updates needed in the Python lexer are completely independent of what tokenize outputs, e.g. new builtins or exceptions.

Really, I'm only trying to explain how it could work and you don't care, kinda blindly, without any documentation at hand on why it wouldn't work. I get that if you evaluate the code, it's not going to work in 3.5, obviously.

Nobody is talking about evaluation. It's not that I don't care, it's that I know what I'm talking about and I know a blocker when I see one.

@birkenfeld
Member

birkenfeld commented Jan 21, 2021

@jmd-dk I'm sorry, I know this is off-topic here but we don't have a better venue for now...

@opsocket

opsocket commented Jan 21, 2021

I mean, with regexes you at least double what you need to type for the same result, because you have to create a specific state when "entering" a string on the first quote (since we need the same closing one) if we want to catch specific sub-tokens inside of it (f-expressions, etc.).

If it's not evaluation you're talking about, I don't get you, because throwing Python version incompatibility around as a reason not to use tokenize is plain wrong.

A tokenize-based Pygments running on Python 3.5 would never have highlighted f-strings correctly

It's a childishly simple mapping problem after tokenize has stripped out all the things you don't give a damn about (i.e. whitespace in sub-structures like function definitions, or enclosures not matched by a global rule, since we need exclusivity inside them to match things differently).

@Anteru
Collaborator

Anteru commented Jan 21, 2021

I don't think this discussion is leading anywhere. The arguments have been brought up, and our stance on this issue is clear. Let's move along here. If this is really important for you, you can always extend Pygments with a plugin that uses tokenize and use that exclusively for your Python lexing needs, but it's not something we will merge into Pygments upstream.

@pygments pygments locked as resolved and limited conversation to collaborators Jan 22, 2021