Use generic string states in Python lexer #1477

pyrmont · 2020-04-05T08:42:10Z

Python allows for a variety of string literals (formatted, raw, unicode) as well as byte literals. In addition, strings can be delimited by `'`, `"`, `'''` and `"""`. At present, the Python lexer contains multiple states to handle the supported combinations. This approach is duplicative, error-prone and doesn't scale.

This PR takes a different approach. It adds a StringRegister class to the Python lexer that holds the stack of string literals currently being lexed. With this in place, it is possible to implement a series of generic string states and apply the appropriate tokens with reference to the register.
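To make the idea concrete, here is a rough sketch of what such a register could look like. It is an illustration only, not the code in this PR: the `Entry` struct and the method names (`register`, `remove`, `delim?`, `type?`, `depth`) are assumptions chosen for the example.

```ruby
# Hypothetical sketch of a string register. Names and API are illustrative,
# not the implementation added by this PR.
class StringRegister
  # One entry per string literal that is currently open.
  Entry = Struct.new(:prefix, :delim)

  def initialize
    @stack = []
  end

  # Record a literal that has just been opened, e.g. prefix "rb", delim '"""'.
  # The prefix is lowercased so that "R" and "r" are treated alike.
  def register(prefix, delim)
    @stack.push(Entry.new(prefix.downcase, delim))
  end

  # Forget the innermost literal once its closing delimiter has been seen.
  def remove
    @stack.pop
  end

  # Does the innermost literal use this delimiter?
  def delim?(delim)
    !@stack.empty? && @stack.last.delim == delim
  end

  # Does the innermost literal's prefix contain this flag ("f", "r", "b", "u")?
  def type?(flag)
    !@stack.empty? && @stack.last.prefix.include?(flag)
  end

  # Number of literals currently open.
  def depth
    @stack.size
  end
end

register = StringRegister.new
register.register("f", '"')   # the lexer has just seen f"
register.register("R", "'")   # a nested literal inside the f-string expression

register.type?("r")    # true: the innermost literal is raw
register.delim?('"')   # false: the innermost literal is delimited by '
register.remove        # the nested literal has been closed
register.type?("f")    # true: back in the outer f-string
```

A generic string state can then ask the register which delimiter closes the current literal and whether escapes or interpolation apply, instead of needing one state per prefix and delimiter combination.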

This PR fixes #937 and fixes #942 (or that's the goal at least).

pyrmont · 2020-04-05T08:44:22Z

This is lexing the visual sample correctly, but I would appreciate one or more sets of eyes from people with more Python experience (e.g. @aldanor, @dvf).

Oh, and @jneen, do you see anything horribly wrong with this solution?

pyrmont · 2020-04-05T08:49:58Z

Oh, and I didn't understand why the previous rules excluded newlines and percent signs from the :strings_single and :strings_double states. If someone can explain that, that'd be great. For example, in the :strings_single state:

rule %r/[^\\'%\n]+/, Str

I also didn't understand the reason for this rule in the :strings state:

rule %r/%(\([a-z0-9_]+\))?[-#0 +]*([0-9]+|[*])?(\.([0-9]+|[*]))?/i, Str::Interpol

Python allows for a variety of string literals (formatted, raw, unicode) as well as byte literals. In addition, strings can be delimited by `'`, `"`, `'''` and `"""`. At present, the Python lexer contains multiple states to handle the supported combinations. This approach is duplicative, error-prone and doesn't scale. This commit takes a different approach. A `StringRegister` class is added to the Python lexer and used to record the stack of string literals currently being lexed. This makes it possible to implement generic string states and then apply the appropriate tokens based on the state of the register.

pyrmont added the needs-review The PR needs to be reviewed label Apr 5, 2020

pyrmont self-assigned this Apr 5, 2020

pyrmont added 2 commits April 10, 2020 13:33

Add f-string examples to visual sample

f88ec01

Use generic string states

1465a97

pyrmont force-pushed the bugfix.python-generic-strings branch from a862ad8 to 1465a97 Compare April 10, 2020 04:33

pyrmont added 2 commits April 10, 2020 13:38

Make StringRegister class private

90f0583

Change syntax of string register API

9859938

pyrmont merged commit 652a622 into rouge-ruby:master Apr 14, 2020

pyrmont deleted the bugfix.python-generic-strings branch April 14, 2020 07:10

pyrmont removed the needs-review The PR needs to be reviewed label Apr 14, 2020

tuxu mentioned this pull request Apr 24, 2020

Python lexer broken for raw strings in v3.18.0 #1507

Closed
