Use generic string states in Python lexer #1477

Merged
merged 4 commits into from Apr 14, 2020



@pyrmont pyrmont commented Apr 5, 2020

Python allows for a variety of string literals (formatted, raw, unicode) as well as byte literals. In addition, strings can be delimited by ', ", ''' and """. At present, the Python lexer contains multiple states to handle the supported combinations. This approach is duplicative, error-prone and doesn't scale.

This PR takes a different approach. A StringRegister class is added to the Python lexer that is used to hold the stack of string literals currently being lexed. Using this approach, it is possible to implement a series of generic string states and apply the appropriate tokens with reference to this register.
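
For readers who want a concrete picture of the register idea, here is a minimal sketch in plain Ruby. It is not the class merged in this PR: the name `StringRegisterSketch`, its methods, and the `escape_token_for` helper are all hypothetical, and token names are plain strings so the example runs without Rouge. It only illustrates how a stack of (prefix, delimiter) pairs could let one generic string state decide what to emit.

```ruby
# Hypothetical sketch, not the class merged in this PR.
class StringRegisterSketch
  Entry = Struct.new(:type, :delim)

  def initialize
    @stack = []
  end

  # Record a literal that has just been opened, e.g. type "rb", delim '"""'.
  def register(type:, delim:)
    @stack.push(Entry.new(type.downcase, delim))
  end

  # Forget the innermost literal once its closing delimiter is seen.
  def remove
    @stack.pop
  end

  # Is the innermost literal delimited by `delim`?
  def delim?(delim)
    !@stack.empty? && @stack.last.delim == delim
  end

  # Does the innermost literal's prefix contain `type` (e.g. "r", "f", "b")?
  def type?(type)
    !@stack.empty? && @stack.last.type.include?(type)
  end

  def depth
    @stack.size
  end
end

# Token names are plain strings here so the sketch runs without Rouge.
# Assumption: backslash sequences in raw literals are shown as ordinary text.
def escape_token_for(register)
  register.type?("r") ? "Str" : "Str::Escape"
end

register = StringRegisterSketch.new
register.register(type: "f", delim: '"')  # entering f"..."
register.register(type: "", delim: "'")   # nested '...' inside a replacement field
puts register.delim?("'")                  # true
register.remove                            # nested literal closed
puts register.type?("f")                   # true
puts escape_token_for(register)            # Str::Escape
```

With a structure like this, a single closing-delimiter rule can ask the register whether the delimiter it just matched ends the innermost literal, instead of needing one lexer state per prefix and delimiter combination.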

This PR fixes #937 and fixes #942 (or that's the goal at least).

@pyrmont pyrmont added the needs-review label Apr 5, 2020
@pyrmont pyrmont self-assigned this Apr 5, 2020

pyrmont commented Apr 5, 2020

This is lexing the visual sample correctly, but I would appreciate one or more sets of eyes that have more experience with Python (e.g. @aldanor, @dvf).

Oh, and @jneen, do you see anything horribly wrong with this solution?


pyrmont commented Apr 5, 2020

Oh, and I didn't understand why the previous rules excluded newlines and percent signs from the :strings_single and :strings_double states. If someone can explain that, that'd be great. For example, in the :strings_single state:

rule %r/[^\\'%\n]+/, Str

I also didn't understand the reason for this rule in the :strings state:

rule %r/%(\([a-z0-9_]+\))?[-#0 +]*([0-9]+|[*])?(\.([0-9]+|[*]))?/i, Str::Interpol
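
To make the behaviour of the two quoted rules concrete, the snippet below runs their regular expressions as plain Ruby regexps against a sample string. The constant names and the sample text are illustrative only. It does not settle why the rules were written this way; one plausible reading, not confirmed in this thread, is that `%` is kept out of the plain-text rule so that the second rule gets a chance to match printf-style format specifiers.

```ruby
# The two patterns quoted above, copied as plain Ruby regexps.
# Constant names are hypothetical and used only for this illustration.
STRING_BODY = %r/[^\\'%\n]+/
FORMAT_SPEC = %r/%(\([a-z0-9_]+\))?[-#0 +]*([0-9]+|[*])?(\.([0-9]+|[*]))?/i

sample = "Hello %(name)-10.2f done"

# The plain-text pattern stops as soon as it reaches the percent sign.
body = sample[STRING_BODY]
puts body.inspect   # "Hello "

# The second pattern then picks up the mapping key, flags, width and
# precision. As written it does not consume the conversion character ("f").
spec = sample[FORMAT_SPEC]
puts spec.inspect   # "%(name)-10.2"

# The plain-text pattern also stops at a newline.
first_line = "line one\nline two"[STRING_BODY]
puts first_line.inspect   # "line one"
```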

@pyrmont pyrmont force-pushed the bugfix.python-generic-strings branch from a862ad8 to 1465a97 Compare April 10, 2020 04:33
@pyrmont pyrmont merged commit 652a622 into rouge-ruby:master Apr 14, 2020
@pyrmont pyrmont deleted the bugfix.python-generic-strings branch April 14, 2020 07:10
@pyrmont pyrmont removed the needs-review label Apr 14, 2020
mattt pushed a commit to NSHipster/rouge that referenced this pull request May 21, 2020
Python allows for a variety of string literals (formatted, raw,
unicode) as well as byte literals. In addition, strings can be
delimited by `'`, `"`, `'''` and `"""`. At present, the Python lexer
contains multiple states to handle the supported combination. This
approach is duplicative, error-prone and doesn't scale.

This commit takes a different approach. A `StringRegister` class is
added to the Python lexer and used to record the stack of string
literals currently being lexed. This makes it possible to implement
generic string states and then apply the appropriate tokens based on
the state of the register.
Successfully merging this pull request may close these issues.

Full support of Python strings literals
Python f-strings not highlighted (PEP 498)