
Implements #2383 Add syntax modes for FreeMarker template language #2847

Merged
merged 1 commit into from
Jan 14, 2022

Conversation

blutorange
Contributor

@blutorange blutorange commented Dec 21, 2021

Woah, test cases got really large, much larger than the implementation...

Adds syntax highlighting support for Apache FreeMarker, resolves #2383. Includes tests and samples for the website.

  • Language ID is freemarker2 since FreeMarker 3 will most likely change the syntax
  • FreeMarker actually defines 6 slightly different syntaxes. Therefore this commit adds the 6 language modes freemarker2.tag-*.interpolation-*. It also adds the mode freemarker2, which is an alias for freemarker2.tag-angle.interpolation-dollar, the default mode when using FreeMarker via the Java API.
  • If anybody has a better suggestion for naming the modes, feel free to suggest it.
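The naming scheme above can be enumerated mechanically. The following is only a sketch: the helper names `languageId` and `allLanguageIds` are made up for illustration and are not taken from the PR; only the ID pattern comes from the description.

```typescript
// Tag and interpolation syntax names as they appear in the PR description.
type TagSyntaxName = "angle" | "bracket" | "auto";
type InterpolationSyntaxName = "dollar" | "bracket";

const TAG_SYNTAX_NAMES: TagSyntaxName[] = ["angle", "bracket", "auto"];
const INTERPOLATION_SYNTAX_NAMES: InterpolationSyntaxName[] = ["dollar", "bracket"];

// Hypothetical helper: builds the ID for one tag/interpolation combination.
function languageId(tag: TagSyntaxName, interpolation: InterpolationSyntaxName): string {
  return "freemarker2.tag-" + tag + ".interpolation-" + interpolation;
}

// Hypothetical helper: lists the six explicit mode IDs (3 tag syntaxes * 2 interpolation syntaxes).
function allLanguageIds(): string[] {
  const ids: string[] = [];
  for (const tag of TAG_SYNTAX_NAMES) {
    for (const interpolation of INTERPOLATION_SYNTAX_NAMES) {
      ids.push(languageId(tag, interpolation));
    }
  }
  return ids;
}

// The plain "freemarker2" ID is described as an alias for this combination.
const DEFAULT_ALIAS_TARGET = languageId("angle", "dollar");
```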

(from the source code comment)

The grammar for FreeMarker 2.x. This tokenizer is intentionally limited to FreeMarker 2 as the next release FreeMarker 3 is a breaking change that will change the syntax, see:
https://cwiki.apache.org/confluence/display/FREEMARKER/FreeMarker+3

FreeMarker does not just have one grammar, it has 6 (!) different syntaxes.

  • 3 possibilities for the tag syntax: angle, bracket, auto
  • 2 possibilities for the interpolation syntax: dollar, bracket

These can be combined, resulting in 3*2=6 syntaxes. There's another tag syntax, but that one is legacy and therefore ignored by this tokenizer.

  • Angle tag syntax is like <#if true>...</#if>
  • Bracket tag syntax is like [#if true]...[/#if]
  • Auto tag syntax inspects the first directive and uses whichever tag syntax (angle or bracket) that directive is written in.

Dollar interpolation syntax is like ${1+2}, bracket syntax like [=1+2].
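As a rough illustration of what "auto" means, the check could look like the sketch below. This is a simplification and not the PR's code: the function name `detectTagSyntax` is hypothetical, and it only looks for `<#`, `</#`, `[#`, and `[/#`, treating anything that starts that way as a directive.

```typescript
type ResolvedTagSyntax = "angle" | "bracket";

// Simplified sketch: returns the tag syntax of the first directive-like
// opener in the template, or undefined when the template has none.
function detectTagSyntax(template: string): ResolvedTagSyntax | undefined {
  // Group 1 matches an angle opener, group 2 a bracket opener.
  // exec() returns the leftmost match, i.e. the first directive.
  const match = /(<\/?#)|(\[\/?#)/.exec(template);
  if (match === null) {
    return undefined;
  }
  return match[1] !== undefined ? "angle" : "bracket";
}
```

With this sketch, `detectTagSyntax("[#if true]...[/#if]")` yields `"bracket"`, and a template containing only `${1+2}` yields `undefined` because interpolations are not directives.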

To prevent duplicate code, there are factory functions that take a syntax mode and dynamically create the tokenizer for that mode. This does not affect performance since the tokenizer is created only once.
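A minimal sketch of the factory idea, assuming a much smaller rule set than the real tokenizer: the names `SyntaxMode`, `ModeRules`, and `createModeRules` are hypothetical, and the real implementation produces a full tokenizer definition rather than four delimiters.

```typescript
interface SyntaxMode {
  tag: "angle" | "bracket";
  interpolation: "dollar" | "bracket";
}

interface ModeRules {
  directiveOpen: RegExp;      // matches e.g. "<#if" or "[#if"
  directiveClose: string;     // ">" or "]"
  interpolationOpen: string;  // "${" or "[="
  interpolationClose: string; // "}" or "]"
}

// Hypothetical factory: derives the mode-specific delimiters once, so the
// rules do not have to be written out separately for every mode.
function createModeRules(mode: SyntaxMode): ModeRules {
  const open = mode.tag === "angle" ? "<" : "\\[";
  return {
    directiveOpen: new RegExp(open + "#([a-zA-Z_]+)"),
    directiveClose: mode.tag === "angle" ? ">" : "]",
    interpolationOpen: mode.interpolation === "dollar" ? "${" : "[=",
    interpolationClose: mode.interpolation === "dollar" ? "}" : "]",
  };
}
```

Because the factory runs once per registered mode, the cost of building the regular expressions is paid up front and not per token.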

Auto mode is implemented via parser states. Each parser state exists three times, one for each tag syntax mode (e.g. default.auto, default.angle, default.bracket). Auto mode starts in default.auto and switches to default.angle or default.bracket when it encounters the first directive.
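The state switching described here could be sketched as follows. Only the state names `default.auto`, `default.angle`, and `default.bracket` come from the description; `stateVariants` and `nextState` are hypothetical helpers, and the directive check is reduced to a prefix test.

```typescript
type DefaultState = "default.auto" | "default.angle" | "default.bracket";

// Hypothetical helper: expands one base state name into its three variants.
function stateVariants(base: string): string[] {
  return [base + ".auto", base + ".angle", base + ".bracket"];
}

// Sketch of the transition: while in the auto state, the first directive
// decides which fixed state is used from then on. Fixed states never change.
function nextState(state: DefaultState, upcomingText: string): DefaultState {
  if (state !== "default.auto") {
    return state;
  }
  if (upcomingText.indexOf("<#") === 0) {
    return "default.angle";
  }
  if (upcomingText.indexOf("[#") === 0) {
    return "default.bracket";
  }
  return state; // no directive yet, keep looking
}
```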

FreeMarker allows expressions within strings ("a${1+2}b"), but this tokenizer cannot tokenize them. The reason is that FreeMarker does not implement string interpolation via a mode change when it encounters ${. Rather, FreeMarker tokenizes the string as a literal string first. Then, during the AST build phase, it creates a new parser and parses the unescaped string content.
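To make the two-phase idea concrete, here is a rough sketch of the second phase only: given the already unescaped content of a string literal, it splits out the `${...}` parts. It is not FreeMarker's implementation; `splitInterpolations` and `StringPart` are hypothetical names, and nested braces and escapes are ignored.

```typescript
interface StringPart {
  kind: "text" | "interpolation";
  value: string;
}

// Simplified second phase: rescans the content of a string literal that the
// first phase kept as a single token.
function splitInterpolations(content: string): StringPart[] {
  const parts: StringPart[] = [];
  let pos = 0;
  while (pos < content.length) {
    const start = content.indexOf("${", pos);
    const end = start < 0 ? -1 : content.indexOf("}", start + 2);
    if (start < 0 || end < 0) {
      // No further complete interpolation: the rest is plain text.
      parts.push({ kind: "text", value: content.substring(pos) });
      break;
    }
    if (start > pos) {
      parts.push({ kind: "text", value: content.substring(pos, start) });
    }
    parts.push({ kind: "interpolation", value: content.substring(start + 2, end) });
    pos = end + 1;
  }
  return parts;
}
```

For the content `a${1+2}b`, this sketch returns a text part `a`, an interpolation part `1+2`, and a text part `b`.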

This is adapted from the official JavaCC grammar for FreeMarker: https://github.com/apache/freemarker/blob/2.3-gae/src/main/javacc/FTL.jj

Taken from the above file, a short rundown of the basic parser states:

The lexer portion defines 5 lexical states:
DEFAULT, FM_EXPRESSION, IN_PAREN, NO_PARSE, and EXPRESSION_COMMENT.
The DEFAULT state is when you are parsing
text but are not inside a FreeMarker expression.
FM_EXPRESSION is the state you are in
when the parser wants a FreeMarker expression.
IN_PAREN is almost identical really. The difference
is that you are in this state when you are within
FreeMarker expression and also within (...).
This is a necessary subtlety because the
">" and ">=" symbols can only be used
within parentheses because otherwise, it would
be ambiguous with the end of a directive.
So, for example, you enter the FM_EXPRESSION state
right after a ${ and leave it after the matching }.
Or, you enter the FM_EXPRESSION state right after
an "<if" and then, when you hit the matching ">"
that ends the if directive,
you go back to DEFAULT lexical state.
If, within the FM_EXPRESSION state, you enter a
parenthetical expression, you enter the IN_PAREN
state.
Note that whitespace is ignored in the
FM_EXPRESSION and IN_PAREN states
but is passed through to the parser as PCDATA in the DEFAULT state.
NO_PARSE and EXPRESSION_COMMENT are extremely simple
lexical states. NO_PARSE is when you are in a comment
block and EXPRESSION_COMMENT is when you are in a comment
that is within an FTL expression.
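The distinction between FM_EXPRESSION and IN_PAREN can be illustrated with a small sketch that looks for the ">" ending an angle-syntax directive. This is an assumption-laden simplification, not the JavaCC grammar: `findDirectiveEnd` is a hypothetical function, a parenthesis depth of 0 stands in for FM_EXPRESSION and a depth above 0 for IN_PAREN, and string literals and comments are ignored.

```typescript
// Returns the index of the ">" that closes the directive, scanning from
// `from` (the position right after the directive name), or -1 if none.
function findDirectiveEnd(source: string, from: number): number {
  let parenDepth = 0; // 0 ~ FM_EXPRESSION, greater than 0 ~ IN_PAREN
  for (let i = from; i < source.length; i++) {
    const ch = source.charAt(i);
    if (ch === "(") {
      parenDepth++;
    } else if (ch === ")" && parenDepth > 0) {
      parenDepth--;
    } else if (ch === ">" && parenDepth === 0) {
      // Outside parentheses, ">" can only be the end of the directive.
      return i;
    }
  }
  return -1;
}
```

For `<#if (a > b)>yes</#if>` scanned from index 4, the first ">" at index 8 is inside parentheses and is treated as an operator, so the sketch returns 12, the ">" that closes the directive.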

It should be noted that there is another parser state not mentioned in the above excerpt: NO_DIRECTIVE is used as the initial starting state when parsing the contents of a string literal, which is allowed to contain interpolations, but no directives. However, note that FreeMarker first tokenizes a string literal as-is, then during the parsing stage, it takes the (unescaped) content of the string literal, and tokenizes + parses that content with a new child parser.

@melloware

This is fantastic. I have been testing it with my real-world FreeMarker templates and it's working exactly the way I would expect it to!

@FlipWarthog

Holy cow, this is amazing. Well done, @blutorange !

@hediet
Member

hediet commented Jan 3, 2022

Thank you for this PR! It seems like you put a lot of effort into this.

Generally, it looks really good. However, I did not review every single line of those added ~25k lines of code, especially not the tests. At ten lines per second, a "proper" review would take more than 40 minutes.

@alexdima how do you think we should proceed here?

@alexdima
Member

This looks very good and thorough, thank you very much!

@alexdima alexdima merged commit 93c7165 into microsoft:main Jan 14, 2022
@alexdima alexdima added this to the January 2022 milestone Jan 14, 2022
@github-actions github-actions bot locked and limited conversation to collaborators Feb 28, 2022
Successfully merging this pull request may close these issues.

Apache FreeMarker Language Support
5 participants