[TypeScript] Add methods for BufferedTokenStream #4569

Phlosioneer · 2024-03-31T21:15:09Z

The typescript class was almost completely empty, so it was not possible to directly use the token stream in a type-safe way.

The typescript class was almost completely empty, so it was not possible to directly use the token stream in a type-safe way. Signed-off-by: Phlosioneer <mattmdrr2@gmail.com>

ericvergnaud · 2024-04-01T08:00:12Z

runtime/JavaScript/src/antlr4/BufferedTokenStream.d.ts

+    nextTokenOnChannel(i: number, channel: number): number;
+    previousTokenOnChannel(i: number, channel: number): number;
+
+    protected lazyInit(): void;


Why do you need these protected members ?

adjustSeekIndex is required for subclassing BufferedTokenStream, and it only exists for subclasses to use. See CommonTokenStream::adjustSeekIndex for an example.

lazyInit is needed if the subclass tries to access any of the fields/properties safely, possibly before one of the BufferedTokenStream methods is called. Technically more protected functions are available, like setup, but these two were the minimum required to correctly subclass the stream. Technically all the fields should be marked protected but that felt too restrictive, because the missing underscore convention (_index, _tokens) indicates the fields are public in JS.

I've actually made a mistake, there should be an additional index: number line. I'll push that to the PR branch tomorrow.

But why do you need to subclass BufferedTokenStream ?

Because I needed a token stream that did extra processing/rewriting, between the lexer and parser steps. It's a normal/supported thing to do, the Parser accepts any TokenStream as input.

Isn't that normally done by a TokenStreamRewriter, designed just for that purpose ?

TokenStreamRewriter is made with a C-style preprocessor in mind. Its output is text after rewrites have happened. It's not able to e.g. transform tokens, merge them together or split them apart, add context to tokens (technically subclass, but JS makes that very blurry), and generally fails to act as a stream transformer between the lexer and parser. You would need to run a file through a lexer, then the token stream rewriter, then through a different lexer, then through a parser. I'm not dealing with a preprocessor, I need a token stream transformer that outputs a stream of tokens.

I think this is a supported use case; otherwise there would be no need for a distinction between CommonTokenStream and BufferedTokenStream.

To ground this discussion, we're discussing two protected annotations in a typescript definition file as a companion to the already-implemented, already-publicly-accessible javascript functions. The ship has already sailed about whether these functions were a "good" interface or if the buffer is a "good" extension point. These functions are already part of the public API on every other language runtime except typescript.

Also, it should be noted that other runtimes put lazyInit as protected, never private.

Don't take me wrong, I'm just trying to understand your use case, notably in the context of the antlr5 rewrite.

Ah, I didn't know antlr was getting another rewrite. I'll detail my use case, perhaps it will help. However, yesterday I rewrote the whole parser & lexer from scratch and achieved much better results than antlr was able to provide. This is for a personal project. Read only if you're interested, nothing here directly bears on the current protected member discussion.

The DSL I made has identifiers / names of the form foo.bar.baz, dot-separated paths. It also allows a shorthand starting with a dot, .bar.baz, that meant the identifier was relative to some other identifier. There are block statements like with foo { ... .bar.baz ... } that should result in foo.bar.baz at the end of parsing.

OK, all fairly normal so far. The issue is that with blocks could be anywhere in the file. Inside any block or at the top level or nested within other with statements. I tried to spec this out in the parser's .g4 but my simple ~60 line parser ballooned into a few hundred lines, to account for all the possible places where with could happen; and the parser-listener became similarly complex. And that was still with a subset of the intended DSL.

What I really needed was something like with_generic<inner_parser>: 'with' (IDENT | REL_IDENT) '{' inner_parser* '}'; but that didn't seem possible. So I tried to move the handling of with into its own parser / transformer. The rule is extremely simple if stated as its own grammar; it's nearly find-and-replace. I tried to use the provided stream-rewriter, but the translation back to text destroyed metadata that I was tracking in the tokens, particularly the IDENT / REL_IDENT tokens, about the path they represent. I can see the utility of that rewriter for large complex transformations, but this task is extremely simple.

So I wrote a class that took in a token stream, located with blocks, and emitted corrected IDENT tokens. No new tokens were added to the stream, so the line, column, and character indexes of each token were still valid. The text was also still valid; the relative-to information was stored as a separate field. But then I ran into the above issue, where the Parser would not accept just any input stream, only a buffered input stream; and the buffered input stream wouldn't accept an arbitrary token stream, just a Lexer. So I made adjustments until it worked.

The DSL is something I made and could adjust, but it's being used as a game/puzzle element directly, so I couldn't just remove or dramatically change how the with blocks worked. What eventually drove me to rewrite the lexer & parser was frustration over the lexer not tokenizing some IDENT tokens (and not understanding how to properly handle error messages). Turns out, the grammar was actually ambiguous to lex, but I only understood that after rewriting. The lexer and parser ended up being less code, though; the full power and flexibility of antlr wasn't needed. The DSL is LL(1) with finite depth, so the state machine is extremely simple. And with access to generic functions, the with situation could be handled simply in the parser with a single rule-function (parseWith<T, U>(innerParser: (data: U) => T, data: U): void) invoked within the generic block rule-function (parseBlock<T, U>(statementParser: (data: U) => T, data: U): void which corresponds to antlr pseudocode parse_block<inner>: '{' (with_block<inner> | inner)* '}';). The most complex thing I had to do was implementing a prefix tree for lexing keywords. I probably could have just done a hash map with string keys but it was a fun challenge.

I took a look at the antlr5 repo. I like the concept. Though webassembly seems to be splitting in two with the WASI stuff, even two very similar runtimes is far better than 10 unrelated ones. I'll poke around a bit, maybe try to get involved.

Signed-off-by: Phlosioneer <mattmdrr2@gmail.com>

BufferedTokenStream's input was changed from Lexer to TokenSource, but TokenSource was empty. BufferedTokenStream expects two functions to exist on any TokenSource, getSourceName and nextToken. Parser also expects a _factory field, though the TokenFactory type doesn't exist yet in typescript. It also isn't used for anything within antlr4's runtime, though it may be used by generated files? These three entries are a subset of the Java runtime's TokenSource interface. Signed-off-by: Phlosioneer <mattmdrr2@gmail.com>

Phlosioneer · 2024-04-02T20:38:12Z

I've added the index field, and also added a minimal interface for TokenSource. This is needed because I switched the input field's type from Lexer to TokenSource without checking whether TokenSource had the needed methods.

ericvergnaud · 2024-04-06T13:54:04Z

runtime/JavaScript/src/antlr4/TokenSource.d.ts

@@ -1,3 +1,7 @@
-export declare class TokenSource {


mmm... not sure how this can work... the JS TokenSource does not implement these fields and methods.
I'd suggest making it a TS interface implemented by Lexer.
please bring this to a dedicated PR

Ok, will do. For this PR I'll put BufferedTokenStream back to "tokenSource: Lexer", that was type safe though limiting.

ericvergnaud · 2024-04-06T13:55:16Z

runtime/JavaScript/src/antlr4/BufferedTokenStream.d.ts

+    nextTokenOnChannel(i: number, channel: number): number;
+    previousTokenOnChannel(i: number, channel: number): number;
+
+    protected lazyInit(): void;


Don't take me wrong, I'm just trying to understand your use case, notably in the context of the antlr5 rewrite.

Add methods for BufferedTokenStream

e31915c

The typescript class was almost completely empty, so it was not possible to directly use the token stream in a type-safe way. Signed-off-by: Phlosioneer <mattmdrr2@gmail.com>

ericvergnaud reviewed Apr 1, 2024

View reviewed changes

Phlosioneer added 2 commits April 2, 2024 15:22

Add missing index field to buffer class

1f04204

Signed-off-by: Phlosioneer <mattmdrr2@gmail.com>

ericvergnaud requested changes Apr 6, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[TypeScript] Add methods for BufferedTokenStream #4569

[TypeScript] Add methods for BufferedTokenStream #4569

Phlosioneer commented Mar 31, 2024

ericvergnaud Apr 1, 2024 •

edited

Phlosioneer Apr 2, 2024

ericvergnaud Apr 2, 2024

Phlosioneer Apr 2, 2024

ericvergnaud Apr 2, 2024

Phlosioneer Apr 2, 2024

Phlosioneer Apr 2, 2024 •

edited

ericvergnaud Apr 6, 2024

Phlosioneer Apr 6, 2024 •

edited

Phlosioneer commented Apr 2, 2024

ericvergnaud Apr 6, 2024

Phlosioneer Apr 6, 2024

ericvergnaud Apr 6, 2024

[TypeScript] Add methods for BufferedTokenStream #4569

Are you sure you want to change the base?

[TypeScript] Add methods for BufferedTokenStream #4569

Conversation

Phlosioneer commented Mar 31, 2024

ericvergnaud Apr 1, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Phlosioneer Apr 2, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Phlosioneer Apr 6, 2024 • edited

Choose a reason for hiding this comment

Phlosioneer commented Apr 2, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ericvergnaud Apr 1, 2024 •

edited

Phlosioneer Apr 2, 2024 •

edited

Phlosioneer Apr 6, 2024 •

edited