Add Span to Tokens, AST nodes #161

Open
2 tasks
mullr opened this issue Apr 17, 2020 · 14 comments
Comments

@mullr

mullr commented Apr 17, 2020

I'd like to use this library for some refactoring and analysis tools. It would be very helpful to have source line/column information available on the AST nodes, something along the lines of Span from syn. (https://docs.rs/syn/1.0.17/syn/spanned/trait.Spanned.html)

I see that line/column is already tracked in the tokenizer. I'm thinking something along the lines of:

struct LineColumn {
    pub line: usize,
    pub column: usize,
}

struct Span {
  pub start: LineColumn,
  pub end: LineColumn,
}

struct TokenOccurrence {
  pub token: Token,
  pub span: Span,
}

struct Query {
  pub span: Span,
  pub ctes: Vec<Cte>,
  // ...
}

The main objection I can think of is that it adds some bulk to Tokens, and to each AST node. This could be alleviated, if desired, by hiding them behind a feature flag. I looked at syn to see how they do it; the approach there is very similar. AFAICT, their feature flagging doesn't actually affect the node/token representation: they just carry the span all the time.

Since the implementation here doesn't appear to be hyper-optimized, and since it's good enough for syn, it seems like this approach should be okay.

If I submitted a PR along these lines, would you consider accepting it?

Related work:

@nickolay
Collaborator

If I submitted a PR along these lines, would you consider accepting it?

I hope @andygrove and @benesch would comment.

As I'm also interested in analysis, I'd love something like this implemented.

Do you think this can be implemented without adding too much boilerplate? (In the tests, in particular, they're rather verbose as it is.)

A tangentially related issue I've been mulling over is the "AST" types we have now and how they can support dialects. For a while now I've wanted to experiment with adopting rust-analyzer's design of having a separate untyped/homogeneous syntax tree (similar to your Spans, but with an additional "type tag" and some structure), and a (generated) strongly typed API for accessing it. The first layer solves the problem you brought up, and the result can be more compact than our current representation.
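To make the rust-analyzer/rowan-style design concrete, here is a toy sketch of such an untyped, homogeneous tree. All names are hypothetical and greatly simplified; the point is just the shape: every element carries a "type tag", leaves keep exact source text, and the tree is lossless.

```rust
// Toy rowan-style tree: a kind tag on every element, source text on leaves.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SyntaxKind {
    Query,
    OrderBy,
    Ident,
    Keyword,
    Whitespace,
}

#[derive(Debug)]
pub enum Element {
    // Leaf token: kind tag plus the exact source text (whitespace and
    // comments included, so nothing is lost).
    Token { kind: SyntaxKind, text: String },
    // Interior node: kind tag plus children.
    Node { kind: SyntaxKind, children: Vec<Element> },
}

impl Element {
    // Concatenating the leaves' text reconstructs the original source
    // verbatim -- the "lossless" property.
    pub fn text(&self) -> String {
        match self {
            Element::Token { text, .. } => text.clone(),
            Element::Node { children, .. } => {
                children.iter().map(|c| c.text()).collect()
            }
        }
    }
}
```

A strongly typed accessor API would then be generated on top of this layer, instead of the tree itself being strongly typed.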

@mullr
Author

mullr commented Apr 21, 2020

Do you think this can be implemented without adding too much boilerplate? (In the tests, in particular, they're rather verbose as it is.)

I don't think it'll be too bad. I've hacked on this a little, and the initial approach I took was to exclude span fields from equality / hash computation, as they pretty clearly count as 'metadata' and not the data itself. But I haven't gotten as far as updating the test cases yet.
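The "spans as metadata" idea can be sketched with manual trait impls. The types below are illustrative, not the crate's actual definitions; the branch itself achieves the same effect declaratively with the derivative crate's `#[derivative(PartialEq = "ignore")]`, as noted later in the thread.

```rust
use std::hash::{Hash, Hasher};

#[derive(Debug, Clone, Copy)]
pub struct LineColumn {
    pub line: usize,
    pub column: usize,
}

#[derive(Debug, Clone, Copy)]
pub struct Span {
    pub start: LineColumn,
    pub end: LineColumn,
}

#[derive(Debug, Clone)]
pub struct Ident {
    pub value: String,
    pub span: Span, // metadata only: not part of equality or hashing
}

// Equality and hashing look at `value` alone, so two idents with the same
// text compare equal even when they come from different source locations,
// and existing AST comparisons in tests keep passing unchanged.
impl PartialEq for Ident {
    fn eq(&self, other: &Self) -> bool {
        self.value == other.value
    }
}
impl Eq for Ident {}
impl Hash for Ident {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.value.hash(state);
    }
}
```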

A tangentially related issue I've been mulling over is the "AST" types we have now and how they can support dialects. For a while now I've wanted to experiment with adopting rust-analyzer's design of having a separate untyped/homogeneous syntax tree (similar to your Spans, but with an additional "type tag" and some structure), and a (generated) strongly typed API for accessing it. The first layer solves the problem you brought up, and the result can be more compact than our current representation.

Their design goal of "Syntax trees are lossless, or full fidelity. All comments and whitespace are preserved." is what I actually want in the end. With Span information on both the token stream and the AST, I should be able to reverse out comment/AST-node associations. But having an intermediate homogeneous tree (a CST?) would be a much better option.

@mullr
Author

mullr commented Apr 22, 2020

I've pushed up a 'span' branch on my fork that implements a small slice of this, to serve as discussion fodder: https://github.com/mullr/sqlparser-rs/commits/span

Adding Spans to the tokenizer output was very easy. Getting them into the guts of the parser took a bit more work, but was straightforward. I added it to a single AST node (Ident), and I think I arrived at a pattern that isn't too disruptive, especially to the tests.

@andygrove
Collaborator

Sorry, I probably won't have time to look at this until the weekend, but I will review this.

@nickolay
Collaborator

The tests aren't affected, because of the #[derivative(PartialEq = "ignore")] on the new span member of Ident, right? Nice!

A few quick thoughts:

  • The changes introducing TokenOccurrence to the parser turn out to be more involved than I expected, in particular having to match on TokenOccurrence { token: Token::Word(ref k), .. } is unfortunate, but I don't have any ideas on how to improve it off-hand.
  • I now wonder what is the rule to determine if a span must be stored for a given bit of the AST. Take pub order_by: Vec<OrderByExpr> in Query. Would it be enough for the spans to be added only to existing nodes (Query and OrderByExpr)? What about pub rows: OffsetRows in Offset? pub asc: Option<bool> in OrderByExpr?
  • It seems that passing the spans explicitly throughout the parser would be the biggest pain, not the tests.
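The verbosity of matching through the wrapper can be seen in a minimal sketch (shapes hypothetical, following the thread's discussion rather than the crate's real API):

```rust
// Hypothetical, minimal shapes for illustration.
#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    Word(String),
    Comma,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Span {
    pub start: (usize, usize), // (line, column)
    pub end: (usize, usize),
}

#[derive(Debug, Clone, PartialEq)]
pub struct TokenOccurrence {
    pub token: Token,
    pub span: Span,
}

// Every keyword check now has to destructure the wrapper, which is the
// extra boilerplate being discussed:
fn as_keyword(occ: &TokenOccurrence) -> Option<&str> {
    match *occ {
        TokenOccurrence { token: Token::Word(ref k), .. } => Some(k.as_str()),
        _ => None,
    }
}
```

Before the change, the same check would have matched directly on `Token::Word(ref k)`.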

@mullr
Author

mullr commented Apr 22, 2020

The changes introducing TokenOccurrence to the parser turn out to be more involved than I expected, in particular having to match on TokenOccurrence { token: Token::Word(ref k), .. } is unfortunate, but I don't have any ideas on how to improve it off-hand.

Yeah, it felt pretty bad writing that, but I don't have a better idea either.

I now wonder what is the rule to determine if a span must be stored for a given bit of the AST. Take pub order_by: Vec<OrderByExpr> in Query. Would it be enough for the spans to be added only to existing nodes (Query and OrderByExpr)? What about pub rows: OffsetRows in Offset? pub asc: Option<bool> in OrderByExpr?

It's a tricky question, particularly because the AST here is quite abstract. I would probably put spans in places where there's already a struct that clearly corresponds to a piece of syntax, but wouldn't create new pieces of structure just to staple a Span onto them.

In your examples: OrderByExpr, yes. Query, yes. Query.offset: yes, but only because it's an Expr, and every Expr would get a span. The asc field of OrderByExpr: no.

A truly convincing answer to the question would probably require an intermediate representation, as discussed above.

It seems that passing the spans explicitly throughout the parser would be the biggest pain, not the tests.

Yep, that's likely where the bulk of the work is. "A pain" is a good way to describe it, as it should be fairly rote.

@mullr
Author

mullr commented Apr 22, 2020

One thing to note: only leaf AST nodes actually require the span to be attached to them as a field. Intermediate nodes can have an implementation of the Spanned trait which computes their span based on their children.
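A minimal sketch of that idea (all names hypothetical): leaves store a Span field, and intermediate nodes implement a Spanned trait by taking the union of their children's spans.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct LineColumn {
    pub line: usize,
    pub column: usize,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Span {
    pub start: LineColumn,
    pub end: LineColumn,
}

impl Span {
    // Smallest span covering both inputs.
    fn union(self, other: Span) -> Span {
        Span {
            start: self.start.min(other.start),
            end: self.end.max(other.end),
        }
    }
}

pub trait Spanned {
    fn span(&self) -> Span;
}

// A leaf stores its span as a field...
pub struct Ident {
    pub value: String,
    pub span: Span,
}
impl Spanned for Ident {
    fn span(&self) -> Span {
        self.span
    }
}

// ...while an intermediate node computes its span from its children.
pub struct BinOp {
    pub left: Ident,
    pub right: Ident,
}
impl Spanned for BinOp {
    fn span(&self) -> Span {
        self.left.span().union(self.right.span())
    }
}
```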

@nickolay
Collaborator

Intermediate nodes can have an implementation of the Spanned trait which computes their span based on their children.

I think that would only work for the rowan-like design, where concatenating the leaf nodes' text returns the original source. Coming back to the OrderByExpr example, it represents the ORDER BY <expr> ASC/DESC string in the source, but the only "leaf" node under it is Expr. So it seems like it needs a span too, and that logic applies to pretty much every node, no?

@mullr
Author

mullr commented Apr 27, 2020

I think that would only work for the rowan-like design, where concatenating the leaf nodes' text returns the original source. Coming back to the OrderByExpr example, it represents the ORDER BY <expr> ASC/DESC string in the source, but the only "leaf" node under it is Expr. So it seems like it needs a span too, and that logic applies to pretty much every node, no?

You're absolutely right, I hadn't thought of it that way.

@cswinter
Contributor

Some way of accessing spans would also enable better error reporting in e.g. LocustDB, since it makes it possible to point to specific spans within the original query string.
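A sketch of what such span-driven error reporting could look like, assuming a hypothetical single-line Span with 1-based line/column and an exclusive end column (not LocustDB's or this crate's actual types):

```rust
// Hypothetical span over a single line of the query text.
pub struct Span {
    pub line: usize,      // 1-based
    pub col_start: usize, // 1-based, inclusive
    pub col_end: usize,   // 1-based, exclusive
}

// Returns the offending source line with a caret marker underneath it,
// in the style of compiler diagnostics.
fn point_at(query: &str, span: &Span) -> String {
    let line = query.lines().nth(span.line - 1).unwrap_or("");
    let pad = " ".repeat(span.col_start - 1);
    let carets = "^".repeat(span.col_end - span.col_start);
    format!("{line}\n{pad}{carets}")
}
```

For `SELECT x FRM t` with a span covering `FRM`, this would underline just the misspelled keyword.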

@datocrats-org

datocrats-org commented Dec 3, 2021

I wanted to propose an approach that may be verbose but could help at least in testing, and in deriving a way to output SQL from the AST per #18 (which I think would require spans on tokens to implement). My goal is to see how each portion of SQL is parsed and eventually rewritten, and to choose how any comments are maintained or removed during the rewrite. I'd like to be able to visualize them in a git-diff-like view comparing the old to the new, stored within the AST somehow, with the opportunity to omit unnecessary extra comments if needed.

  • Keep the entire line to be parsed as input (comment included) and, repeat it only when moving to the next statement in the AST, flag as retain-original-sql
  • Optionally keep the line or line(s) before and after the line being parsed as input, flag as retain-prior-lines and retain-next-lines
  • Specify whether retain only the comment portion of those lines, flag as retain-comments-only or retain-all
  • When writing the SQL back out, include the original parsed line, commented out, directly above its newly written version
  • To distinguish comments in the written-out SQL, let me use four dashes ---- instead of two -- for single-line comments, or a doubled /*/* instead of /* for multi-line comments, via two flags retain-flag-prefix-single='--' and retain-flag-prefix-multi='/*'
    • This would allow users to specify their own method of distinguishing old comments from ones they may add manually when they open the generated script
    • A list of strings or regex patterns to find in comments that you wish to omit from the output: retain-no-output=['completed issue num','commenter id char']

@aochagavia

@alamb I saw you stepped up as a maintainer of the library and that you showed interest in getting #189 restarted. I'm not sure I'll have time to resurrect that PR, but just in case: is this issue still relevant to the project? And has anyone else worked on it since that PR?

@alamb
Collaborator

alamb commented Mar 21, 2024

Thanks @aochagavia -- the issue is definitely still relevant; it is an often-requested feature.

It came up recently here, in fact: #839 (comment)

@alamb
Collaborator

alamb commented Mar 21, 2024

I started collecting related work / tickets in the description of this ticket so we can track it more easily.

No branches or pull requests

7 participants