Add Span to Tokens, AST nodes #161

Open
2 tasks
mullr opened this issue Apr 17, 2020 · 14 comments
Comments

@mullr

mullr commented Apr 17, 2020

I'd like to use this library for some refactoring and analysis tools. It would be very helpful to have source line/column information available on the AST nodes, something along the lines of Span from syn. (https://docs.rs/syn/1.0.17/syn/spanned/trait.Spanned.html)

I see that line/column is already tracked in the tokenizer. I'm thinking something along the lines of:

struct LineColumn {
    pub line: usize,
    pub column: usize,
}

struct Span {
  pub start: LineColumn,
  pub end: LineColumn,
}

struct TokenOccurrence {
  pub token: Token,
  pub span: Span,
}

struct Query {
  pub span: Span,
  pub ctes: Vec<Cte>,
  // ...
}

The main objection I can think of is that it adds some bulk to Tokens, and to each AST node. This could be alleviated, if desired, by hiding them behind a feature flag. I looked at syn to see how they do it; the approach there is very similar. AFAICT, their feature flagging doesn't actually affect the node/token representation: they just carry the span all the time.

Since the implementation here doesn't appear to be hyper-optimized, and since it's good enough for syn, it seems like this approach should be okay.

If I submitted a PR along these lines, would you consider accepting it?

Related work:

@nickolay
Collaborator

If I submitted a PR along these lines, would you consider accepting it?

I hope @andygrove and @benesch would comment.

As I'm also interested in analysis, I'd love something like this implemented.

Do you think this can be implemented without adding too much boilerplate? (In the tests, in particular, they're rather verbose as it is.)

A tangentially related issue I've been mulling over is the "AST" types we have now and how they can support dialects. For a while now I've wanted to experiment with adopting rust-analyzer's design of having a separate untyped/homogeneous syntax tree (similar to your Spans, but with an additional "type tag" and some structure), and a (generated) strongly typed API for accessing it. The first layer solves the problem you brought up, and the result can be more compact than our current representation.
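To make the rust-analyzer/rowan-style design concrete, here is a toy sketch of such an untyped, homogeneous tree. All names are hypothetical and greatly simplified; the point is just the shape: every element carries a "type tag", leaves keep exact source text, and the tree is lossless.

```rust
// Toy rowan-style tree: a kind tag on every element, source text on leaves.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SyntaxKind {
    Query,
    OrderBy,
    Ident,
    Keyword,
    Whitespace,
}

#[derive(Debug)]
pub enum Element {
    // Leaf token: kind tag plus the exact source text (whitespace and
    // comments included, so nothing is lost).
    Token { kind: SyntaxKind, text: String },
    // Interior node: kind tag plus children.
    Node { kind: SyntaxKind, children: Vec<Element> },
}

impl Element {
    // Concatenating the leaves' text reconstructs the original source
    // verbatim -- the "lossless" property.
    pub fn text(&self) -> String {
        match self {
            Element::Token { text, .. } => text.clone(),
            Element::Node { children, .. } => {
                children.iter().map(|c| c.text()).collect()
            }
        }
    }
}
```

A strongly typed accessor API would then be generated on top of this layer, instead of the tree itself being strongly typed.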

@mullr
Author

mullr commented Apr 21, 2020

Do you think this can be implemented without adding too much boilerplate? (In the tests, in particular, they're rather verbose as it is.)

I don't think it'll be too bad. I've hacked on this a little, and the initial approach I took was to exclude span fields from equality / hash computation, as they pretty clearly count as 'metadata' and not the data itself. But I haven't gotten as far as updating the test cases yet.
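The "spans as metadata" idea can be sketched with manual trait impls. The types below are illustrative, not the crate's actual definitions; the branch itself achieves the same effect declaratively with the derivative crate's `#[derivative(PartialEq = "ignore")]`, as noted later in the thread.

```rust
use std::hash::{Hash, Hasher};

#[derive(Debug, Clone, Copy)]
pub struct LineColumn {
    pub line: usize,
    pub column: usize,
}

#[derive(Debug, Clone, Copy)]
pub struct Span {
    pub start: LineColumn,
    pub end: LineColumn,
}

#[derive(Debug, Clone)]
pub struct Ident {
    pub value: String,
    pub span: Span, // metadata only: not part of equality or hashing
}

// Equality and hashing look at `value` alone, so two idents with the same
// text compare equal even when they come from different source locations,
// and existing AST comparisons in tests keep passing unchanged.
impl PartialEq for Ident {
    fn eq(&self, other: &Self) -> bool {
        self.value == other.value
    }
}
impl Eq for Ident {}
impl Hash for Ident {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.value.hash(state);
    }
}
```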

A tangentially related issue I've been mulling over is the "AST" types we have now and how they can support dialects. For a while now I've wanted to experiment with adopting rust-analyzer's design of having a separate untyped/homogeneous syntax tree (similar to your Spans, but with an additional "type tag" and some structure), and a (generated) strongly typed API for accessing it. The first layer solves the problem you brought up, and the result can be more compact than our current representation.

Their design goal of "Syntax trees are lossless, or full fidelity. All comments and whitespace are preserved." is what I actually want in the end. With Span information on both the token stream and the AST, I should be able to reverse out comment/AST-node associations. But having an intermediate homogeneous tree (a CST?) would be a much better option.

@mullr
Author

mullr commented Apr 22, 2020

I've pushed up a 'span' branch on my fork that implements a small slice of this, to serve as discussion fodder: https://github.com/mullr/sqlparser-rs/commits/span

Adding Spans to the tokenizer output was very easy. Getting them into the guts of the parser took a bit more work, but was straightforward. I added it to a single AST node (Ident), and I think I arrived at a pattern that isn't too disruptive, especially to the tests.

@andygrove
Collaborator

Sorry, I probably won't have time to look at this until the weekend, but I will review this.

@nickolay
Collaborator

The tests aren't affected, because of the #[derivative(PartialEq = "ignore")] on the new span member of Ident, right? Nice!

A few quick thoughts:

  • The changes introducing TokenOccurrence to the parser turn out to be more involved than I expected, in particular having to match on TokenOccurrence { token: Token::Word(ref k), .. } is unfortunate, but I don't have any ideas on how to improve it off-hand.
  • I now wonder what is the rule to determine if a span must be stored for a given bit of the AST. Take pub order_by: Vec<OrderByExpr> in Query. Would it be enough for the spans to be added only to existing nodes (Query and OrderByExpr)? What about pub rows: OffsetRows in Offset? pub asc: Option<bool> in OrderByExpr?
  • It seems that passing the spans explicitly throughout the parser would be the biggest pain, not the tests.
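The verbosity of matching through the wrapper can be seen in a minimal sketch (shapes hypothetical, following the thread's discussion rather than the crate's real API):

```rust
// Hypothetical, minimal shapes for illustration.
#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    Word(String),
    Comma,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Span {
    pub start: (usize, usize), // (line, column)
    pub end: (usize, usize),
}

#[derive(Debug, Clone, PartialEq)]
pub struct TokenOccurrence {
    pub token: Token,
    pub span: Span,
}

// Every keyword check now has to destructure the wrapper, which is the
// extra boilerplate being discussed:
fn as_keyword(occ: &TokenOccurrence) -> Option<&str> {
    match *occ {
        TokenOccurrence { token: Token::Word(ref k), .. } => Some(k.as_str()),
        _ => None,
    }
}
```

Before the change, the same check would have matched directly on `Token::Word(ref k)`.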

@mullr
Author

mullr commented Apr 22, 2020

The changes introducing TokenOccurrence to the parser turn out to be more involved than I expected, in particular having to match on TokenOccurrence { token: Token::Word(ref k), .. } is unfortunate, but I don't have any ideas on how to improve it off-hand.

Yeah, it felt pretty bad writing that, but I don't have a better idea either.

I now wonder what is the rule to determine if a span must be stored for a given bit of the AST. Take pub order_by: Vec<OrderByExpr> in Query. Would it be enough for the spans to be added only to existing nodes (Query and OrderByExpr)? What about pub rows: OffsetRows in Offset? pub asc: Option<bool> in OrderByExpr?

It's a tricky question, particularly because the AST here is quite abstract. I would probably put spans in places where there's already a struct that clearly corresponds to a piece of syntax, but wouldn't create new pieces of structure just to staple a Span onto them.

In your examples: OrderByExpr, yes. Query, yes. Query.offset: yes, but only because it's an Expr, and every Expr would get a span. The asc field of OrderByExpr: no.

A truly convincing answer to the question would probably require an intermediate representation, as discussed above.

It seems that passing the spans explicitly throughout the parser would be the biggest pain, not the tests.

Yep, that's likely where the bulk of the work is. "A pain" is a good way to describe it, as it should be fairly rote.

@mullr
Author

mullr commented Apr 22, 2020

One thing to note: only leaf AST nodes actually require the span to be attached to them as a field. Intermediate nodes can have an implementation of the Spanned trait which computes their span based on their children.
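A minimal sketch of that idea (all names hypothetical): leaves store a Span field, and intermediate nodes implement a Spanned trait by taking the union of their children's spans.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct LineColumn {
    pub line: usize,
    pub column: usize,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Span {
    pub start: LineColumn,
    pub end: LineColumn,
}

impl Span {
    // Smallest span covering both inputs.
    fn union(self, other: Span) -> Span {
        Span {
            start: self.start.min(other.start),
            end: self.end.max(other.end),
        }
    }
}

pub trait Spanned {
    fn span(&self) -> Span;
}

// A leaf stores its span as a field...
pub struct Ident {
    pub value: String,
    pub span: Span,
}
impl Spanned for Ident {
    fn span(&self) -> Span {
        self.span
    }
}

// ...while an intermediate node computes its span from its children.
pub struct BinOp {
    pub left: Ident,
    pub right: Ident,
}
impl Spanned for BinOp {
    fn span(&self) -> Span {
        self.left.span().union(self.right.span())
    }
}
```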

@nickolay
Collaborator

Intermediate nodes can have an implementation of the Spanned trait which computes their span based on their children.

I think that would only work for the rowan-like design, where concatenating the leaf nodes' text returns the original source. Coming back to the OrderByExpr example, it represents the ORDER BY <expr> ASC/DESC string in the source, but the only "leaf" node under it is Expr. So it seems like it needs a span too, and that logic applies to pretty much every node, no?

@mullr
Author

mullr commented Apr 27, 2020

I think that would only work for the rowan-like design, where concatenating the leaf nodes' text returns the original source. Coming back to the OrderByExpr example, it represents the ORDER BY <expr> ASC/DESC string in the source, but the only "leaf" node under it is Expr. So it seems like it needs a span too, and that logic applies to pretty much every node, no?

You're absolutely right, I hadn't thought of it that way.

@cswinter
Contributor

Some way of accessing spans would also enable better error reporting in e.g. LocustDB, since it makes it possible to point to specific spans within the original query string.
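A sketch of what such span-driven error reporting could look like, assuming a hypothetical single-line Span with 1-based line/column and an exclusive end column (not LocustDB's or this crate's actual types):

```rust
// Hypothetical span over a single line of the query text.
pub struct Span {
    pub line: usize,      // 1-based
    pub col_start: usize, // 1-based, inclusive
    pub col_end: usize,   // 1-based, exclusive
}

// Returns the offending source line with a caret marker underneath it,
// in the style of compiler diagnostics.
fn point_at(query: &str, span: &Span) -> String {
    let line = query.lines().nth(span.line - 1).unwrap_or("");
    let pad = " ".repeat(span.col_start - 1);
    let carets = "^".repeat(span.col_end - span.col_start);
    format!("{line}\n{pad}{carets}")
}
```

For `SELECT x FRM t` with a span covering `FRM`, this would underline just the misspelled keyword.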

@datocrats-org

datocrats-org commented Dec 3, 2021

I wanted to propose an approach that may be verbose but could help at least in testing, and in deriving a way to output SQL from the AST per #18 (which I think would require spans on tokens to implement). My goal is to see how each portion of SQL is parsed and eventually rewritten, and to choose how any comments are maintained or removed during the rewrite. I'd like to be able to visualize them in a git-diff-like view comparing the old to the new, stored within the AST somehow, with the opportunity to omit unnecessary extra comments if needed.

  • Keep the entire line to be parsed as input (comment included) and, repeat it only when moving to the next statement in the AST, flag as retain-original-sql
  • Optionally keep the line or line(s) before and after the line being parsed as input, flag as retain-prior-lines and retain-next-lines
  • Specify whether retain only the comment portion of those lines, flag as retain-comments-only or retain-all
  • When writing the SQL back out, include the original parsed line, commented out, directly above its newly written version
  • To distinguish comments in the written-out SQL, let me use four dashes ---- instead of two -- for single-line comments, or a doubled /*/* instead of /* for multi-line comments, via two flags retain-flag-prefix-single='--' and retain-flag-prefix-multi='/*'
    • This would allow users to specify their own method of distinguishing old comments from ones they may add manually when they open the generated script
    • A list of strings or regex patterns to find in comments that you wish to omit from the output: retain-no-output=['completed issue num','commenter id char']

@aochagavia

@alamb I saw you stepped up as a maintainer of the library and that you showed interest in getting #189 restarted. I'm not sure I'll have time to resurrect that PR, but just in case: is this issue still relevant to the project? And has anyone else worked on it since that PR?

@alamb
Collaborator

alamb commented Mar 21, 2024

Thanks @aochagavia -- the issue is definitely still relevant; it is an often-requested feature.

It came up recently here, in fact: #839 (comment)

@alamb
Collaborator

alamb commented Mar 21, 2024

I started collecting related work / tickets in the description of this ticket so we can track it more easily.

No branches or pull requests

7 participants