Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Overlapping token spans #1266

Closed
andrewbaxter opened this issue Dec 13, 2022 · 5 comments
Closed

Overlapping token spans #1266

andrewbaxter opened this issue Dec 13, 2022 · 5 comments

Comments

@andrewbaxter
Copy link

andrewbaxter commented Dec 13, 2022

I was using code derived from the comment extraction snippet you left in another issue (not sure where now), and ended up with a panic since tokens from a token stream produced out of order token span bounds. FWIW there's other confusing bits about the output, so there might be something else going on here (like # tokens in the stream that don't appear in the source code).

I used this script to check for out of order line/columns in in this source file.

It basically checks if for any 2 consecutive line/columns, the former comes after the latter.

It fails on bin/lib.rs:

Ident { sym: pub, span: bytes(9527..9530) } - 368:0
Ident { sym: pub, span: bytes(9527..9530) } - 368:3
Ident { sym: struct, span: bytes(9531..9537) } - 368:4
Ident { sym: struct, span: bytes(9531..9537) } - 368:10
Ident { sym: FormatConfig, span: bytes(9538..9550) } - 368:11
Ident { sym: FormatConfig, span: bytes(9538..9550) } - 368:23
"{" - 368:24
"{" - 368:25
    Punct { char: '#', spacing: Alone, span: bytes(9557..9586) } - 369:4
    Punct { char: '#', spacing: Alone, span: bytes(9557..9586) } - 369:33
    "[" - 369:4
thread 'main' panicked at 'backwards!', src/bin/dump.rs:14:92
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

which is this section of the code

pub struct FormatConfig {
    /// Try to wrap to this width
    pub max_width: usize,
    /// If a node is split, all parents of the node must also be split
    pub root_splits: bool,
    /// If a node has comments, split it
    pub split_comments: bool,
}

(first line is 368)

So I guess

  1. Are token spans supposed to be ordered? This may be a bad assumption on my part. They seem mostly ordered...
  2. If not, is there a more reliable way to get comments?

Worst case, I could probably just ignore anything out of order. As long as it isn't the end of a non-whitespace text section it shouldn't cause fatal problems, maybe just losing some comments...

@andrewbaxter
Copy link
Author

Looking more closely, my best guess is that it's parsing /// and transforming it under the hood to #[doc="..."], so I'd guess it's making up spans for the generated tokens. Maybe this is relevant? #621

I couldn't quite figure out the boundaries in syn between verbatim and higher level parsing - some stylistically optional tokens are included (verbatim), but per this other things may undergo rather high level reinterpretations? How high-level is the parsing here supposed to be?

Is it possible to turn this off? Anyways, it seems like, since these are overlapping spans for a single synthetic element, it would be safe (as in, no panic) to just ignore line/cols that go backwards.

@andrewbaxter
Copy link
Author

I ended up needing the original comments anyways so I basically

  1. Parsed a token stream instead of the ast directly
  2. Filtered while recursing the token stream - if a # punct token came up where the matching source was a /, skipped all tokens until the end of the # span.
  3. Parsed the resulting token stream into the ast
  4. After that the original comments were handled as normal

@dtolnay
Copy link
Owner

dtolnay commented Jan 16, 2023

Glad that you figured it out!

@dtolnay dtolnay closed this as completed Jan 16, 2023
@andrewbaxter
Copy link
Author

andrewbaxter commented Jan 16, 2023

Sorry if it wasn't explicit above, but there were three asks here:

  1. Are there any other cases where spans from parsed source can be overlapping? This is critical for proper usage of the library and is missing from the documentation
  2. Could a toggle be added to disable doc comment transformation during parsing?
  3. Could you provide a mechanism or official recommendation for retrieving comments? You told someone else how to do that but your code doesn't work, and I feel like this is a fairly reasonable need for a parsing library (seeing as effort is taken to preserve other formatting information via spans)?

There was also a meta-ask about clarifying this project's stance on verbatim vs abstract parsing (i.e. on principal will it discard source information to normalize it, or will it try to preserve source information but end up with semantically identical source trees having different representations: final commas, doc comments, etc). Does this library have a principaled stance on this, or is it ad hoc?

I'm happy to create separate issues for these if that would help.

@dtolnay
Copy link
Owner

dtolnay commented Jan 16, 2023

  1. There aren't any guarantees made about that by this library.
  2. It is not in scope for this library.
  3. While it may be reasonable for another library to provide this, I don't have plans to do it as part of this library.
  4. The principle is semantically distinct source trees will have distinct representations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants