Overlapping token spans #1266

andrewbaxter · 2022-12-13T12:01:07Z

I was using code derived from the comment extraction snippet you left in another issue (not sure where now), and ended up with a panic since tokens from a token stream produced out of order token span bounds. FWIW there's other confusing bits about the output, so there might be something else going on here (like # tokens in the stream that don't appear in the source code).

I used this script to check for out of order line/columns in in this source file.

It basically checks if for any 2 consecutive line/columns, the former comes after the latter.

It fails on bin/lib.rs:

Ident { sym: pub, span: bytes(9527..9530) } - 368:0
Ident { sym: pub, span: bytes(9527..9530) } - 368:3
Ident { sym: struct, span: bytes(9531..9537) } - 368:4
Ident { sym: struct, span: bytes(9531..9537) } - 368:10
Ident { sym: FormatConfig, span: bytes(9538..9550) } - 368:11
Ident { sym: FormatConfig, span: bytes(9538..9550) } - 368:23
"{" - 368:24
"{" - 368:25
    Punct { char: '#', spacing: Alone, span: bytes(9557..9586) } - 369:4
    Punct { char: '#', spacing: Alone, span: bytes(9557..9586) } - 369:33
    "[" - 369:4
thread 'main' panicked at 'backwards!', src/bin/dump.rs:14:92
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

which is this section of the code

pub struct FormatConfig {
    /// Try to wrap to this width
    pub max_width: usize,
    /// If a node is split, all parents of the node must also be split
    pub root_splits: bool,
    /// If a node has comments, split it
    pub split_comments: bool,
}

(first line is 368)

So I guess

Are token spans supposed to be ordered? This may be a bad assumption on my part. They seem mostly ordered...
If not, is there a more reliable way to get comments?

Worst case, I could probably just ignore anything out of order. As long as it isn't the end of a non-whitespace text section it shouldn't cause fatal problems, maybe just losing some comments...

The text was updated successfully, but these errors were encountered:

andrewbaxter · 2022-12-13T13:01:41Z

Looking more closely, my best guess is that it's parsing /// and transforming it under the hood to #[doc="..."], so I'd guess it's making up spans for the generated tokens. Maybe this is relevant? #621

I couldn't quite figure out the boundaries in syn between verbatim and higher level parsing - some stylistically optional tokens are included (verbatim), but per this other things may undergo rather high level reinterpretations? How high-level is the parsing here supposed to be?

Is it possible to turn this off? Anyways, it seems like, since these are overlapping spans for a single synthetic element, it would be safe (as in, no panic) to just ignore line/cols that go backwards.

andrewbaxter · 2022-12-15T11:22:33Z

I ended up needing the original comments anyways so I basically

Parsed a token stream instead of the ast directly
Filtered while recursing the token stream - if a # punct token came up where the matching source was a /, skipped all tokens until the end of the # span.
Parsed the resulting token stream into the ast
After that the original comments were handled as normal

dtolnay · 2023-01-16T08:41:25Z

Glad that you figured it out!

andrewbaxter · 2023-01-16T08:52:19Z

Sorry if it wasn't explicit above, but there were three asks here:

Are there any other cases where spans from parsed source can be overlapping? This is critical for proper usage of the library and is missing from the documentation
Could a toggle be added to disable doc comment transformation during parsing?
Could you provide a mechanism or official recommendation for retrieving comments? You told someone else how to do that but your code doesn't work, and I feel like this is a fairly reasonable need for a parsing library (seeing as effort is taken to preserve other formatting information via spans)?

There was also a meta-ask about clarifying this project's stance on verbatim vs abstract parsing (i.e. on principal will it discard source information to normalize it, or will it try to preserve source information but end up with semantically identical source trees having different representations: final commas, doc comments, etc). Does this library have a principaled stance on this, or is it ad hoc?

I'm happy to create separate issues for these if that would help.

dtolnay · 2023-01-16T19:09:48Z

There aren't any guarantees made about that by this library.
It is not in scope for this library.
While it may be reasonable for another library to provide this, I don't have plans to do it as part of this library.
The principle is semantically distinct source trees will have distinct representations.

dtolnay closed this as completed Jan 16, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Overlapping token spans #1266

Overlapping token spans #1266

andrewbaxter commented Dec 13, 2022 •

edited

andrewbaxter commented Dec 13, 2022

andrewbaxter commented Dec 15, 2022

dtolnay commented Jan 16, 2023

andrewbaxter commented Jan 16, 2023 •

edited

dtolnay commented Jan 16, 2023

Overlapping token spans #1266

Overlapping token spans #1266

Comments

andrewbaxter commented Dec 13, 2022 • edited

andrewbaxter commented Dec 13, 2022

andrewbaxter commented Dec 15, 2022

dtolnay commented Jan 16, 2023

andrewbaxter commented Jan 16, 2023 • edited

dtolnay commented Jan 16, 2023

andrewbaxter commented Dec 13, 2022 •

edited

andrewbaxter commented Jan 16, 2023 •

edited