New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
impl PartialOrd for Cursor #1196
Conversation
(I'm up to date with master, so I don't think those CI failures are me? I cannot reproduce locally.) |
I am on board with a PartialOrd impl, but I don't think this implementation works. With a
|
Oh, I misunderstood how the cursor implementation works when I skimmed lightly. (Mainly because since |
487c114
to
58cc1fd
Compare
Given the tree nature of the TokenBuffer, there's no clever way to compare cursors directly. As such, the simplest and perhaps only way works out: just keep track of how many entries we've walked past, and compare that count. If you give me a pointer on where/how to add a test, I'll add a couple. |
src/buffer.rs
Outdated
fn partial_cmp(&self, other: &Self) -> Option<cmp::Ordering> { | ||
let Cursor { index, scope, .. } = self; | ||
if *scope == other.scope { | ||
Some(index.cmp(&other.index)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(It does look odd to PartialEq
over (ptr, scope)
and PartialOrd
over (index, scope)
, but this should be consistent.)
I am not pleased that this implementation adds bookkeeping overhead to every cursor operation (which underlies all parsing) even in programs that never use Speculative. That seems like the wrong tradeoff. Would you be able to come up with any other ways to implement this using only the existing data structures? |
I'm not super happy with it either. At the same time, though, this is a fairly small cost, so it might not be problematic? I think the only solution for comparing |
The reason I don't think anything better is possible is that cursors can be arbitrarily many The best middle ground I can think of is to carry along a pointer to the base stream, compare that, and only walk if they have the same base pointer. (This had a different idea previously but that was broken.) |
In practice they are not. You should be able to efficiently compute the depth of a cursor without using |
Well, I've found a sufficiently clever solution, I think. Clever enough that I'm not 100% confident betting on it without some good tests. But I think this handles all cases without falling back to counting cases to test# both rooted
« ∘ ● ∘ ● ∘ »
# rooted, group before
« ∘ « ∘ ● ∘ » ∘ ● ∘ »
# rooted, group after
« ∘ ● ∘ « ∘ ● ∘ » ∘ »
# rooted, group at eof
« ∘ ● ∘ « ∘ ● ∘ » »
# different group
« ∘ « ∘ ● ∘ » ∘ « ∘ ● ∘ » ∘ »
# increased nesting
« ∘ « ∘ « ∘ ● ∘ » ∘ » ∘ « ∘ « ∘ ● ∘ » ∘ » ∘ »
# all of the rooted cases, but in a group ...but this also lets me realize that this doesn't actually fully solve the problem that I want to solve, because if two speculative Well, just call that a limitation. Even with my PEG-generating-macro, I still intend to also encourage enums to be hand-parsed in LL(3), rather than relying on PEG. (Also, |
If this implementation looks okay, I'll replace |
// If we don't have the same scope end, the cursors are incomparable. | ||
if self.scope != other.scope { | ||
return None; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oof, I didn't realize this is hard to handle. I wonder if this makes the whole impl fruitless for your use case. For example your "rooted, group before" or "different group" test cases, but with parens instead of None groups:
# rooted, group before
∘ ( ∘ ● ∘ ) ∘ ● ∘
# different group
∘ ( ∘ ● ∘ ) ∘ ( ∘ ● ∘ ) ∘
One would probably require the first black dot to compare less than the second black dot, but with this impl a < b
and b < a
would both be false.
One would need to know to write a < b && !(b < a)
in order to get an actual meaningful comparison, or a.partial_cmp(b)
directly.
Can we just safely keep going with both cursors regardless of scope? The very very outermost Entry::End
contains ptr::null(), so we can look for that, instead of paying attention to scope
at all. Basically my mental model is this impl shouldn't care at all about any distinction between non-None groups, explicitly entered None groups, and implicitly entered None groups. The only point of scope
is distinguishing the implicitly entered None case from the other two (if I'm recalling correctly).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay I realize the objection: that's inconsistent with PartialEq. If we ignore scope, PartialEq could say cursors are not equal while PartialOrd could say they are equal (same ptr, different scope).
I wonder what could possibly go wrong if PartialEq looked at ptr only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay the existing PartialEq impl for Cursor is probably just wrong and bad. The only place in our codebase that uses it assumes it works the other way, and crashes with the current impl.
// [dependencies]
// syn = { version = "1", features = ["full", "extra-traits"] }
// proc-macro2 = "1"
// quote = "1"
use proc_macro2::{Delimiter, Group};
use quote::quote;
fn main() {
// Okay. Rustc allows top-level `static` with no value syntactically, but
// not semantically. Syn parses as Item::Verbatim.
let tokens = quote!(pub static FOO: usize; pub static BAR: usize;);
let file = syn::parse2::<syn::File>(tokens).unwrap();
println!("{:#?}", file);
// Okay.
let inner = Group::new(Delimiter::None, quote!(static FOO: usize = 0; pub static BAR: usize = 0));
let tokens = quote!(pub #inner;);
let file = syn::parse2::<syn::File>(tokens).unwrap();
println!("{:#?}", file);
// Parser crash.
let inner = Group::new(Delimiter::None, quote!(static FOO: usize; pub static BAR: usize));
let tokens = quote!(pub #inner;);
let file = syn::parse2::<syn::File>(tokens).unwrap();
println!("{:#?}", file);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I "fix" equality to only care about ptr
as well? This is safe, comparing cursor positions ignoring "partial eof" does seem generally more useful (and certainly is for the "which cursor is further along" query1).
If we do decide to do comparison ignoring "partial eof," I think the routine should be
- If
self.ptr == other.ptr
, returnEqual
. - Walk
self.ptr
toself.scope
, recording traversed ranges.- If
other.ptr
is in these ranges, returnLess
. (This step is purely an early-out optimization.)
- If
- Walk
other.ptr
toother.scope
, recording traversed ranges (iffself.scope != other.scope
).- If we enter
self
's traversed ranges, returnself_range.start <=> other_range.start
. - If
self.scope == other.scope
,unreachable!()
.
- If we enter
- Walk
self.scope
toEnd(nullptr)
, recording traversed ranges (iffother.scope != End(nullptr)
).- If we enter
other
's traversed ranges, returnself_range.start <=> other_range.start
.
- If we enter
- Walk
other.scope
toEnd(nullptr)
.- If we enter
self
's traversed ranges, returnself_range.start <=> other_range.start
.
- If we enter
- return
Incomparable
.
since compared cursors being in the same scope is still a common case worth optimizing for, I think.
This is IIUC self
all the way first).
More in the style of big-$O$, that's Cursor
is penalizing normal LL(3)
consumers for niche consumers who want to compare cursors.
Just counting tokens-to-end is
The interesting thing to note, though, is that what is actually wanted is [Cursor]::sort
, not Cursor::partial_ord
, and comparing for sort
can skip recalculating the walk-to-End(nullptr)
list for each cursor on each sort. Though IIUC this won't show up in big-$O$, it could be a significant improvement in practice.
Footnotes
-
Although it turns out my use case doesn't have that cursor; a parse stream's cursor is IIUC in an unspecified state after parse error and in practice is after the block that the error occurred in, not at the position of the error. Providing the cursor with the error is impossible since
Error: 'static
(and that's a good thing; much better than the relatively minor annoyance of ). ↩ -
e.g. a flat representation that makes cursor comparison purely
self.ptr <=> other.ptr
. ↩ -
e.g. an index of tokens yielded since the beginning of the root buffer, and a pointer to the root buffer to check they're comparable if you want to not just give consistent-but-meaningless results for comparing cursors in different root
TokenBuffer
s. ↩ -
The exceptions also being somewhat mitigated by early-exit, since when $d \not\ll n$, much of the nesting is likely shared. Exceptions are constructed and not likely in practice. ↩
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I "fix" equality to only care about ptr as well?
Yes I think we should change this. I would be shocked if anyone had found a use for the current behavior, and I would be shocked if changing it wouldn't fix a significant fraction of places the impl is used, much like it fixes our own buggy use.
As for tests: someday we should organize those but for now you can create a new top-level integration test file in the tests dir. Thanks! |
Sorry for stumbling in with an unexpectedly difficult problem 🙃 I honestly originally thought that I think I've determined my original use case isn't served by I've got some time this weekend to pursue whichever option you think is most worth trying. (Walk-to- After writing out this and the inline comment, I'm actually quite partial to trying for a flat tree repr... Footnotes
|
I'd be on board with pursuing a flat representation. I think your assessment is correct that it has odds in favor of performance positive, and obviously it makes PartialOrd easy and fast. I am also open to exploring possible ways to get a more meaningful cursor from a failed parse on a fork. |
Oh for the flat representation work: make sure to fetch from master since I was poking around in TokenBuffer construction recently, since it had to be paged in for reviewing this PR anyway. |
Closing in favor of the much simpler #1223 |
This is primarily intended for use with
discouraged::Speculative
parsing in order to implement one of the suggested error strategies, namely to show the error from the speculative parse which consumed the most tokens.Example use case
I am implementing a macro which allows declaring a syntax tree enum and automatically provides a PEG-style parse implementation which tries each variant in order. This is, in fact, the express reason I added
advance_to
back in 2019; I guess it took me a while to actually go back and try to finish the experiment.Yes, I know that syn is deliberately biased to LL(3) parsing. In fact, my original macro was a much more complicated grammar specification to allow it to describe LL(3) lookahead, but that complexity came at a cost of being painful to work both on and with. The intent of the new vastly simplified macro just using PEG is to use declarative PEG where it is enough, but still allow (and encourage) writing and using manual syntax elements mixed in with the declarative elements. Additionally, PEG exhibits much the same performance characteristics as LL(3) when the described grammar is in fact LL(3), as each speculative parse terminates within the three leading tokens, though it does still suffer from worse errors.
The currently emitted parse implementation is
Of note is the failure case:
In order to determine how far the forked
ParseBuffer
got, the first instinct would be to compare the.span()
s. However, comparing spans doesn't work in proc macros, and does the wrong thing anyway, since the spans could have arbitrary relations due to macro expansion. So I necessarily fall back to the only other option: counting the remaining unparsed tokens. With theCursor
API, this is most easily done by creating a new.token_stream()
, but could also potentially be done by counting the length of an iterator constructed from successively using.token_tree()
to get an advanced cursor.However, the
Cursor
implementation has the needed information to say whichCursor
is more advanced already, via the buffer entry pointers. ExposingCursor
ordering in this manner allows determining which speculative parse got the furthest just by comparing the cursors:An alternative is to add an API directly to
ParseStream
which allows comparing them, but it can't be part ofSpeculative
, because we didn't seal the trait.