Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement stricter env filter parsing #1542

Closed
wants to merge 2 commits into from

Conversation

CAD97
Copy link
Contributor

@CAD97 CAD97 commented Sep 8, 2021

Motivation

I'm working on a system that records tracing events in-process and displays them in a window, allowing filtering with EnvFilter-like syntax. As such I needed to parse the syntax, and the current parsing routine in tracing-subscriber is... subpar to say the least. Due to the way the regex are written/used, you can pretty trivially construct directive strings with very odd parses (e.g. [[[span[[] which parses as [span], or [{{{a}}}] which parses as []).

As such, I wrote a stricter parser for the format for my own use, and this PR polishes it up to the level to use it as the env filter parser for tracing.

Also, fixes #1584.

Solution

Implement a better parser and use it!


Important breaking behavior change note

Beyond the obvious changes in how things parse, there's a subtle one: [{name="bob"}]. Previously, this would parse as a directive to match field name against the regex "bob". Now, as quotes are semantic, it parses as a directive to match field name against the regex bob. Also note that filter regexes are anchored on both sides: this is a big behavior change!

For how this corresponds to captures, as of master 0fa74b9:

  • tracing::*_span!("f", name="bob") (captured by-Value) matches the regex bob and pretty displays as name: "bob",
  • tracing::*_span!("f", name=%"bob") (captured by-Display) matches the regex bob and pretty displays as name: bob,
  • tracing::*_span!("f", name=?"bob") (captured by-Debug) matches the regex "bob" and pretty displays as name: "bob", and
  • #[tracing::instrument] fn f(name: &str); f("bob") matches the regex bob.

In all cases, the obnoxiously pretty formatter says in f with name: "bob" (i.e. when printed no visual difference is made between the methods of capture).

Since string arguments are already captured by-Value and matched without strings in master, this will make filters with "bob" go from matching spans capturing strings by-Debug (rare?) to capturing strings by-anything-else (common). If they want the current behavior, they would have to use a regex hex escape currently.

Plus, if I'm not mistaken, pre #1378, argument strings were captured by-Debug, not by-Value, so they would need to be matched as "bob". This means #1378 broke behavior there, and this PR returns to "bob" matching an argument captured string.

Test changes

  • tracing_subscriber::filter::env::directive::parse_directives_with_special_characters_in_span_name: "/=[} are now forbidden in (unquoted) span names.
  • tracing_subscriber::filter::env::callsite_enabled_includes_span_directive_multiple_fields: previously, the directive didn't parse at all (EnvFilter cannot parse filters with multiple fields #1584) and thus the test was completely broken; now the directive parses and the interest is sometimes (which I think is correct).
  • tracing_subscriber::filter::env::roundtrip: previously, the directive was [span1{foo=1}]=error,[span2{bar=2 baz=false}],crate2[{quux="quuux"}]=debug, which included the problematic directive [{bar=2 baz=false}]; this has been changed to [{bar=2, baz=false}], with the alternative which would continue to parse successfully being [{bar="2 baz=false"}] or [{"bar=2 baz"=false}].
  • New tests in tracing_subscriber::filter::env::parsing.

Remaining work / decision items

  • More tests, always.
  • Should quoted field value directives always behave as a regex match instead of a value match? This seems reasonable to me, but as currently implemented, quotes are purely a syntax escape feature, and have no semantic meaning (i.e. val="3" is equivalent to val=3, and performs a structural match). Of note is that all structurally matched values can be spelled without reserved syntax, though this may potentially change with more structural captures.
  • Quoted directive fields currently have no escape syntax -- every character is literal other than ", which is disallowed (ends the quoted field).
  • Mark me as CODEOWNER for tracing_subscriber::filter::env::parsing

Future work

  • Use /? It's reserved currently as it is used by env_logger to specify a (global) regex filter for log messages.
  • Support multiple span filters in a single directive? What semantics would that have?
  • Autoref specialization in #[tracing::instrument] to mitigate the big behavior change, discussed above.
  • Make it possible (somehow) to use EnvFilter to filter third-party things that "quack like SpanRef<'_>/Event<'_>". This is actually my motivating factor in writing a new parser, to implement env-filter like filtering of serialized events/spans.
Original PR message

parse-env-filter is essentially part of this PR (github); I can inline the implementation here if desired, but I'll also be using the implementation separately, so it'd be nicer for me if it can live in a dedicated crate, either within this repo or in a repo I own.

Remaining work

This isn't quite mergable in its current state yet. Known deficiencies that should be fixed before merging:

  • The current implementation always splits the env filter string into multiple directives on each ,. (This means a parsed env filter can't support multiple field bounds, even though Directive nominally supports it! Whoops...) While parse-env-filter properly handles comma-separated field directives (and span directives, which would be a new feature), this PR currently still does the top-level split(',') for convenience of implementation.
  • parse-env-filter currently completely bails on a directive containing the " (or /) character. This is because I haven't (but plan to) implement some form of quoting for the fields of the directive, so that the fields may contain syntax characters within the quotes. The exact quoting syntax needs to be decided, and the implementation work needs to happen. / is reserved because of env_logger's syntax; this one might be desirable to keep as an error because of that?
  • parse-env-filter doesn't do any error recovery; it just bails upon bad syntax and calls the rest of the directive string invalid. The current impl can recover, as every , is treated as the start of a new directive. Allowing , inside a directive, mismatched brackets, and more, causes error recovery to continue parsing directives a much harder task, though. parse-env-filter's behavior matches try_new's, but new's behavior of "best effort parse" is at odds with parse-env-filter's goal of more strict, well-defined parsing.
  • parse-env-filter deliberately does no whitespace trimming; whitespace is provided as part of the directives. EnvFilter needs to decide if it wants to trim whitespace, and do so if it wants to.

Known test failures

  • filter::env::directive::test::parse_directives_with_special_characters_in_span_name: special characters include " (parser support todo) and [/{///, which are deliberately disallowed. How to handle "quoting should be decided, I need to then implement"handling, and the forbidden nature of[/{is an ideological question:parse-env-filter` aims to be strict and validate, whereas the current impl is more forgiving and loose with the filter syntax.
  • filter::env::tests::callsite_enabled_includes_span_directive_field: test includes quoted field value (parser support todo)
  • filter::env::tests::roundtrip: test includes quoted field value (parser support todo), and the formatting of the directive which is being checked to roundtrip includes the quotes as well, so this is more inherent a problem than the previous tests

r? @hawkw

Immediate discussion items:

  • Decide what (if any) quoting syntax with " is desired for env filter directives
  • Decide where on the strict-permissive parser axis we want EnvFilter to fall
    • Decide how EnvFilter::new should do recovery along that axis
  • Decide on how to treat /
  • Decide if we actually want multiple span/field directives in a single directive (and what the semantics of it are, though I'm here for the parser, not the filtering)
    • Multiple spans: target[span1,span2]=level
    • Multiple fields: target[span{field1=value1, field2=value2}]
  • Decide what (if any) whitespace trimming (e.g. after ,) should be done

@CAD97
Copy link
Contributor Author

CAD97 commented Sep 8, 2021

This is, obviously, a breaking behavior change as it stands. Also fun: no more regex or lazy_static dependencies for tracing-subscriber! It's a simple format, so a carefully hand-rolled parser is perfectly acceptable (and potentially preferable).

@hawkw
Copy link
Member

hawkw commented Sep 8, 2021

Thanks for working on this, this is great! The filter-parsing code is definitely my least favorite part of tracing-subscriber, and it's great to see some improvement in that area.

I'll try to give a more thorough review of the new code soon, but for now, here are some thoughts on the "immediate discussion items":

  • Decide what (if any) quoting syntax with " is desired for env filter directives

Quotes are currently used to delimit string field values. We should probably allow other reserved characters (e.g. =, {/}, [,], ,, and /) while inside of a quote pair, because otherwise, it wouldn't be possible to filter on string values including those characters...

  • Decide where on the strict-permissive parser axis we want EnvFilter to fall
    • Decide how EnvFilter::new should do recovery along that axis

The current behavior of EnvFilter::new, skipping invalid directives individually and trying to recover by parsing the rest of the filter string, is based on what env_logger does by default; I wrote it that way because I figured it was what users coming from env_logger would expect by default. I'm not particularly attached to this, but I think it's probably preferable to continue providing both a strict and permissive API. Ideally, we'd continue to have an EnvFilter::new that attempts to recover from individual invalid directives when possible, and have a try_new that fails if any directive is malformed. I'm not sure if we can always recover from single invalid directives, though, because we should allow commas in quoted strings in field value filter expressions...

  • Decide on how to treat /

As I mentioned above, we probably want to allow it inside of quoted string values. We should probably treat it as reserved if it's not inside a quoted string.

  • Decide if we actually want multiple span/field directives in a single directive (and what the semantics of it are, though I'm here for the parser, not the filtering)
    • Multiple spans: target[span1,span2]=level
    • Multiple fields: target[span{field1=value1, field2=value2}]

We currently have filtering semantics for spans with multiple field values, so we should continue to accept that. We don't have semantics for filters with multiple span names in them, currently, so the parser should reject that.

Decide what (if any) whitespace trimming (e.g. after ,) should be done

Hmm...targets generally won't have whitespace in them, as they're usually Rust module paths. However, they can be overridden with arbitrary user-provided strings, which could (theoretically) contain whitespace, including leading whitespace. This is kind of an edge case, though.

Therefore, I think we should probably trim whitespace after ,,so that users can write whitespace-delimited filter strings, like foo=info, foo::bar=trace or whatever, and get "foo::bar" (a target that actually exists) rather than " foo::bar" (with leading whitespace that won't match the actual module path). We may want to allow quoted strings with arbitrary characters as targets, as well as field values, so that if users need leading whitespace (or other special characters), they could write foo=info," target with leading whitespace"=trace, in the rare occasion that they actually want that.

@hawkw
Copy link
Member

hawkw commented Sep 8, 2021

Regarding where the code should go,

parse-env-filter is essentially part of this PR (github); I can inline the implementation here if desired, but I'll also be using the implementation separately, so it'd be nicer for me if it can live in a dedicated crate, either within this repo or in a repo I own.

I'd really prefer for the filter-parsing code to be part of the tracing repo, ideally in the tracing-subscriber crate --- I'd prefer to not depend on an external crate for that, even if it is also owned by the tracing repo.

However, since you wrote it (and certainly have a deeper understanding of it than I do), I'd love for you to be added to the .github/CODEOWNERS file for the env-filter parsing code. That way, you'll automatically be tagged as a reviewer for any future changes to the code you contributed. Of course, I'd also add you as a repo contributor so that you can actually approve or request changes on those reviews. How does that sound?

Since you'd like to use the parser implementation separately in your own code, is it possible that there are additional API changes we could make in tracing-subscriber so that you could depend on an implementation in the tracing-subscriber crate for your own purposes as well? Or, is it really important for it to live in a separate crate?

As a side note:

I'm working on a system that records tracing events in-process and displays them in a window, allowing filtering with EnvFilter-like syntax.

This sounds really cool! When your project is ready to release, we'd love it if you added it to the list of related crates, and maybe even write a blog post or something to show it off, if you're interested?

@CAD97
Copy link
Contributor Author

CAD97 commented Sep 9, 2021

I'd really prefer for the filter-parsing code to be part of the tracing repo, [...] I'd love for you to be added to the .github/CODEOWNERS file for the env-filter parsing code

Sounds good to me.

is it possible that there are additional API changes we could make in tracing-subscriber so that you could depend on an implementation in the tracing-subscriber crate for your own purposes as well?

Ideally, I'd be able to just use EnvFilter, actually. The API to support that would be EnvFilter::directives(&self) -> impl Iterator<Item=&Directive> (easy) and Directive::includes(&self, impl EventLike) -> bool (more difficult). The open question there is how to structure trait EventLike such that the directive can filter both tracing::Event<'_> and my own tracing-memory::Event, which is effectively a serialized event.

More details about tracing-memory and tracing-egui tracing-memory: a layer that archives event/span data so the event history of the runtime can be inspected. tracing-egui: an [egui](http://lib.rs/egui) widget that displays the recorded event history in an egui context and allows filtering.

An additional wrinkle on top is that it'd need to use a trait SpanLike as well (rather than tracing::Span or SpanRef), since I serialize the span data out as well. It'd be cool if I could store Span instead, and just use the existing data store, but there's two main reasons I haven't put more effort into pursuing that approach:

  • It'd keep other data in Extensions that tracing-memory doesn't know about alive as well, which is unfortunate, and more notably,
  • There's no current way to go from SpanRef<'_> and Context to Span, because upgrading Context's borrowed collector to an owned collector would be difficult (though possible, with e.g. ArcBorrow and an enum for if it's the &'static collector).

Those two combined have made me settle on just serializing the field data of spans. (I'm just holding &'static Metadata though, so no duplication there.)

This sounds really cool! When your project is ready to release, we'd love it if you added it to the list of related crates, and maybe even write a blog post or something to show it off, if you're interested?

Definitely! tracing-egui definitely should live as a separate project, but I'd even be super happy to contribute tracing-memory into tracing proper as a shared in-memory archive. Though... I also need to check out how you're recording events in tokio-console as well, because there's a very good chance you're cleverer than I am here.

Either way, the data console is capturing has significant overlap with what I'm doing, so it makes sense to see what implementation overlap there is and try to exploit that, at least from my side of the deal. The "only" difference with what I'm trying out and console is that console is sending the observability data into a separate process, where I'm just trying to capture a nicely structured and searchable in-app debug log for my hobbyist game engine. (External tools are great, but with gamedev tight iteration cycles having a simple version in-app is often better, so long as it can be no-overhead for release.)

Quotes are currently used to delimit string field values.

This is going to sound very patronizing, but.... trying it out, it doesn't look like this is strictly the case, at least as I want to interpret what you're saying. e.g. I'm running the test case (with crates-io tracing) of

    let filter =
        tracing_subscriber::EnvFilter::try_new(r#"playground[span{name=bob}]=warn"#).unwrap();

    tracing_subscriber::registry()
        .with(tracing_subscriber::fmt::layer().pretty())
        .with(filter)
        .init();

    let span = tracing::error_span!("span", name = "bob");
    span.in_scope(|| {
        tracing::error!(name = "bob", "oh no");
    });

which logs the event, as

  Sep 08 19:25:23.736 ERROR playground: oh no, name: "bob"
    at src\main.rs:14
    in playground::span with name: "bob"

whereas the filter directive of playground[span{name="bob"}]=warn does not log the event. What's interesting, though, is that if I instead do

    // ...
    span("bob");
}

#[tracing::instrument]
fn span(name: &str) {
    tracing::error!(name = "bob", "oh no");
}

then the polarity is flipped; {name=bob} doesn't match, but {name="bob"} does.

For the full array,

let name = "bob";
let span = tracing::error_span!("span", /* what goes here */);
  • name => {name=bob}
  • %name => {name=bob}
  • ?name => {name="bob"}

and in all cases the log says

    in playground::span with name: "bob"

At least with this investigation I know I'm not seeing things; the handling of " in the current Directive/field::Match is.... just to pass it along to matchers as-is, thus matching the " as part of (whatever form of) the value it ends up matching against.

Based on observed effects, it seems that a captured-by-Value or captured-by-Display str gets displayed-by-Debug, whereas a captured-by-Debug str gets displayed-by-Display-of-Debug. However, they're filtered differently: the former gets filtered-by-Display, and the latter gets filtered-by-Display-of-Debug. This leads to the strange (dare I say inconsistent?) behavior around quotes that threw me for a loop.

We should probably allow other reserved characters (e.g. =, {/}, [,], ,, and /) while inside of a quote pair

Do we want any way to escape quotes within the quoted directive field, or is just quote-to-quote enough? Current implementation: just passes the field string from = to } to matchers, unaltered. So, "foo"oof" gets turned into a match, unaltered.

There's also the question of whether non-prefix-quotes should be allowed or considered syntax errors.

I a few reasonable paths forward, though I don't really like any of them:

  • Just drop the reservation of ". You can't write {}[]/, as part of a directive field, but behavior of field="foo" is identical to what it is today.
  • Allow quoting; field ends at second "; directive string is passed unchanged to matchers. You can't write " as part of a directive field, but the behavior of field="foo" is identical to what it is today (that is to say, imho oddly inconsistent).
  • Allow quoting; allow interior escapes (either as \"+\\ (bad w.r.t. regex) or "" (unfamiliar but not without precedent)); pass a processed string (without our escape syntax, or exterior quotes) to matchers.
    • Don't change the behavior of matching w.r.t. quotes. Imho this is the worst option; would require writing e.g. {name="\"bob\""} to get the current behavior of {name="bob"}, and {name="bob"} would be equivalent to {name=bob} where it wasn't previously, for questionable gain.
    • Change the behavior of matching w.r.t. quotes. I think the "best" answer here is to employ autoref specialization to capture &str arguments to a tracing::instrumented function by-Value rather than by-Debug.
      • This has the upside of, that while {name="bob"} changes meaning, it still matches the same calls to #[tracing::instrument) fn foo(name: &str), due to "stripping the quotes off" of both the filter and the filtered.
      • Is this done for other Visit-enabled primitives? Or are they actually captured by-Debug by tracing::instrument? (Quick testing results: neither {name=true} nor {name="true"} allowed the span through; {name} did; it's almost certainly being captured by-Debug then the filter is trying to filter it by-primitive (which it isn't, at this point) 😅) Most of my diagnosis here is still speculation because I've only dug into parsing code, not any of the display or capturing code. If we do it for &str we should do it for every Visit primitive.

I kinda lied when I said I don't like any of the options, or I figured out the last one while I was writing that, one of the two. The last option seems actually not bad, just a decent amount of work (not all of which is parsing input, which has accidentally become my specialty apparently).

But... since unquoted strings are a thing at all, we need to make sure that filters which are trying to match Visit-enabled primitives also allow matching said primitives as &str. This honestly seems reasonable enough, since every primitive we recognize and parse out has the exact same string representation as what we parse.

I'm not sure if we can always recover from single invalid directives, though, because we should allow commas in quoted strings in field value filter expressions...

I think the most reasonable option here is:

  • Provide try_new which is strict as today, and for new,
  • At the root level, if encountering ]{}", ignore said directive error, and resume trying to parse directives at the next ,,
  • But also, if encountering a syntax error not at the root level (i.e. within [span] brackets), bail on the remainder of the directive and just give what's been parsed so far.

(This would prohibit ever allowing e.g. target{field=name} without the [span] brackets; I'm not sure exactly what this syntax would mean (fields of the event? shorthand for [{field=name}]?), but it is syntax that seems ripe for giving meaning to that would be given "ignore this comma splice" semantics.)

Alternatively, we could always just bail on the current directive only and attempt to resume at the next ,, which would always allow a superset of what the above would, and would, by my guess, often, if not even most of the time, recover semi-gracefully. However, it would also kind of mean that there's no reserved syntax, as any invalid syntax is kinda locked in to this behavior. It's worth an experiment to see how it behaves, at least, imho.

We currently have filtering semantics for spans with multiple field values, so we should continue to accept that.

I... just want to make sure that everyone knows that yes, while multiple field values is theoretically supported by Directive::from_str and EnvFilter::add_directives, EnvFilter::from_str will never reach this, as {a=b,c=d} contains a ,, and thus the string will be pre-split on said , before it hits Directive::from_str. So any use of EnvFilter that's not using add_directives does not support multiple field directives 😅

Otherwise yes, this seems fair; I'll hook up multiple field directives and give a syntax error on multiple span directives in a single directive. (It's worth noting that if this is recovered from, it will mean adding multiple span directives later would be a behavior-breaking change, thus I'm more in favor of just bailing on the rest of the directive if a syntax error happens not at the root level.)

whitespace trimming

I agree with your analysis here, though I do just want to bring up the question of both

  • when not starting with ", are internal " considered a syntax error, or just allowed as part of the string, and
  • when starting with " for a quoted string, how (and do we) allow internal ".

Just never allowing internal " is quite convenient for implementation, as we don't have to do any unescaping, but means it'd be impossible to ever match a string which both contains " and another syntax character, at least without using regex to match it.


Strangely enough, if we go with all of my suggested directions in this post (which I all stand behind as the best options), I don't know how much if any of the code I've already written into this PR would be usable. Probably not much, but the knowledge gained will allow a new, stronger approach once we've ironed out the kinks.

Anyway, thanks for putting up with my perhaps overly pedantic approach and I'm glad I can help improve tracing even a little bit; it's a great system! (Also, the more code I sneak into rustc and rust-analyzer the better I feel about myself. Please go away, impostor syndrome.)

@CAD97
Copy link
Contributor Author

CAD97 commented Sep 14, 2021

Just a note: I'm committed at this point to effectively rewiring this PR to more directly meet the desires of tracing-subscriber. The direction of the PR remains relevant; specific implementation much less so.

@CAD97
Copy link
Contributor Author

CAD97 commented Sep 21, 2021

MSRV failure is not me, it's the http crate. That said, I may have used MSRV-bumping language features accidentally, though I don't know because of http 🙃

Anyway: rewrite is done, and full parsing support is here! I've just updated the OP with the new status of the PR, and marked it ready for review since all the tests are passing. Some minor improvements do still remain (see OP), but the meat is there.

Maybe not my best-designed parser implementation, but it's successful at its goals.

@CAD97 CAD97 marked this pull request as ready for review September 21, 2021 20:14
@CAD97 CAD97 requested review from davidbarsky, hawkw and a team as code owners September 21, 2021 20:14
@CAD97
Copy link
Contributor Author

CAD97 commented Sep 21, 2021

By the way, the reason for phrasing the parsers as (&str) -> (Result<_, _>, usize) instead of (&mut &str) -> Result<_, _> is to track the index at which an error occurs. This didn't really happen, so if it's deemed fine to just have "invalid directive" instead of a more informative error of at least where parsing failed, the parsing code could be cleaned up by taking &mut &str.

Or I could use a custom Cursor { source, offset } struct as input, or I could go full compiler engineer on parsing, but I'm trying to keep it simple enough, since most of the job is actually just skipping text.

@CAD97
Copy link
Contributor Author

CAD97 commented Sep 24, 2021

Did a force push to retrigger CI, so now MSRV failure is my fault. Will get to that next week probably; otherwise the PR is good to review.

@hawkw
Copy link
Member

hawkw commented Sep 24, 2021

Did a force push to retrigger CI, so now MSRV failure is my fault. Will get to that next week probably; otherwise the PR is good to review.

btw, MSRV is fixed on master, so updating this branch should work

@CAD97
Copy link
Contributor Author

CAD97 commented Sep 25, 2021

GitHub CI actually runs on the merge commit. I was mistakenly using nested or patterns, which are recently stabilized. Force push was a zero change push just to trigger a CI build on a new provisional merge.

@CAD97
Copy link
Contributor Author

CAD97 commented Oct 3, 2021

It looks like netlify published but never reported back.

@CAD97
Copy link
Contributor Author

CAD97 commented Oct 3, 2021

Rebased. Also, note that my analysis of the breaking behavior change note has been changed by #1378 being published on 0.1.X, which effectively just did the "mitigation" follow-up suggested, and changed how &str arguments are captured and thus matched in filters. I've updated the OP with the new situation.

@CAD97
Copy link
Contributor Author

CAD97 commented Mar 17, 2022

Update: I'm moving this effort into a separate effort over at https://github.com/CAD97/tracing-filter.

@CAD97 CAD97 closed this Mar 17, 2022
@CAD97 CAD97 deleted the filter-parsing branch March 17, 2022 20:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

EnvFilter cannot parse filters with multiple fields
2 participants