Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make TraceContextExt::span() return an Option #1089

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

djc
Copy link
Contributor

@djc djc commented Jun 1, 2023

Changes

The changes that @shaun-cox made in #1088 make sense to me, but I wanted to look into the core TraceContextExt API, which I don't think makes much sense for Rust. Instead of having a span() method that fabricates a NOOP_SPAN and yields if it there is no active span and a separate has_active_span() to check if there is a currently active span, it seems more idiomatic to make span() yield an Option.

Merge requirement checklist

  • CONTRIBUTING guidelines followed
  • Unit tests added/updated (if applicable)
  • Appropriate CHANGELOG.md files updated for non-trivial, user-facing changes
  • Changes in public API reviewed (if applicable)

@djc djc requested a review from a team as a code owner June 1, 2023 08:58
@djc
Copy link
Contributor Author

djc commented Jun 1, 2023

I'm happy to look at the breaking tests here, but would like some consensus first that this is a good direction.

@shaun-cox
Copy link
Contributor

I like the direction! I'll add your remote and run some benchmarks...

@jtescher
Copy link
Member

jtescher commented Jun 1, 2023

Hm I agree that this is more idiomatic but think it needs to return something to satisfy this requirement of the spec.

If the parent Context contains no Span, an empty non-recording Span MUST be returned instead (i.e., having a SpanContext with all-zero Span and Trace IDs, empty Tracestate, and unsampled TraceFlags).

Could possibly have another method that has this behavior to work around the spec constraint? but the confusion may make it net negative from a DX perspective.

@shaun-cox
Copy link
Contributor

@djc, here are the benchmark results for this change.

shaun@shaun-amd:~/code/open-telemetry/opentelemetry-rust(djc-optional-span)$ taskset --cpu-list 4 cargo bench -p opentelemetry_sdk --bench trace -- --load-baseline djc --baseline main start-end
    Finished bench [optimized] target(s) in 0.12s
     Running benches/trace.rs (target/release/deps/trace-b2444cb9aa7f705e)
start-end-span/always-sample
                        time:   [594.56 ns 618.02 ns 639.49 ns]
                        change: [-12.514% -7.5863% -2.2491%] (p = 0.01 < 0.05)
                        Performance has improved.
start-end-span/never-sample
                        time:   [171.08 ns 171.50 ns 171.91 ns]
                        change: [-8.5121% -8.3054% -8.0769%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

start-end-span-4-attrs/always-sample
                        time:   [944.12 ns 977.55 ns 1.0092 µs]
                        change: [-12.035% -6.8071% -0.5840%] (p = 0.03 < 0.05)
                        Change within noise threshold.
start-end-span-4-attrs/never-sample
                        time:   [220.33 ns 220.79 ns 221.28 ns]
                        change: [-4.6267% -4.4331% -4.2439%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

start-end-span-8-attrs/always-sample
                        time:   [1.4526 µs 1.4984 µs 1.5406 µs]
                        change: [-2.8210% +4.5795% +13.103%] (p = 0.29 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe
start-end-span-8-attrs/never-sample
                        time:   [270.36 ns 270.95 ns 271.55 ns]
                        change: [-5.4790% -5.2004% -4.9441%] (p = 0.00 < 0.05)
                        Performance has improved.

start-end-span-all-attr-types/always-sample
                        time:   [1.2232 µs 1.2619 µs 1.2969 µs]
                        change: [-4.2679% +1.8863% +8.1148%] (p = 0.55 > 0.05)
                        No change in performance detected.
start-end-span-all-attr-types/never-sample
                        time:   [234.85 ns 235.36 ns 235.90 ns]
                        change: [-5.0780% -4.7854% -4.5055%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

start-end-span-all-attr-types-2x/always-sample
                        time:   [2.0825 µs 2.1587 µs 2.2285 µs]
                        change: [-12.105% -4.7848% +2.6529%] (p = 0.24 > 0.05)
                        No change in performance detected.
start-end-span-all-attr-types-2x/never-sample
                        time:   [298.57 ns 299.12 ns 299.63 ns]
                        change: [-4.5375% -4.1205% -3.7261%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 21 outliers among 100 measurements (21.00%)
  5 (5.00%) low severe
  13 (13.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe

@hdost
Copy link
Contributor

hdost commented Jun 2, 2023

Maybe the trade off could be span() returns as it does today, but we have try_span() though I'm not honestly sure what value it provides 🤔 .
Maybe a default Span is the invalid span as @jtescher listed in the spec?

@davidbarsky
Copy link

(Might as well add some feedback)

If I'm reading that spec correctly, it seems to be designed around languages with pervasive nullability, where retuning a null value could take down the process. @cijothomas mentioned recently that the specification is generally flexible, and .NET has taken some liberties with their official implementation to adhere more closely to .NET's idioms. I don't see why Rust shouldn't follow that precedent, especially if the divergence from the specification is an improvement in UX. Put differently, I think adhering to the spirit of the the specification rather than letter is warranted.

@djc
Copy link
Contributor Author

djc commented Jun 2, 2023

If I'm reading that spec correctly, it seems to be designed around languages with pervasive nullability, where retuning a null value could take down the process. @cijothomas mentioned recently that the specification is generally flexible, and .NET has taken some liberties with their official implementation to adhere more closely to .NET's idioms. I don't see why Rust shouldn't follow that precedent, especially if the divergence from the specification is an improvement in UX. Put differently, I think adhering to the spirit of the the specification rather than letter is warranted.

I was reading some part of the OpenTelemetry spec/documentation earlier that mentioned something like implementations should focus on the conceptual capabilities more than the exact API prescriptions, but of course I can't find that bit now. I do feel like the way this is written doesn't really make sense for Rust -- and I feel like providing an idiomatic API does make things a lot easier.

If you look at the code changes here, I think a lot of changes make the code simpler to follow/more robust, and there's potentially other simplifications (like dropping the Option inside SynchronizedSpan, or some validity checks that are basically trying to recover the vailidity of the Span after checking whether .span() is Some).

(Perhaps we should propose a spec modification to explicitly allow this kind of leeway in the quoted language and see how that goes over with the editors?)

@TommyCpp
Copy link
Contributor

TommyCpp commented Jun 2, 2023

I feel like the reason why spec asks for a noop spans is to make sure the propgation works in absence of a SDK. So as long as we can make sure the following is true, we should return a Option fromspan() as it's more friendly for Rust users

This means that a SpanContext that has been provided by a configured Propagator will be propagated through to any child span and ultimately also Inject, but that no new SpanContexts will be created.

@cijothomas
Copy link
Member

I was reading some part of the OpenTelemetry spec/documentation earlier that mentioned something like implementations should focus on the conceptual capabilities more than the exact API prescriptions, but of course I can't find that bit now

It's right here (and in every language's contributing guide.): https://github.com/open-telemetry/opentelemetry-rust/blob/main/CONTRIBUTING.md#focus-on-capabilities-not-structure-compliance

@djc
Copy link
Contributor Author

djc commented Jun 9, 2023

@jtescher any further thoughts on this?

@jtescher
Copy link
Member

jtescher commented Jun 9, 2023

Hm so I think the spec provides default values in APIs like this mostly for convenience. E.g. as a user of this crate you don't need to check for None cases.

// can always call methods, values ignored by non recording spans
cx.span().add_event(my_event)

// would have to explicitly handle the none case, with map or if let some..
cx.span().map(|span| span.add_event(my_event))

A similar choice is made in tracing's Span::current which always returns a Span, with a disabled one instead of None.

I'm not opposed to deviating from the spec, especially where it makes things more erconomic or idiomatic, I'm just not sure that is the case here. Are there other crates or otel language impls that return None in this case?

@shaun-cox
Copy link
Contributor

// can always call methods, values ignored by non recording spans
cx.span().add_event(my_event)

// would have to explicitly handle the none case, with map or if let some..
cx.span().map(|span| span.add_event(my_event))

My concern for the first case above, and the current direction of returning no-op spans, is the performance implications. While it's nice that once can always call all span methods and the values, are ignored, it's definitely not nice that one still has to do all the work to create those ignored values in the first place. (And they are not exactly "cheap" values... one has to give up ownership of a Vec<KeyValue> to add_event.)

In my ideal world, I want a lazily evaluated lambda to be passed which only gets called back when the all the data it's going to produce is actually going to be consumed for a benefit and not just immediately dropped.

If one wants to add multiple attributes to a span (only if its recording) the current api would have us do:

cx.span().add_attribute(KeyValue::new(...));
cx.span().add_attribute(KeyValue::new(...));
...

Which is a real head-scratcher for me to understand in the context of the goals of instrumenting to be as close to zero-overhead as possible for the case when no sdk is being used or when 99.9% of the spans are not being sampled.

So, I strongly prefer the Option<Span> approach.

Another alternative is a visit_span method on TraceContextExt which takes an Fn<&mut Span> argument or something and only calls the callback if there is an active span...?

@jtescher
Copy link
Member

My concern for the first case above, and the current direction of returning no-op spans, is the performance implications. While it's nice that once can always call all span methods and the values, are ignored, it's definitely not nice that one still has to do all the work to create those ignored values in the first place. (And they are not exactly "cheap" values... one has to give up ownership of a Vec to add_event.)

This would be an argument for changing all of the APIs that return spans (and likely meter instruments too) and removing the noop spec impl. E.g. creating a new span would need to also return an option as the span may not be recording. Could consider doing that, but it would be a larger scope.

If one wants to add multiple attributes to a span (only if its recording) the current api would have us do:

The spec would have you do a check to Span::is_recording

let span = cx.span();
if span.is_recording() {
    let attrs = expensive_call(); 
    span.set_attributes(attrs);
}

Another alternative is a visit_span method on TraceContextExt which takes an Fn<&mut Span> argument or something and only calls the callback if there is an active span...?

I think that would likely be the most compliant way to accomplish what you're looking for as you'd want to both check that there is an active span and that the span is recording. Would look similar to the check above with a closure vs the current if statement to make it also work on tracer created spans and could have a similar method on the context ext if necessary.

let span = cx.span();
span.visit_recording(|span| {
    let attrs = expensive_call(); 
    span.set_attributes(attrs);
});

@djc
Copy link
Contributor Author

djc commented Jun 15, 2023

This would be an argument for changing all of the APIs that return spans (and likely meter instruments too) and removing the noop spec impl. E.g. creating a new span would need to also return an option as the span may not be recording. Could consider doing that, but it would be a larger scope.

I think I would be in favor of doing that across the board. It makes the notion of recording vs non-recording spans (etc) explicit in the type system, which I think is the idiomatic way to do things in Rust (which also makes it easier to build performant software, since the trade-offs are more obvious).

I think that would likely be the most compliant way to accomplish what you're looking for as you'd want to both check that there is an active span and that the span is recording. Would look similar to the check above with a closure vs the current if statement to make it also work on tracer created spans and could have a similar method on the context ext if necessary.

I think where possible, having a simple Option-based API is more ergonomic, and more clear, than relying on closures that are called conditionally.

@jtescher
Copy link
Member

I think where possible, having a simple Option-based API is more ergonomic, and more clear, than relying on closures that are called conditionally.

Agreed, but would still potentially need a way of distinguishing the concepts of an active vs recording. You can have all 3 possibilities: no span, active with recording, active without recording. So either this is returning None to only indicate that there is no active span, in which case the API becomes something along the lines of:

if let Some(recording_span) = cx.span().filter(|span| span.is_recording()) {
    let attrs = expensive_call(); 
    span.set_attributes(attrs);
}

Or you need a different way of propagating the current trace id as this method would no longer work:

// default instead of existing trace if span is not recording
let span_context = cx.span().map(|span| span.span_context()).unwrap_or_default() 

The current API requires the explicit check to Span::is_recording but that may still have a more consistent DX with fewer footguns.

@djc
Copy link
Contributor Author

djc commented Jun 15, 2023

Okay, fair point. So maybe we should yield a trivariant enum?

enum SpanState<'a> {
    Recording(&'a mut Span),
    Active(&'a Span),
    None,
}

@jtescher
Copy link
Member

Okay, fair point. So maybe we should yield a trivariant enum?

I think that would be the clearest way to represent it in the type system. But do we want end users of this crate to have to handle all types at each callsite?

Still seems like the current API has the simplest mental model exposed that allows for optional perf optimizations through checking span recording state.

@djc
Copy link
Contributor Author

djc commented Jun 26, 2023

I think that would be the clearest way to represent it in the type system. But do we want end users of this crate to have to handle all types at each callsite?

Still seems like the current API has the simplest mental model exposed that allows for optional perf optimizations through checking span recording state.

I mean, we can then add some helper methods on top of that type that make it easy to adapt to the caller's needs? Like fn active(&self) -> Option<&Span> and fn recording(&mut self) -> Option<&mut Span>.

I think Rust APIs usually optimize not for simplicity but for accurately modeling the underlying state. Yes, this can make things a bit more verbose, but also less error-prone and often more efficient. For such a core building block, it seems unfortunate to go the other way on this particular convention.

@shaun-cox
Copy link
Contributor

In the interest of moving this forward towards a decision, here's my opinion on an improvement.

  1. change TraceContextExt to have span() return Option<SpanRef<'_>>
  2. remove TraceContextExt has_active_span() function, and just have callers check span().is_some() instead.

To handle further checking if the span is recording, encourage:

    let span = cx.span().filter(is_recording);

I feel this is the idiomatic middle-ground that is still simple to use.

@djc
Copy link
Contributor Author

djc commented Jun 27, 2023

@shaun-cox just to clarify, how do you see that being different from the current changes in this PR? (I think it's pretty close?)

@shaun-cox
Copy link
Contributor

@shaun-cox just to clarify, how do you see that being different from the current changes in this PR? (I think it's pretty close?)

@djc, yep, no different than this PR. Just trying to vote for banking this progress rather than considering a bigger trivariant approach right now. Thanks!

@jtescher
Copy link
Member

I personally still think the existing APIs are a better choice as I believe others will find them simpler to understand and use, as well as having the same performance optimizations available that an Option or other type-level approach would have, without the downsides of being less idiomatic in terms of otel spec compliance and symmetry with other otel language implementations, and not matching existing rust ecosystem implementations like tracing, all of which return spans from these APIs rather than another type as far as I can see.

That being said if the other @open-telemetry/rust-approvers feel that this is the best way forward I won't block it. But if an alternate API is being considered it should be consistent with the Tracer and Meter traits or have similar capabilities as it would be extra confusing if they behaved differently.

@shaun-cox
Copy link
Contributor

I believe others will find them simpler to understand and use, as well as having the same performance optimizations available that an Option or other type-level approach would have

@jtescher , I'm really trying to see where the same performance optimizations would be available with the current API. Could you expand a bit on it? I'd characterize a lot of the perf wins I've recently been involved with as eliminating expensive .clone() on these no-op instances that otherwise wouldn't be there if those instances were held by an Option that was None instead. I admit to being biased in favor of Option, and it may be clouding my view into other optimizations that may be possible without it...

@jtescher
Copy link
Member

@jtescher , I'm really trying to see where the same performance optimizations would be available with the current API. Could you expand a bit on it? I'd characterize a lot of the perf wins I've recently been involved with as eliminating expensive .clone() on these no-op instances that otherwise wouldn't be there if those instances were held by an Option that was None instead. I admit to being biased in favor of Option, and it may be clouding my view into other optimizations that may be possible without it...

Which expensive clones on the no ops are you referring to? I was responding to the perf issues you were bringing up in #1089 (comment), e.g. being able to conditionally record fields if there is overhead when not recording.

@shaun-cox
Copy link
Contributor

I was referring to #1140, where avoiding cloning Context in order to invoke has_active_span() was a performance win, and how that is the same pattern here... i.e. avoid having span() return/clone a no-op Span for further manipulation.

So instead of this:

let span = cx.span();
if span.is_recording() {
    let attrs = expensive_call(); 
    span.set_attributes(attrs);
}

We could do something like this:

cx.span()
    .filter(is_recording)
    .map(|span| {
        let attrs = expensive_call(); 
        span.set_attributes(attrs);
    })
});

Now there is no no-op Span initialized and dropped... it's just getting an optional SpanRef and early exiting as it proceeds down the pipeline.

Putting it together with previous optimization to avoid obtaining a cloned Context too:

Context::map_current(TraceContextEx::span)
    .filter(is_recording)
    .map(|span| {
        let attrs = expensive_call(); 
        span.set_attributes(attrs);
    })
});

@jtescher
Copy link
Member

Hm not seeing any overhead of dropping the noop span when running benches vs an option.

benchmarks
group.bench_function(BenchmarkId::new("noop", p), |b| {
    b.iter(|| {
        black_box(Context::map_current(|cx| {
            let span = cx.span();
            drop(span)
        }))
    })
});

group.bench_function(BenchmarkId::new("option", p), |b| {
    b.iter(|| {
        black_box(Context::map_current(|cx| {
            if let Some(span) = cx.span_opt() {
                drop(span)
            }
        }))
    })
});

They look effectively the same in terms of active or not active perf

context/noop/no-active-span
                        time:   [3.9029 ns 3.9201 ns 3.9380 ns]
                        change: [-0.8448% +0.1502% +1.1875%] (p = 0.79 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe
context/option/no-active-span
                        time:   [3.8490 ns 3.8651 ns 3.8833 ns]
                        change: [-0.6115% +0.0975% +0.8096%] (p = 0.79 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

context/noop/with-active-span
                        time:   [6.8758 ns 6.9062 ns 6.9364 ns]
                        change: [-1.2357% -0.6886% -0.1323%] (p = 0.02 < 0.05)
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  6 (6.00%) high mild
context/option/with-active-span
                        time:   [6.9951 ns 7.0255 ns 7.0601 ns]
                        change: [-1.8125% -0.8761% +0.0237%] (p = 0.06 > 0.05)
                        No change in performance detected.

Using this commit for diff jtescher@beabfcd

Possibly something else was the source of the perf differences in #1140?

@shaun-cox
Copy link
Contributor

Possibly something else was the source of the perf differences in #1140?

Sorry, I didn't mean to confuse that it was just no-op Span clone elimination that would explain all of the performance gains, just that an example of using Option-like optimizations ala #1140 rather than returning cloned temporaries did result in performance gains... similar to how this PR has also measured performance gains (up to 8% in some scenarios glancing at the top of this thread) by having span() return an Option rather than a no-op instance.

My original question a few replies up was trying to gain insight into how to achieve these performance optimizations without using Option, as you alluded that you think they are still possible?

@jtescher
Copy link
Member

I can't reproduce the 8% number above. When I compare the start-end-span benchmarks against main I actually see worse performance on this branch, so something else may have changed in the mean time.

@hdost
Copy link
Contributor

hdost commented Nov 24, 2023

Perhaps we could talk about this at the next SIG if @jtescher can make it? Since we've been talking about performance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants