tracing: reduce memory pressure throughout #64233
Conversation
+cc @abarganier, @dhartunian, @rimadeodhar.
Sorry this took so long! The improvements look reasonable, but I'm worried that the pooling is racy. I also think this should get a close review by the folks on observability-infra, to make sure that everything is documented well. You'll know better who to tag.
Reviewed 13 of 13 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif)
pkg/sql/sem/builtins/builtins.go, line 3736 at r1 (raw file):
var rootSpan *tracing.Span
if err := ctx.Settings.Tracer.VisitSpans(func(span *tracing.Span) error {
	if span.TraceID() == traceID || rootSpan != nil {
🤔 don't you want `span.TraceID() == traceID && rootSpan == nil`?
pkg/util/tracing/crdbspan.go, line 135 at r1 (raw file):
}

func (c *childSpanRefs) list() []*crdbSpan {
This allocs for no good reason, just pointing it out in case this is in a hot path.
pkg/util/tracing/grpc_interceptor.go, line 244 at r1 (raw file):
}

func getPooledMetadata() (md metadata.MD, putBack func(md metadata.MD)) {
I'm seeing that we are basically using this as follows
md, putBack := getPooledMetadata()
mutate(md)
ctx = context.WithValue(ctx, raw{md:md})
cs := newClientStream(ctx)
putBack()
return cs
so... data race? We're putting `md` into `ctx`, then returning it to the pool, but there's no reason to assume that it wouldn't be read from the `ctx` later.
pkg/util/tracing/tracer.go, line 410 at r1 (raw file):
	helper.crdbSpan.rootSpan = rootSpan
} else {
	helper.crdbSpan.traceEmpty.Store(true)
Does this allocate on every call? I assume it doesn't have to, but not sure it's smart enough to figure it out. Curious why you didn't use an atomic which would also keep `crdbSpan` smaller.
pkg/util/tracing/tracer.go, line 790 at r1 (raw file):
// differing only in the explicit carrier type. We do this to avoid allocating
// on the heap when iterating over kvs through the Carrier interface.
func (t *Tracer) extractMetaFromMapCarrier(carrier MapCarrier) (SpanMeta, error) {
Rather than duplicating this, can we remove MapCarrier? I think it's only used in two places, and these don't care about the particular type of carrier and could use a metadata carrier instead. If we have duplication like this we need testing that it doesn't diverge, and that will hurt given the amount of code duplicated here.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/sql/sem/builtins/builtins.go, line 3736 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
🤔 don't you want `span.TraceID() == traceID && rootSpan == nil`?
ha, wow, I would've failed this interview. Done.
pkg/util/tracing/crdbspan.go, line 135 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This allocs for no good reason, just pointing it out in case this is in a hot path.
Done.
pkg/util/tracing/grpc_interceptor.go, line 244 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I'm seeing that we are basically using this as follows
md, putBack := getPooledMetadata()
mutate(md)
ctx = context.WithValue(ctx, raw{md:md})
cs := newClientStream(ctx)
putBack()
return cs
so... data race? We're putting `md` into `ctx`, then returning it to the pool, but there's no reason to assume that it wouldn't be read from the `ctx` later.
Good catch. It's not applicable here, but it still should be for unary RPCs, which I think is most things. Removed.
pkg/util/tracing/tracer.go, line 410 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Does this allocate on every call? I assume it doesn't have to, but not sure it's smart enough to figure it out. Curious why you didn't use an atomic which would also keep `crdbSpan` smaller.
You're right, it doesn't. Changed it to use an atomic int instead (I thought they'd be equally performant, alloc-wise, but I see they aren't).
pkg/util/tracing/tracer.go, line 790 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Rather than duplicating this, can we remove MapCarrier? I think it's only used in two places, and these don't care about the particular type of carrier and could use a metadata carrier instead. If we have duplication like this we need testing that it doesn't diverge, and that will hurt given the amount of code duplicated here.
Don't think we can. The MapCarrier (map[string]string) representation is serialized over the wire. On the other end of the RPC, we could convert this representation into a metadataCarrier (map[string][]string), but doing so would incur allocations which we're trying to avoid. To start using the metadataCarrier upstream of raft, for backwards compatibility reasons we'd have to do it over multiple releases, keeping this duplicated code around anyway.
Actually, found a way to do it without duplicating all the code. PTAL.
Reviewed 4 of 4 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @irfansharif)
pkg/util/tracing/grpc_interceptor.go, line 305 at r2 (raw file):
Seeing as how invokeCtx's scope extends as far as our use of the metadata
what does this mean? If I had to explain why this is safe, I would say
This context is only used on the sending path of gRPC, and we trust that gRPC (and any other client interceptors stacked under us) won't access the embedded `md` object after it has returned.
But ultimately, this isn't really inherently safe. Is this a big part of the perf wins? I think we should back this out if it isn't.
PR message needs an update. I'll leave the actual review to @abarganier in the interest of building ownership muscle 💪🏽
(bump, I'll get to the rest of the proposed changes here after soliciting another round of reviews from y'all)
I'll leave the actual review to @abarganier in the interest of building ownership muscle 💪🏽
Thanks @tbg! And sorry for lagging on this @irfansharif - I'll be sure to chime in by EOD tomorrow
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @irfansharif, and @rimadeodhar)
Overall, LGTM. My apologies for asking some fairly basic questions, but I needed to build some context on this code.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @irfansharif)
pkg/util/tracing/crdbspan.go, line 109 at r2 (raw file):
}

func (c *childSpanRefs) len() int {
Nit: The len and add methods could be made simpler by adding a variable to childSpanRefs which stores the current spanLen. So the add method would look something like this:
func (c *childSpanRefs) add(ref *crdbSpan) {
	if c.spanLen < len(c.preAllocated) {
		c.preAllocated[c.spanLen] = ref
	} else {
		c.overflow = append(c.overflow, ref)
	}
	c.spanLen++
}
and len() simply returns spanLen.
However, given that tracing library is more memory constrained than cpu constrained, it might be okay to leave these methods as is in favor of not adding an additional variable.
pkg/util/tracing/grpc_interceptor.go, line 144 at r2 (raw file):
serverSpan.SetTag(gRPCComponentTag.Key, gRPCComponentTag.Value)
serverSpan.SetTag(ext.SpanKindRPCServer.Key, ext.SpanKindRPCServer.Value)
Why have we moved these out of the previous calls?
pkg/util/tracing/grpc_interceptor.go, line 254 at r2 (raw file):
Quoted 4 lines of code…
if md.Len() == 2 && len(md[fieldNameTraceID]) == 1 && len(md[fieldNameSpanID]) == 1 {
	md[fieldNameTraceID][0] = "0"
	md[fieldNameSpanID][0] = "0"
	grpcMetadataPool.Put(md)
I don't fully understand what this is doing. Why do we only add it to the pool if these conditions are met?
Apologies for the delay on this. Overall LGTM as well, piggybacking onto one of @rimadeodhar's questions and a convenience nit/curiosity question.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @irfansharif, and @rimadeodhar)
pkg/util/tracing/crdbspan.go, line 511 at r2 (raw file):
Quoted 8 lines of code…
var children []*crdbSpan
for i := 0; i < s.mu.recording.children.len(); i++ {
	children = append(children, s.mu.recording.children.get(i))
}
...
for _, child := range children {
	child.setVerboseRecursively(to)
}
This is more of a nit/curiosity question than anything else. Shallow copying children into a slice to then perform some operation on each child span appears to be somewhat common. Would a convenience method for applying some operation to each child make sense?
// Only requires a single pass over all of s.mu.recording.children.
func (c *childSpanRefs) apply(fn func(*crdbSpan)) {
	// Apply the provided fn first to preAllocated, then to overflow.
}
Then you could do...
s.mu.recording.children.apply(func(child *crdbSpan) {
child.setVerboseRecursively(to)
})
I know we're memory constrained here, so I'm also curious how something like this would compare memory-wise to the shallow copy.
pkg/util/tracing/grpc_interceptor.go, line 254 at r2 (raw file):
Previously, rimadeodhar wrote…
if md.Len() == 2 && len(md[fieldNameTraceID]) == 1 && len(md[fieldNameSpanID]) == 1 {
	md[fieldNameTraceID][0] = "0"
	md[fieldNameSpanID][0] = "0"
	grpcMetadataPool.Put(md)
I don't fully understand what this is doing. Why do we only add it to the pool if these conditions are met?
+1, also wondering about this
This commit attempts to reduce the memory overhead of tracing by doing a few things, guided mostly by BenchmarkTracing and kv95 (see results below). In decreasing order of impact:
- We no longer materialize recordings when the trace is "empty" or non-verbose. When traces straddle RPC boundaries, we serialize the recording to send it over the wire. For "empty" traces (ones with no structured recordings or log messages), this is needlessly allocation-heavy. We end up shipping the trace skeleton (parent-child span relationships, and the operation names + metadata that identify each span). Not only does this allocate within pkg/util/tracing, it also incurs allocations within gRPC where all this serialization is taking place. This commit takes the (opinionated) stance that we can avoid materializing these recordings altogether. We do this by having each span hold onto a reference to the root span, and updating an atomic value on the root span whenever anything is recorded. When materializing the recording, from any span in the tree, we can consult the root span to see if the trace was non-empty, and allocate if so.
- We pre-allocate a small list of child pointer slots for spans to refer to, instead of allocating every time a child span is created.
- Span tags aren't rendered unless the span is verbose, but we were still collecting them previously in order to render them if verbosity was toggled. Toggling an active span's verbosity rarely ever happens, so now we don't allocate for tags unless the span is already verbose. This also lets us avoid the WithTags option, which allocates even if the span will end up dropping tags.
- We were previously allocating SpanMeta, despite the objects being extremely short-lived in practice (they existed primarily to pass on the parent span's metadata to the newly created child spans). We now pass SpanMetas by value.
- When extracting SpanMetas for remote spans, through Carriers, instead of doing it through an interface (which incurs allocations), we can use explicit types.
- We can slightly improve lock contention around the tracing registry by locking only when we're dealing with root spans. We can also avoid computing the span duration outside the critical area.

Before this PR, comparing traced scans vs. not:

name                                 old time/op    new time/op    delta
Tracing/Cockroach/Scan1-24            403µs ± 3%     415µs ± 1%    +2.82%  (p=0.000 n=10+9)
Tracing/MultinodeCockroach/Scan1-24   878µs ± 1%     975µs ± 6%   +11.07%  (p=0.000 n=10+10)

name                                 old alloc/op   new alloc/op   delta
Tracing/Cockroach/Scan1-24           23.2kB ± 2%    29.8kB ± 2%   +28.64%  (p=0.000 n=10+10)
Tracing/MultinodeCockroach/Scan1-24  76.5kB ± 5%    95.1kB ± 1%   +24.31%  (p=0.000 n=10+10)

name                                 old allocs/op  new allocs/op  delta
Tracing/Cockroach/Scan1-24              235 ± 2%       258 ± 1%    +9.50%  (p=0.000 n=10+9)
Tracing/MultinodeCockroach/Scan1-24     760 ± 1%       891 ± 1%   +17.20%  (p=0.000 n=9+10)

After this PR:

name                                 old time/op    new time/op    delta
Tracing/Cockroach/Scan1-24            437µs ± 4%     443µs ± 3%      ~     (p=0.315 n=10+10)
Tracing/MultinodeCockroach/Scan1-24   925µs ± 2%     968µs ± 1%    +4.62%  (p=0.000 n=10+10)

name                                 old alloc/op   new alloc/op   delta
Tracing/Cockroach/Scan1-24           23.3kB ± 3%    26.3kB ± 2%   +13.24%  (p=0.000 n=10+10)
Tracing/MultinodeCockroach/Scan1-24  77.0kB ± 4%    81.7kB ± 3%    +6.08%  (p=0.000 n=10+10)

name                                 old allocs/op  new allocs/op  delta
Tracing/Cockroach/Scan1-24              236 ± 1%       241 ± 1%    +2.45%  (p=0.000 n=9+9)
Tracing/MultinodeCockroach/Scan1-24     758 ± 1%       793 ± 2%    +4.67%  (p=0.000 n=10+10)

kv95/enc=false/nodes=1/cpu=32 across a few runs also shows a modest improvement before and after this PR, using a sampling rate of 10%:

36929.02 v. 37415.52 (reads/s) +1.30%
1944.38 v. 1968.94 (writes/s) +1.24%

Release note: None
TFTRs! I've removed the metadata allocations, so it should be less controversial. I won't immediately go bump our gRPC dep to pick up grpc/grpc-go#4360 (I assume we will at some point), but that should help size down the allocation profiles for tracing as well.
Can a brother get a green stamp?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @dhartunian, @rimadeodhar, and @tbg)
pkg/util/tracing/crdbspan.go, line 109 at r2 (raw file):
Previously, rimadeodhar wrote…
Nit: The len and add methods could be made simpler by adding a variable to childSpanRefs which stores the current spanLen. So the add method would look something like this:
func (c *childSpanRefs) add(ref *crdbSpan) {
	if c.spanLen < len(c.preAllocated) {
		c.preAllocated[c.spanLen] = ref
	} else {
		c.overflow = append(c.overflow, ref)
	}
	c.spanLen++
}
and len() simply returns spanLen.
However, given that the tracing library is more memory constrained than cpu constrained, it might be okay to leave these methods as is in favor of not adding an additional variable.
Good idea, done. The pre-allocated array here only helps reduce the total number of individual allocations, by allocating 4 pointers all at once. The actual size of this struct matters less, so adding a refcount is more than fine. Thanks!
pkg/util/tracing/crdbspan.go, line 511 at r2 (raw file):
Previously, abarganier (Alex Barganier) wrote…
var children []*crdbSpan
for i := 0; i < s.mu.recording.children.len(); i++ {
	children = append(children, s.mu.recording.children.get(i))
}
...
for _, child := range children {
	child.setVerboseRecursively(to)
}
This is more of a nit/curiosity question than anything else. Shallow copying children into a slice to then perform some operation on each child span appears to be somewhat common. Would a convenience method for applying some operation to each child make sense?
// Only requires a single pass of the entirety of s.mu.recording.children.len()
func (c *childSpanRefs) apply(func(c *crdbSpan)) {
	// Apply provided fn first to preAllocated, then overflow
}
Then you could do...
s.mu.recording.children.apply(func(child *crdbSpan) {
	child.setVerboseRecursively(to)
})
I know we're memory constrained here, so I'm also curious how something like this would compare memory-wise to the shallow copy.
In this case it wouldn't have allocated, cause we're simply calling setVerboseRecursively on each element, but the general case, where we collect responses from each element, would allocate on the heap, as Go needs to ensure longevity across closure lifetimes. I think for that reason this pattern isn't a super familiar one in Go (also, in case you haven't noticed, we revel in the verbosity). I do agree it makes things cleaner though.
Example where we're collecting results per child:
for _, child := range children {
result = append(result, child.getRecording(everyoneIsV211, wantTags)...)
}
pkg/util/tracing/grpc_interceptor.go, line 144 at r2 (raw file):
Previously, rimadeodhar wrote…
serverSpan.SetTag(gRPCComponentTag.Key, gRPCComponentTag.Value)
serverSpan.SetTag(ext.SpanKindRPCServer.Key, ext.SpanKindRPCServer.Value)
Why have we moved these out of the previous calls?
The WithTags option allocates even if the underlying span is going to end up being non-verbose, and will drop the tags anyway. The SetTag API on the other hand can avoid allocating cause it has visibility into the fully constructed span, and can see whether or not it's verbose.
pkg/util/tracing/grpc_interceptor.go, line 254 at r2 (raw file):
Previously, abarganier (Alex Barganier) wrote…
+1, also wondering about this
This has gone away cause I've since given up on trying to pool metadata objects, but the idea here was to only pool metadata.MD objects that pkg/tracing mutated. It's possible that the caller into package tracing had also embedded metadata of its own into the ctx. It'd be erroneous to pool those objects and possibly re-use that metadata for subsequent requests that didn't intend to specify it.
pkg/util/tracing/grpc_interceptor.go, line 305 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Seeing as how invokeCtx's scope extends as far as our use of the metadata
what does this mean? If I had to explain why this is safe, I would say
This context is only used on the sending path of gRPC, and we trust that gRPC (and any other client interceptors stacked under us) won't access the embedded `md` object after it has returned.
But ultimately, this isn't really inherently safe. Is this a big part of the perf wins? I think we should back this out if it isn't.
Done. It isn't a big part, though not super small either. Seeing as how collapsing empty recordings away gave us most of the wins here, I've removed it.
Reviewed 9 of 13 files at r1, 2 of 4 files at r2, 2 of 2 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @dhartunian, and @rimadeodhar)
Woot! bors r=abarganier
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @dhartunian, and @irfansharif)
pkg/util/tracing/grpc_interceptor.go, line 144 at r2 (raw file):
Previously, irfansharif (irfan sharif) wrote…
The WithTags option allocates even if the underlying span is going to end up being non-verbose, and will drop the tags anyway. The SetTag API on the other hand can avoid allocating cause it has visibility into the fully constructed span, and can see whether or not it's verbose.
Aah I see, thanks for the explanation!
pkg/util/tracing/grpc_interceptor.go, line 254 at r2 (raw file):
Previously, irfansharif (irfan sharif) wrote…
This has gone away cause I've since given up on trying to pool metadata objects, but the idea here was to only pool metadata.MD objects that pkg/tracing mutated. It's possible that the caller into package tracing had also embedded into ctx metadata of its own. It'd be erroneous to pool those objects, and possibly re-use that metadata for subsequent requests that didn't intend to specify that metadata.
Thanks! That makes sense.
pkg/util/tracing/tracer.go, line 598 at r3 (raw file):
}

// var noopSpanMeta = (*SpanMeta)(nil)
Can we delete this comment?
Build succeeded.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/sql/sem/builtins/builtins.go, line 3736 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
ha, wow, I would've failed this interview. Done.
Is this code correct? What makes this truly return the root span and not just some random span from the trace - the first in a random iteration order?
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/sql/sem/builtins/builtins.go, line 3736 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Is this code correct? What makes this truly return the root span and not just some random span from the trace - the first in a random iteration order?
The map it's iterating over only contains root spans:
cockroach/pkg/util/tracing/tracer.go, lines 166 to 175 in 1fc4fc1:

// activeSpans is a map that references all non-Finish'ed local root spans,
// i.e. those for which no WithLocalParent(<non-nil>) option was supplied.
// The map is keyed on the span ID, which is deterministically unique.
//
// In normal operation, a local root Span is inserted on creation and
// removed on .Finish().
//
// The map can be introspected by `Tracer.VisitSpans`. A Span can also be
// retrieved from its ID by `Tracer.GetActiveSpanFromID`.
activeSpans struct {
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/sql/sem/builtins/builtins.go, line 3736 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
The map it's iterating over only contains root spans:
cockroach/pkg/util/tracing/tracer.go, lines 166 to 175 in 1fc4fc1:

// activeSpans is a map that references all non-Finish'ed local root spans,
// i.e. those for which no WithLocalParent(<non-nil>) option was supplied.
// The map is keyed on the span ID, which is deterministically unique.
//
// In normal operation, a local root Span is inserted on creation and
// removed on .Finish().
//
// The map can be introspected by `Tracer.VisitSpans`. A Span can also be
// retrieved from its ID by `Tracer.GetActiveSpanFromID`.
activeSpans struct {
oh doh I forgot cause I'm in the middle of mucking with that.
But then is the `rootSpan != nil` condition necessary?
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/sql/sem/builtins/builtins.go, line 3736 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
oh doh I forgot cause I'm in the middle of mucking with that.
But then is the `rootSpan != nil` condition necessary?
Hm, no. We probably want to return an `iterutil.StopIteration()` when we find the first (and only) matching root span.
This commit attempts to reduce the memory overhead of tracing by doing a
few things, guided mostly by BenchmarkTracing and kv95 (see results
below). In decreasing order of impact:
- We no longer materialize recordings when the trace is "empty" or
non-verbose. When traces straddle RPC boundaries, we serialize the
recording to send it over the wire. For "empty" traces (ones with
no structured recordings or log messages), this is needlessly
allocation-heavy. We end up shipping the trace skeleton (parent-child
span relationships, and the operation names + metadata that identify
each span). Not only does this allocate within pkg/util/tracing, it
also incurs allocations within gRPC where all this serialization is
taking place. This commit takes the (opinionated) stance that we can
avoid materializing these recordings altogether.
We do this by having each span hold onto a reference to the rootspan,
and updating an atomic value on the rootspan whenever anything is
recorded. When materializing the recording, from any span in the tree,
we can consult the root span to see if the trace was non-empty, and
allocate if so.
- We pre-allocate a small list of child pointer slots for spans to refer
to, instead of allocating every time a child span is created.
- Span tags aren't rendered unless the span is verbose, but we were
still collecting them previously in order to render them if verbosity
was toggled. Toggling an active span's verbosity rarely ever happens,
so now we don't allocate for tags unless the span is already verbose.
This also lets us avoid the WithTags option, which allocates even if
the span will end up dropping tags.
- We were previously allocating SpanMeta, despite the objects being
extremely short-lived in practice (they existed primarily to pass on
the parent span's metadata to the newly created child spans). We now
pass SpanMetas by value.
- We can make use of explicit carriers when extracting SpanMeta from
remote spans, instead of accessing data through the Carrier interface.
This helps reduce SpanMeta allocations, at the cost of duplicating some
code.
- metadata.MD, used to carry span metadata across gRPC, is also
relatively short-lived (existing only for the duration of the RPC).
Its API is also relatively allocation-heavy (improved with
grpc/grpc-go#4360, "metadata: reduce memory footprint in
FromOutgoingContext"), allocating for every key
being written. Tracing has a very specific usage pattern (adding to
the metadata.MD only the trace and span ID), so we can pool our
allocations here.
- We can slightly improve lock contention around the tracing registry by
locking only when we're dealing with rootspans. We can also avoid
computing the span duration outside the critical area.
Before this PR, comparing traced scans vs. not:
After this PR:
kv95/enc=false/nodes=1/cpu=32 across a few runs also shows a modest
improvement before and after this PR, using a sampling rate of 10%:

36929.02 v. 37415.52 (reads/s) +1.30%
1944.38 v. 1968.94 (writes/s) +1.24%
Release note: None