tracing: reduce memory pressure throughout

This commit attempts to reduce the memory overhead of tracing by doing a
few things, guided mostly by BenchmarkTracing and kv95 (see results
below). In decreasing order of impact:

- We no longer materialize recordings when the trace is "empty" or
  non-verbose. When traces straddle RPC boundaries, we serialize the
  recording and send it over the wire. For "empty" traces (ones with no
  structured recordings or log messages), this is needlessly
  allocation-heavy: we end up shipping only the trace skeleton
  (parent-child span relationships, and the operation names + metadata
  that identify each span). Not only does this allocate within
  pkg/util/tracing, it also incurs allocations within gRPC, where all
  this serialization takes place. This commit takes the (opinionated)
  stance that we can avoid materializing these recordings altogether.
  We do this by having each span hold onto a reference to the root
  span, and updating an atomic value on the root span whenever anything
  is recorded. When materializing the recording, from any span in the
  tree, we can consult the root span to see if the trace was non-empty,
  and allocate only if so.
- We pre-allocate a small list of child pointer slots for spans to
  refer to, instead of allocating every time a child span is created.
- Span tags aren't rendered unless the span is verbose, but we were
  still collecting them previously in order to render them if verbosity
  was toggled. Toggling an active span's verbosity rarely ever happens,
  so now we don't allocate for tags unless the span is already verbose.
  This also lets us avoid the WithTags option, which allocates even if
  the span will end up dropping tags.
- We were previously allocating SpanMeta, despite the objects being
  extremely short-lived in practice (they existed primarily to pass on
  the parent span's metadata to the newly created child spans). We now
  pass SpanMetas by value.
- We can make use of explicit carriers when extracting SpanMeta from
  remote spans, instead of accessing data through the Carrier
  interface. This helps reduce SpanMeta allocations, at the cost of
  duplicating some code.
- metadata.MD, used to carry span metadata across gRPC, is also
  relatively short-lived (existing only for the duration of the RPC).
  Its API is also relatively allocation-heavy (improved with
  grpc/grpc-go#4360), allocating for every key being written. Tracing
  has a very specific usage pattern (adding to the metadata.MD only the
  trace and span ID), so we can pool our allocations here.
- We can slightly reduce lock contention around the tracing registry by
  locking only when we're dealing with root spans. We also compute the
  span duration outside the critical section.

---

Before this PR, comparing traced scans vs. not:

    name                                 old time/op    new time/op    delta
    Tracing/Cockroach/Scan1-24              403µs ± 3%     415µs ± 1%   +2.82%  (p=0.000 n=10+9)
    Tracing/MultinodeCockroach/Scan1-24     878µs ± 1%     975µs ± 6%  +11.07%  (p=0.000 n=10+10)

    name                                 old alloc/op   new alloc/op   delta
    Tracing/Cockroach/Scan1-24             23.2kB ± 2%    29.8kB ± 2%  +28.64%  (p=0.000 n=10+10)
    Tracing/MultinodeCockroach/Scan1-24    76.5kB ± 5%    95.1kB ± 1%  +24.31%  (p=0.000 n=10+10)

    name                                 old allocs/op  new allocs/op  delta
    Tracing/Cockroach/Scan1-24                235 ± 2%       258 ± 1%   +9.50%  (p=0.000 n=10+9)
    Tracing/MultinodeCockroach/Scan1-24       760 ± 1%       891 ± 1%  +17.20%  (p=0.000 n=9+10)

After this PR:

    name                                 old time/op    new time/op    delta
    Tracing/Cockroach/Scan1-24              437µs ± 4%     443µs ± 3%     ~     (p=0.315 n=10+10)
    Tracing/MultinodeCockroach/Scan1-24     925µs ± 2%     968µs ± 1%   +4.62%  (p=0.000 n=10+10)

    name                                 old alloc/op   new alloc/op   delta
    Tracing/Cockroach/Scan1-24             23.3kB ± 3%    26.3kB ± 2%  +13.24%  (p=0.000 n=10+10)
    Tracing/MultinodeCockroach/Scan1-24    77.0kB ± 4%    81.7kB ± 3%   +6.08%  (p=0.000 n=10+10)

    name                                 old allocs/op  new allocs/op  delta
    Tracing/Cockroach/Scan1-24                236 ± 1%       241 ± 1%   +2.45%  (p=0.000 n=9+9)
    Tracing/MultinodeCockroach/Scan1-24       758 ± 1%       793 ± 2%   +4.67%  (p=0.000 n=10+10)

---

kv95/enc=false/nodes=1/cpu=32 across a few runs also shows a modest
improvement before and after this PR, using a sampling rate of 10%:

    36929.02 v. 37415.52 (reads/s)   +1.30%
     1944.38 v.  1968.94 (writes/s)  +1.24%

Release note: None
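
Illustrative sketch (not the code from this commit): the first two
items in the list, the root-span "anything recorded?" flag and the
pre-allocated child slots, might look roughly like the following. The
span type, its fields, and the slot count of 4 are all hypothetical
stand-ins, and the sketch omits the locking a real span would need
around its children and logs.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// span is a hypothetical, stripped-down stand-in for a tracing span.
type span struct {
	op       string
	root     *span // every span keeps a reference to the root of its trace
	recorded int32 // set atomically; only consulted on the root span
	logs     []string

	children      []*span
	childrenAlloc [4]*span // inline slots backing the first few children
}

func newRoot(op string) *span {
	s := &span{op: op}
	s.root = s
	s.children = s.childrenAlloc[:0] // reuse the inline slots
	return s
}

func (s *span) child(op string) *span {
	c := &span{op: op, root: s.root}
	c.children = c.childrenAlloc[:0]
	// No slice growth until the inline slots are exhausted.
	s.children = append(s.children, c)
	return c
}

// record stores a message and marks the whole trace as non-empty.
func (s *span) record(msg string) {
	s.logs = append(s.logs, msg)
	atomic.StoreInt32(&s.root.recorded, 1)
}

// recording flattens the subtree rooted at s. If nothing was recorded
// anywhere in the trace, it returns nil without building anything.
func (s *span) recording() []string {
	if atomic.LoadInt32(&s.root.recorded) == 0 {
		return nil
	}
	out := []string{s.op}
	for _, m := range s.logs {
		out = append(out, s.op+": "+m)
	}
	for _, c := range s.children {
		out = append(out, c.recording()...)
	}
	return out
}

func main() {
	root := newRoot("txn")
	kid := root.child("scan")
	fmt.Println(kid.recording() == nil) // empty trace: nothing materialized
	kid.record("read 3 keys")
	fmt.Println(root.recording())
}
```

The flag lives only on the root, so any span in the tree can answer
"is this trace empty?" with a single atomic load.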
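
Illustrative sketch (not the code from this commit): pooling the
short-lived gRPC metadata could be done along these lines with
sync.Pool. To stay self-contained, the md type below only mirrors the
shape of metadata.MD (a map from key to a slice of values); the key
names and the getMD/putMD helpers are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// md mirrors the shape of gRPC's metadata.MD: map[string][]string.
type md map[string][]string

// Hypothetical key names, chosen for illustration only.
const (
	traceIDKey = "trace-id"
	spanIDKey  = "span-id"
)

// mdPool hands out maps whose two value slices are already allocated,
// so filling one in does not allocate per key.
var mdPool = sync.Pool{
	New: func() interface{} {
		return md{
			traceIDKey: make([]string, 1),
			spanIDKey:  make([]string, 1),
		}
	},
}

// getMD returns a pooled map populated with the given IDs.
func getMD(traceID, spanID string) md {
	m := mdPool.Get().(md)
	m[traceIDKey][0] = traceID
	m[spanIDKey][0] = spanID
	return m
}

// putMD clears the map and returns it to the pool. The caller must not
// use m afterwards, which fits metadata that only lives for one RPC.
func putMD(m md) {
	m[traceIDKey][0] = ""
	m[spanIDKey][0] = ""
	mdPool.Put(m)
}

func main() {
	m := getMD("1234", "5678")
	fmt.Println(m[traceIDKey][0], m[spanIDKey][0])
	putMD(m)
}
```

This only works because the set of keys is fixed and tiny; a general
metadata map with arbitrary keys would not pool as cleanly.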