tracing: reorganize benchmarks for comparability #2178

hawkw · 2022-06-24T20:00:29Z

Currently, tracing's benchmark suite benchmarks the same behaviors
(e.g. creating a span, recording an event, etc) across a handful of
cases: with no default dispatcher, with a global default, and with a
scoped (thread-local) default. We use criterion's benchmark_group to
represent each kind of dispatcher, and bench_function for each
behavior being measured.

This is actually kind of backwards relative to how Criterion is
supposed to be used. bench_function is intended for comparing
different implementations of the same thing, with the
benchmark_group representing what's being compared. If we inverted the
structure of these benchmarks, Criterion would give us nicer plots that
would allow comparing the performance of each dispatch type on the same
task.

This PR reorganizes the benchmarks so that each behavior being tested
(such as entering a span or recording an event) is a benchmark_group,
and each dispatch type (none, global, or scoped) is a bench_function
within that group. Now, Criterion will generate plots for each group
comparing the performance of each dispatch type in that benchmark.

For example, we now get nice comparisons like this:

Unfortunately, this required splitting each benchmark type out into its
own file. This is because, once we set the global default dispatcher
within one benchmark group, it would remain set for the entire lifetime
of the process --- there would be no way to test something else with no
global default. But, I think this is fine, even though it means we now
have a bunch of tiny files: it also allows us to run an individual
benchmark against every combination of dispatch types, without having to
run unrelated benches. This is potentially useful if (for example)
someone is changing only the code for recording events, and not spans.

Currently, `tracing`'s benchmark suite benchmarks the same behaviors (e.g. creating a span, recording an event, etc) across a handful of cases: with no default dispatcher, with a global default, and with a scoped (thread-local) default. We use criterion's `benchmark_group` to represent each kind of dispatcher, and `bench_function` for each behavior being measured. This is actually kind of backwards relative to how Criterion is _supposed_ to be used. `bench_function` is intended for comparing different implementations of the *same* thing, with the `benchmark_group` representing what's being compared. If we inverted the structure of these benchmarks, Criterion would give us nicer plots that would allow comparing the performance of each dispatch type on the same task. This PR reorganizes the benchmarks so that each behavior being tested (such as entering a span or recording an event) is a `benchmark_group`, and each dispatch type (none, global, or scoped) is a `bench_function` within that group. Now, Criterion will generate plots for each group comparing the performance of each dispatch type in that benchmark. Unfortunately, this required splitting each benchmark type out into its own file. This is because, once we set the global default dispatcher within one benchmark group, it would remain set for the entire lifetime of the process --- there would be no way to test something else with no global default. But, I think this is fine, even though it means we now have a bunch of tiny files: it also allows us to run an individual benchmark against every combination of dispatch types, without having to run unrelated benches. This is potentially useful if (for example) someone is changing only the code for recording events, and not spans.

Currently, `tracing`'s benchmark suite benchmarks the same behaviors (e.g. creating a span, recording an event, etc) across a handful of cases: with no default dispatcher, with a global default, and with a scoped (thread-local) default. We use criterion's `benchmark_group` to represent each kind of dispatcher, and `bench_function` for each behavior being measured. This is actually kind of backwards relative to how Criterion is _supposed_ to be used. `bench_function` is intended for comparing different implementations of the *same* thing, with the `benchmark_group` representing what's being compared. If we inverted the structure of these benchmarks, Criterion would give us nicer plots that would allow comparing the performance of each dispatch type on the same task. This PR reorganizes the benchmarks so that each behavior being tested (such as entering a span or recording an event) is a `benchmark_group`, and each dispatch type (none, global, or scoped) is a `bench_function` within that group. Now, Criterion will generate plots for each group comparing the performance of each dispatch type in that benchmark. For example, we now get nice comparisons like this: ![image](https://user-images.githubusercontent.com/2796466/175659314-835664ac-a8cf-4b07-91ee-f8cfee77bfbb.png) Unfortunately, this required splitting each benchmark type out into its own file. This is because, once we set the global default dispatcher within one benchmark group, it would remain set for the entire lifetime of the process --- there would be no way to test something else with no global default. But, I think this is fine, even though it means we now have a bunch of tiny files: it also allows us to run an individual benchmark against every combination of dispatch types, without having to run unrelated benches. This is potentially useful if (for example) someone is changing only the code for recording events, and not spans.

hawkw requested review from davidbarsky and a team as code owners June 24, 2022 20:00

Merge branch 'master' into eliza/benchmark-consistency

46f57c9

hawkw merged commit 7008df9 into master Jun 24, 2022

hawkw deleted the eliza/benchmark-consistency branch June 24, 2022 20:30

davidbarsky mentioned this pull request Sep 26, 2023

chore: backport roughly a year's worth of changes #2728

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

tracing: reorganize benchmarks for comparability #2178

tracing: reorganize benchmarks for comparability #2178

hawkw commented Jun 24, 2022 •

edited

tracing: reorganize benchmarks for comparability #2178

tracing: reorganize benchmarks for comparability #2178

Conversation

hawkw commented Jun 24, 2022 • edited

hawkw commented Jun 24, 2022 •

edited