While using the label `run-benchmarks` for #2873 and #2875, I noticed that the `test_bulk_insert[vanilla]` test case was producing some interesting numbers.

Similarly, for `test_copy[vanilla].copy_extend` I found (for the same 4 points):
- 12.880s
- 12.168s
- 11.785s
- 4.82s
I understand that `vanilla` benchmarks are expected to be more stable than our changes, so they could be used to find some sort of normalization or noise floor for the results, making them comparable.
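To make the normalization idea a bit more concrete, here is a rough sketch (the helper and the 30 s example value are hypothetical; only the vanilla timings are the ones above, and the 4.82s outlier is exactly what makes picking a baseline hard):

```python
# Rough sketch: express a result as a multiple of the vanilla baseline
# measured in the same run, so per-instance noise partially cancels out.
# The helper name and the 30 s example value are made up for illustration.
from statistics import mean

def normalized(result_seconds: float, vanilla_seconds: list[float]) -> float:
    baseline = mean(vanilla_seconds)
    return result_seconds / baseline

# A hypothetical 30 s result against the three "stable" vanilla timings above:
print(normalized(30.0, [12.880, 12.168, 11.785]))  # -> roughly 2.4x the baseline
```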
Further, looking at the database where we store the results, here's a histogram for `test_bulk_insert[vanilla]` over the last 100 runs (covering 2022-11-07 to 2022-12-08) and over all metrics (843 of them, going back to 2022-02-15).

I did not look at all of the results this closely, mostly just the 4 points you can find in a google sheet.

I think this means that we cannot really make any decisions based on the benchmarks as they are right now, because the noise level is so high.
I don't think it's feasible; it already takes a very long time to run the suite (1h).
Metal instances would be an easy way, but costly. We don't need the largest box, though; we just want a definable noise level.
I do wonder if all of the benchmarks should be started with flushed VM caches as well, so there's no interference from previous runs. Similar to local benchmarking, I wonder if there's some tuning that needs to be done on the cloud instances (disable turbo boost, use the performance governor)?
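Something along these lines is what I have in mind, assuming the runners are Linux and the harness can run privileged commands; whether these knobs are even exposed on virtualized cloud instances is part of the question:

```python
# Rough sketch of pre-benchmark host tuning; assumes a Linux runner with root.
import subprocess
from pathlib import Path

def drop_vm_caches() -> None:
    # Write out dirty pages first, then drop page cache, dentries and inodes.
    subprocess.run(["sync"], check=True)
    Path("/proc/sys/vm/drop_caches").write_text("3\n")

def use_performance_governor() -> None:
    # Pin every core's cpufreq governor to "performance".
    for gov in Path("/sys/devices/system/cpu").glob("cpu[0-9]*/cpufreq/scaling_governor"):
        gov.write_text("performance\n")

def disable_turbo_boost() -> None:
    # Only the intel_pstate driver exposes this knob; other drivers differ.
    no_turbo = Path("/sys/devices/system/cpu/intel_pstate/no_turbo")
    if no_turbo.exists():
        no_turbo.write_text("1\n")
```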
If metal instances are not feasible, does AWS provide some after-the-fact telemetry which could help us normalize the results? CPU %steal comes to mind, but is there even something similar for I/O?
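As far as I know, steal time is not in the default `AWS/EC2` CloudWatch metrics; assuming the CloudWatch agent were installed on the runners and configured to publish `cpu_usage_steal` into the `CWAgent` namespace (an assumption, not something we do today), pulling it for a benchmark window would look roughly like this:

```python
# Sketch: fetch the average CPU steal time for a benchmark's time window.
# Namespace, metric and dimension names assume a default CloudWatch agent
# setup, which is an assumption about configuration we don't currently have.
from datetime import datetime
import boto3

def average_steal(instance_id: str, start: datetime, end: datetime) -> float:
    cw = boto3.client("cloudwatch")
    resp = cw.get_metric_statistics(
        Namespace="CWAgent",
        MetricName="cpu_usage_steal",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=start,
        EndTime=end,
        Period=60,
        Statistics=["Average"],
    )
    points = [dp["Average"] for dp in resp["Datapoints"]]
    return sum(points) / len(points) if points else 0.0
```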
Regardless of how the benchies are run, we must validate them with these "assumed unchanged" vanilla runs: if the vanilla results deviate too much, just cancel the benchmark run. This will break down more if the benchmarks run for too long, though.
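A minimal sketch of that validation, assuming we can pull the recent vanilla timings from the results database (the helper name and the 3-sigma threshold are placeholders):

```python
# Sketch: discard a benchmark run when its vanilla timing deviates too much
# from recent history. Threshold and history source are placeholders.
from statistics import mean, stdev

def vanilla_looks_sane(current: float, history: list[float], max_sigma: float = 3.0) -> bool:
    """Return False when the vanilla result is too far from the stored
    results, meaning the whole run should be cancelled/ignored."""
    if len(history) < 2:
        return True  # not enough data to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current == mu
    return abs(current - mu) / sigma <= max_sigma
```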
As an idea, I think it should be possible to run benchmarks on the CI runners at least on weekends. There shouldn't be any activity there, and these instances are not on demand, so we already pay for them. It could run the suite and bisect if needed to find the commits which decreased performance over the week.
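For the bisect part, a probe script whose exit code tells `git bisect run` whether a commit is good would be enough; the benchmark invocation and the threshold below are placeholders, not how our suite is actually driven:

```python
#!/usr/bin/env python3
# Sketch of a bisect probe: run one cheap, representative benchmark for the
# currently checked-out commit and turn its duration into an exit code.
import subprocess
import sys
import time

THRESHOLD_SECONDS = 15.0  # made-up acceptable wall-clock time for the probe

def run_probe_benchmark() -> float:
    start = time.monotonic()
    # Placeholder command; the real suite invocation would go here.
    subprocess.run(["pytest", "-q", "-k", "test_bulk_insert"], check=True)
    return time.monotonic() - start

if __name__ == "__main__":
    # Exit 0 = "good" commit, non-zero = "bad", which is the contract
    # `git bisect run` expects.
    sys.exit(0 if run_probe_benchmark() <= THRESHOLD_SECONDS else 1)
```

Driven over the week's range with something like `git bisect start <bad-commit> <good-commit>` followed by `git bisect run ./probe.py`.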