Benchmark instability and usefulness #3040

Open · koivunej opened this issue Dec 9, 2022 · 4 comments
Labels: a/benchmark (Area: related to benchmarking), a/ci (Area: related to continuous integration)

Comments

@koivunej (Contributor) commented Dec 9, 2022

While using the run-benchmarks label for #2873 and #2875, I noticed that the test_bulk_insert[vanilla] test case was producing some interesting numbers:

  1. 11.144s for 145e7e4
  2. 15.514s for Tokio-based walredo #2875 rebased on the above commit
  3. 11.898s for a46a81b
  4. 17.411s for Tokio-based walredo #2875 rebased on the above commit

Similarly, for test_copy[vanilla].copy_extend I found (for the same four points):

  1. 12.880s
  2. 12.168s
  3. 11.785s
  4. 4.82s

I understand that vanilla benchmarks are expected to be more stable than our changes, so they could be used to find some sort of normalization or noise floor for the results, making them comparable across runs.
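A minimal sketch of that idea (not existing code; the function name, baseline value, and numbers are made up): treat the vanilla result from the same run as a per-host noise factor and divide it out before comparing neon results across runs.

```python
# Hypothetical normalization helper, assuming we track a long-term baseline
# for the vanilla result. None of these names exist in the repo.
def normalize(neon_seconds: float, vanilla_seconds: float,
              vanilla_baseline_seconds: float) -> float:
    """Rescale a neon measurement as if vanilla had run at its usual baseline speed."""
    noise_factor = vanilla_seconds / vanilla_baseline_seconds
    return neon_seconds / noise_factor

# Example with made-up numbers: a 20.0 s neon result from a run where vanilla
# took 15.5 s against a ~11.5 s baseline scales down to ~14.8 s.
print(round(normalize(20.0, 15.5, 11.5), 1))  # 14.8
```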

Further, looking at the database where we store the results, here's a histogram for test_bulk_insert[vanilla] over the last 100 runs (covering 2022-11-07 to 2022-12-08) and over all recorded results (843 of them, since 2022-02-15):

(attached image: histogram of test_bulk_insert[vanilla] durations)

I did not look at all of the results this closely, mostly just at the four points above, which you can find in a Google sheet.

I think this means that we cannot really make any decisions based on the benchmarks as they are right now, because the noise level is so high.

koivunej added the a/ci (Area: related to continuous integration) and a/benchmark (Area: related to benchmarking) labels on Dec 9, 2022
@knizhnik (Contributor) commented Dec 9, 2022

To make the results more stable we would need to run these tests for much longer. I'm not sure that is possible.

@koivunej (Contributor, Author) commented:

"run these tests for much longer"

I don't think that's feasible; the suite already takes a very long time to run (about 1 h).

Bare-metal instances would be an easy fix, but costly. We don't need the largest box, though; we just want a definable noise level.

I also wonder whether all of the benchmarks should be started with flushed VM caches, so that there is no interference from previous runs. As with local benchmarking, I wonder whether some tuning needs to be done on the cloud instances (disable turbo boost, use the performance governor).
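For reference, a rough sketch of what that host preparation could look like (purely illustrative, not existing tooling; it assumes a Linux runner with root access and the intel_pstate driver):

```python
import subprocess

def prepare_host_for_benchmark() -> None:
    """Illustrative pre-benchmark tuning; requires root and bare-metal-like access."""
    # Write out dirty pages, then drop the page cache, dentries and inodes so a
    # previous benchmark run cannot warm the cache for the next one.
    subprocess.run(["sync"], check=True)
    subprocess.run(["sh", "-c", "echo 3 > /proc/sys/vm/drop_caches"], check=True)
    # Pin the frequency governor to 'performance' on all CPUs.
    subprocess.run(
        ["sh", "-c",
         "for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; "
         "do echo performance > $g; done"],
        check=True,
    )
    # Disable turbo boost (intel_pstate-specific path).
    subprocess.run(
        ["sh", "-c", "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"],
        check=True,
    )
```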

If bare-metal instances are not feasible, does AWS provide some after-the-fact telemetry that could help us normalize the results? CPU %steal comes to mind, but is there even an equivalent for IO?

Regardless of how the benchmarks are run, we should validate them against these "assumed unchanged" vanilla runs: if the vanilla results deviate too much, just cancel the benchmark run. This will break down if the benchmarks run for too long, though.
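A minimal sketch of such a guard, with made-up history values and a 2-sigma cut-off:

```python
from statistics import mean, stdev

def vanilla_within_noise(current_s: float, history_s: list[float],
                         max_sigma: float = 2.0) -> bool:
    """True if the vanilla result is within max_sigma standard deviations of its
    recent mean; False means the host was probably too noisy for useful numbers."""
    mu, sigma = mean(history_s), stdev(history_s)
    return abs(current_s - mu) <= max_sigma * sigma

# Hypothetical use inside the benchmark job: skip uploading/comparing results
# when the vanilla baseline is clearly off.
recent_vanilla = [11.1, 11.9, 12.0, 11.6, 12.9]  # made-up history, in seconds
if not vanilla_within_noise(17.4, recent_vanilla):
    raise SystemExit("vanilla baseline out of range; discarding this benchmark run")
```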

@koivunej (Contributor, Author) commented:

(attached image: histogram-2023-03-14)

The latest picture looks much better, though still a bit noisy.

@LizardWizzard (Contributor) commented Mar 14, 2023

As an idea, I think it should be possible to run benchmarks on the CI runners at least on weekends. There shouldn't be any activity there, and those instances are not on-demand, so we already pay for them. The job could run the suite and, if needed, bisect to find the commits that decreased performance over the week.
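A rough sketch of what the bisect step could be driven by, using `git bisect run` (the script path, benchmark selection, and threshold are all hypothetical):

```python
# bisect_benchmark.py -- exit 0 marks the commit "good", exit 1 marks it "bad",
# so it can be used as: git bisect run python bisect_benchmark.py
import subprocess
import sys

THRESHOLD_SECONDS = 13.0  # hypothetical: last known-good duration plus some margin

def run_benchmark_seconds() -> float:
    # Placeholder: build the current checkout, run one benchmark, and print
    # its wall-clock duration in seconds on stdout.
    out = subprocess.run(
        ["./scripts/run_single_benchmark.sh", "test_bulk_insert"],  # hypothetical script
        capture_output=True, text=True, check=True,
    )
    return float(out.stdout.strip())

if __name__ == "__main__":
    sys.exit(0 if run_benchmark_seconds() <= THRESHOLD_SECONDS else 1)
```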
