While using the label `run-benchmarks` for #2873 and #2875, I noticed that the `test_bulk_insert[vanilla]` test case was producing some interesting numbers.

Similarly, for `test_copy[vanilla].copy_extend` I found (for the same 4 points):
- 12.880s
- 12.168s
- 11.785s
- 4.82s
I understand that `vanilla` benchmarks are expected to be more stable than our changes, so they could be used to find some sort of normalization or noise floor for the results, making them comparable.
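To make the normalization idea a bit more concrete, here is a rough sketch (the helper and the 30 s example value are hypothetical; only the vanilla timings are the ones above, and the 4.82s outlier is exactly what makes picking a baseline hard):

```python
# Rough sketch: express a result as a multiple of the vanilla baseline
# measured in the same run, so per-instance noise partially cancels out.
# The helper name and the 30 s example value are made up for illustration.
from statistics import mean

def normalized(result_seconds: float, vanilla_seconds: list[float]) -> float:
    baseline = mean(vanilla_seconds)
    return result_seconds / baseline

# A hypothetical 30 s result against the three "stable" vanilla timings above:
print(normalized(30.0, [12.880, 12.168, 11.785]))  # -> roughly 2.4x the baseline
```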
Further, looking at the database where we store the results, here's a histogram for `test_bulk_insert[vanilla]` over the last 100 runs (covering 2022-11-07 to 2022-12-08) and over all metrics (843 of them, going back to 2022-02-15).

I did not look at all of the results this closely, mostly just the 4 points you can find in a google sheet.

I think this means that we cannot really make any decisions based on the benchmarks as they are right now, because the noise level is so high.
I don't think it's feasible; it already takes a very long time to run the suite (1h).
Metal instances would be an easy way, but costly. We don't need the largest box, though; we just want a definable noise level.
I do wonder if all of the benchmarks should be started with flushed VM caches as well, so there's no interference from previous runs. Similar to local benchmarking, I wonder if there's some tuning that needs to be done on the cloud instances (disable turbo boost, use the performance governor)?
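Something along these lines is what I have in mind, assuming the runners are Linux and the harness can run privileged commands; whether these knobs are even exposed on virtualized cloud instances is part of the question:

```python
# Rough sketch of pre-benchmark host tuning; assumes a Linux runner with root.
import subprocess
from pathlib import Path

def drop_vm_caches() -> None:
    # Write out dirty pages first, then drop page cache, dentries and inodes.
    subprocess.run(["sync"], check=True)
    Path("/proc/sys/vm/drop_caches").write_text("3\n")

def use_performance_governor() -> None:
    # Pin every core's cpufreq governor to "performance".
    for gov in Path("/sys/devices/system/cpu").glob("cpu[0-9]*/cpufreq/scaling_governor"):
        gov.write_text("performance\n")

def disable_turbo_boost() -> None:
    # Only the intel_pstate driver exposes this knob; other drivers differ.
    no_turbo = Path("/sys/devices/system/cpu/intel_pstate/no_turbo")
    if no_turbo.exists():
        no_turbo.write_text("1\n")
```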
If metal instances are not feasible, does AWS provide some after-the-fact telemetry which could help us normalize the results? CPU %steal comes to mind, but is there even something similar for I/O?
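As far as I know, steal time is not in the default `AWS/EC2` CloudWatch metrics; assuming the CloudWatch agent were installed on the runners and configured to publish `cpu_usage_steal` into the `CWAgent` namespace (an assumption, not something we do today), pulling it for a benchmark window would look roughly like this:

```python
# Sketch: fetch the average CPU steal time for a benchmark's time window.
# Namespace, metric and dimension names assume a default CloudWatch agent
# setup, which is an assumption about configuration we don't currently have.
from datetime import datetime
import boto3

def average_steal(instance_id: str, start: datetime, end: datetime) -> float:
    cw = boto3.client("cloudwatch")
    resp = cw.get_metric_statistics(
        Namespace="CWAgent",
        MetricName="cpu_usage_steal",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=start,
        EndTime=end,
        Period=60,
        Statistics=["Average"],
    )
    points = [dp["Average"] for dp in resp["Datapoints"]]
    return sum(points) / len(points) if points else 0.0
```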
Regardless of how the benchies are run, we must validate them with these "assumed unchanged" vanilla runs: if the vanilla results deviate too much, just cancel the benchmark run. This will break down more if the benchmarks run for too long, though.
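A minimal sketch of that validation, assuming we can pull the recent vanilla timings from the results database (the helper name and the 3-sigma threshold are placeholders):

```python
# Sketch: discard a benchmark run when its vanilla timing deviates too much
# from recent history. Threshold and history source are placeholders.
from statistics import mean, stdev

def vanilla_looks_sane(current: float, history: list[float], max_sigma: float = 3.0) -> bool:
    """Return False when the vanilla result is too far from the stored
    results, meaning the whole run should be cancelled/ignored."""
    if len(history) < 2:
        return True  # not enough data to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current == mu
    return abs(current - mu) / sigma <= max_sigma
```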
As an idea, I think it should be possible to run benchmarks on the CI runners at least on weekends. There shouldn't be any activity there, and these instances are not on demand, so we already pay for them. It could run the suite and bisect if needed to find the commits which decreased performance over the week.
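For the bisect part, a probe script whose exit code tells `git bisect run` whether a commit is good would be enough; the benchmark invocation and the threshold below are placeholders, not how our suite is actually driven:

```python
#!/usr/bin/env python3
# Sketch of a bisect probe: run one cheap, representative benchmark for the
# currently checked-out commit and turn its duration into an exit code.
import subprocess
import sys
import time

THRESHOLD_SECONDS = 15.0  # made-up acceptable wall-clock time for the probe

def run_probe_benchmark() -> float:
    start = time.monotonic()
    # Placeholder command; the real suite invocation would go here.
    subprocess.run(["pytest", "-q", "-k", "test_bulk_insert"], check=True)
    return time.monotonic() - start

if __name__ == "__main__":
    # Exit 0 = "good" commit, non-zero = "bad", which is the contract
    # `git bisect run` expects.
    sys.exit(0 if run_probe_benchmark() <= THRESHOLD_SECONDS else 1)
```

Driven over the week's range with something like `git bisect start <bad-commit> <good-commit>` followed by `git bisect run ./probe.py`.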