Add an API to compare benchmark results #5297

Closed
4 tasks done
sheremet-va opened this issue Feb 26, 2024 · 17 comments · Fixed by #5398
Labels
feat: benchmark (Issues and PRs related to the benchmark feature) · p3-significant (High priority enhancement) · pr welcome

Comments

@sheremet-va
Member

sheremet-va commented Feb 26, 2024

Clear and concise description of the problem

There is currently no way to compare benchmark results between different runs. For example, it's possible to change implementation details without changing the benchmark itself, and there is no way to confirm that the change didn't introduce any regressions except for manually taking screenshots of the terminal.

Suggested solution

Provide a flag for the vitest bench command (e.g. --compare=./bench.json) to compare the current benchmark run with a previous one. We already support dumping benchmark results as JSON via --reporter=json, so that output can be reused for diffing.
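To illustrate the kind of diffing this flag could do, here is a minimal sketch that compares two JSON outputs. It assumes each file maps a suite name to an array of { name, hz } entries, which may not match the real reporter format exactly:

```ts
// compare-bench.ts - a sketch only; the actual --reporter=json shape may differ.
import fs from 'node:fs'

interface BenchEntry { name: string; hz: number }
type BenchReport = Record<string, BenchEntry[]>

const [baselinePath, currentPath] = process.argv.slice(2)
const baseline: BenchReport = JSON.parse(fs.readFileSync(baselinePath, 'utf8'))
const current: BenchReport = JSON.parse(fs.readFileSync(currentPath, 'utf8'))

for (const [suite, benches] of Object.entries(current)) {
  for (const bench of benches) {
    const previous = baseline[suite]?.find(b => b.name === bench.name)
    if (!previous)
      continue
    // ratio > 1 means the current run is faster (more operations per second)
    const ratio = bench.hz / previous.hz
    console.log(`${suite} > ${bench.name}: ${ratio.toFixed(2)}x (${bench.hz.toFixed(2)}hz vs ${previous.hz.toFixed(2)}hz)`)
  }
}
```

The two input files would come from the JSON reporter output mentioned above, produced on two different branches or commits.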

Alternative

No response

Additional context

I think it should print the results in a table the same way we do now, just also compared against the previously stored results.

I think it might also be nice to store some information about the current machine (OS, cores, anything that can influence the result) in that benchmark and show it in the table.
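As a sketch of what that machine metadata could contain (the field names here are illustrative, not a proposed format):

```ts
import os from 'node:os'

// Collected once per run and stored next to the benchmark results, so a later
// comparison can warn when the baseline came from a different machine.
const machine = {
  platform: os.platform(),                 // e.g. 'linux', 'darwin', 'win32'
  osRelease: os.release(),
  arch: os.arch(),
  cpuModel: os.cpus()[0]?.model,
  cpuCount: os.cpus().length,
  totalMemoryMb: Math.round(os.totalmem() / (1024 * 1024)),
  nodeVersion: process.version,
}
```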

sheremet-va added the feat: benchmark, p3-significant, and pr welcome labels on Feb 26, 2024
@mrazauskas

What about https://codspeed.io or https://bencher.dev? I like how Bencher is calling it continuous benchmarking.

@sheremet-va
Member Author

sheremet-va commented Feb 26, 2024

What about https://codspeed.io or https://bencher.dev? I like how Bencher is calling it continuous benchmarking.

These are all external tools, and they are meant to be used in CI. Since we don't control what machine runs the benchmark, this issue is only about a manual check on your own machine.

It's about iterating on the same function, not about catching regressions.

@mrazauskas

It's about iterating on the same function, not about catching regressions.

Sort of comparing performance between two branches? Would that be possible without JSON being written to disk?

@sheremet-va
Member Author

Would that be possible without JSON being written to disk?

The idea I am describing in the issue is to use the result from the JSON reporter, which already dumps the values to disk.

@hi-ogawa
Contributor

Interesting idea!

Speaking of the --compare flag, I remembered this webpack plugin, which has a similar idea and compares bundle stats between the current run and a baseline: https://github.com/relative-ci/bundle-stats/tree/master/packages/webpack-plugin#bundlestatswebpackpluginoptions

It looks like their approach is to require either baseline: true or BUNDLE_STATS_BASELINE=true to explicitly dump the current result as a baseline; that baseline is then automatically compared against on subsequent runs, since compare: true is the default.
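Roughly, their setup looks like this (reconstructed from memory of the linked README, so the option names should be double-checked):

```ts
// webpack.config.ts
import { BundleStatsWebpackPlugin } from 'bundle-stats-webpack-plugin'

export default {
  plugins: [
    new BundleStatsWebpackPlugin({
      // Opt in explicitly to save the current run as the baseline
      // (alternatively, set BUNDLE_STATS_BASELINE=true in the environment).
      baseline: true,
      // compare defaults to true, so later runs are diffed against the saved baseline.
      compare: true,
    }),
  ],
}
```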

@sheremet-va
Member Author

We already support benchmark json output. We decided to remove the samples from it and introduce a new API to tinybench that allows comparing results with the provided benchmark run without actually running the code.

@antoine-coulon

Hello @sheremet-va

That's actually a great idea. I'm working on skott, and in particular on a pull request that is meant to compare skott's performance over time when running analysis on specific JS/TS projects.

The current way I'm implementing result comparison is by only using the n and n-1 versions, using git diff to see the changes, but it would indeed be great to have a complete history and store more than one previous version. For that I tried CodSpeed, but the CI just takes too long; I don't know if it's related to the free tier or not, but the job takes more than 1h50 to complete, while just using the vitest bench API takes a few minutes, so it's not an option for me at all.

I think it might also be nice to store some information about the current machine (OS, cores, anything that can influence the result) in that benchmark and show it in the table.

To be honest my primary concern was to run in the CI to limit hardware-related variations, even though adding OS related information could be useful when running benchmarks from a local machine.

I feel like trying to diff a set of benchmarks that adds OS properties in the output might be confusing; you might want to do that at the filename level instead, so that each OS combination gets its own diff?

Currently I'm rewriting the outputFile generated by Vitest into something like result-node-18.x.json, and I'm not mentioning the OS because this CI job only runs on Ubuntu for now, even though it could become something like node-18-ubuntu.json. I also found it useful to add things like the git hash and the branch the benchmark was generated from.
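Roughly, that rewriting step looks something like this (a simplified sketch; the file and field names are just what I happen to use):

```ts
import fs from 'node:fs'
import { execSync } from 'node:child_process'

// Wrap the raw Vitest benchmark output with run metadata and store it under
// a per-environment file name (e.g. result-node-18.x.json).
const raw = JSON.parse(fs.readFileSync('bench-output.json', 'utf8'))

const annotated = {
  nodeVersion: process.versions.node,
  gitSha: execSync('git rev-parse HEAD').toString().trim(),
  gitBranch: execSync('git rev-parse --abbrev-ref HEAD').toString().trim(),
  results: raw,
}

const nodeMajor = process.versions.node.split('.')[0]
fs.writeFileSync(`result-node-${nodeMajor}.x.json`, JSON.stringify(annotated, null, 2))
```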

So at some point I was wondering: should I create a Vitest plugin and do something like CodSpeed, but instead of storing the data in a private cloud, just emit the data files to a dedicated location and provide a way to compare them over time? But if you're willing to integrate that into the core, it might not be that relevant. What do you think?

@sheremet-va
Member Author

To be honest my primary concern was to run in the CI to limit hardware-related variations, even though adding OS related information could be useful when running benchmarks from a local machine.

Adding OS-related information is just to give a clear message that the benchmark might be off because it was running on another machine, or even throw an error.

Currently I'm rewriting the outputFile generated by Vitest into something like result-node-18.x.json, and I'm not mentioning the OS because this CI job only runs on Ubuntu for now, even though it could become something like node-18-ubuntu.json. I also found it useful to add things like the git hash and the branch the benchmark was generated from.

How do you store the generated output in CI?

@antoine-coulon

Adding OS-related information is just to give a clear message that the benchmark might be off because it was running on another machine, or even throw an error.

Yeah, better to have more information than not enough. Also, I'm not sure how stable the GitHub Actions hardware is when not using custom pools; the default pools might have agents with more or fewer cores, so variations can indeed happen there as well.

How do you store the generated output in CI?

For now it's only a JSON file written and committed by the CI in my repo (at a specific location near the benchmark files themselves), but it could be more sophisticated.

Consequently, for now I'm not storing the full history of benchmarks; I'm just overwriting the file with the new results each time a new PR runs, which lets me diff the n and n-1 versions. But it would be great to keep track of all the benchmarks nonetheless, to track big regressions/improvements over time; n and n-1 is a very small dataset.

@hi-ogawa
Contributor

hi-ogawa commented Mar 17, 2024

Assuming that the use case is mostly a local comparison, I made a prototype as a custom reporter:
https://stackblitz.com/edit/vitest-dev-vitest-mm6syc?file=vite.config.ts

demo
# I used VITEST_BENCH_xxx env var for prototype

# save bench data on main branch
# --benchmark.outputFile=main.json
$ VITEST_BENCH_OUTPUT_FILE=main.json npx vitest bench --run
...

# suppose you switched branches and want to compare against main
# --benchmark.compare=main.json
$ VITEST_BENCH_COMPARE=main.json npx vitest bench --run
...
 RUN  v1.4.0 /home/projects/vitest-dev-vitest-mm6syc

[... current default reporter output ...]
 ✓ test/basic.bench.ts (2) 2483ms
   ✓ sort (2) 2471ms
     name          hz      min      max     mean      p75      p99     p995     p999      rme  samples
   · normal   58.7590  13.7350  40.1650  17.0187  16.4350  40.1650  40.1650  40.1650  ±13.39%       30   fastest
   · reverse  10.6565  74.6000   115.93  93.8395   114.16   115.93   115.93   115.93  ±14.26%       10  

[... custom compare reporter ....]
 [BENCH] Comparison
   current  : bench-default.json
   baseline : main.json

sort
  normal 58.759hz [baseline: 73.419hz] [change: 0.80x ⇓]
  reverse 10.656hz [baseline: 10.870hz] [change: 0.98x ⇓]
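
The core of the reporter is roughly like this (a trimmed-down sketch rather than the actual StackBlitz code; the flat { name: hz } baseline format and the task traversal are simplifications, and the exact type imports may differ between Vitest versions):

```ts
// compare-reporter.ts - simplified sketch of the prototype's idea.
import fs from 'node:fs'
import type { File, Task } from 'vitest'

// Baseline produced by a previous run; assumed here to be a flat { name: hz } map.
const baselinePath = process.env.VITEST_BENCH_COMPARE

function collect(tasks: Task[], out: Record<string, number>) {
  for (const task of tasks) {
    const hz = (task.result as any)?.benchmark?.hz
    if (typeof hz === 'number')
      out[task.name] = hz
    if ('tasks' in task)
      collect((task as any).tasks, out)
  }
}

export default {
  onFinished(files: File[] = []) {
    if (!baselinePath)
      return
    const baseline: Record<string, number> = JSON.parse(fs.readFileSync(baselinePath, 'utf8'))
    const current: Record<string, number> = {}
    for (const file of files)
      collect(file.tasks, current)
    for (const [name, hz] of Object.entries(current)) {
      if (baseline[name])
        console.log(`${name}: ${hz.toFixed(3)}hz [baseline: ${baseline[name].toFixed(3)}hz] [change: ${(hz / baseline[name]).toFixed(2)}x]`)
    }
  },
}
```

It can then be registered next to the default reporter, e.g. through the benchmark.reporters option in vite.config.ts.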

A few things I noticed:

  • isn't this technically a reporter feature? If so, having a flag only for this reporter seems unusual. But being able to quickly pass --compare from the CLI would be important, so I think it's fine to make it a special flag.
  • currently the json reporter somehow lacks the bench filename (the data is only { suitename: BenchmarkResult[] }). So, to persist a previous run, we can probably forget about the current json reporter format and start with a better internal one. We can also exclude the gigantic samples array, since I don't think it's needed for the comparison report.

FYI, I was also looking at prior art; for example, criterion.rs (https://bheisler.github.io/criterion.rs/book/user_guide/command_line_options.html#baselines) has three flags for comparison purposes:

  • --save-baseline: like --benchmark.outputFile
  • --baseline: like a proposed --compare
  • --load-baseline: this allows skipping the benchmark run and loading an existing result as the current run.

@hi-ogawa
Contributor

We decided to remove the samples from it and introduce a new API to tinybench that allows comparing results with the provided benchmark run without actually running the code.

@sheremet-va Can you elaborate on this? I thought this was a Vitest benchmark reporter feature, but do you want to move it to tinybench?

@sheremet-va
Member Author

sheremet-va commented Mar 18, 2024

Assuming that the use case is mostly a local comparison, I made a prototype as a custom reporter

If I remember correctly, the reporter already gets the result sorted - so we can't do this in our own reporter because custom reporters would have to reimplement it.

Can you elaborate on this? I thought this was a Vitest benchmark reporter feature, but do you want to move it to tinybench?

I wanted to make sure that tinybench supports providing results for benchmarks that are not actually run.

  • isn't this technically a reporter feature? if so, having a flag only for this reporter seems unusual.

If this is a first-class feature, I think it is fine. We already expose some flags that are only relevant for reporters.

  • currently json reporter somehow lacks bench filename

We can change the JSON format. The benchmark feature is experimental and doesn't follow semver.
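
To make that concrete, a persisted format could look something like this (purely a sketch; none of these names are a decided design):

```ts
// A possible internal format for persisted benchmark runs (sketch only).
interface BenchmarkRunFile {
  machine: {
    platform: string
    cpuModel: string
    cpuCount: number
  }
  files: Array<{
    filepath: string            // the bench filename the current json reporter lacks
    suites: Array<{
      name: string
      benchmarks: Array<{
        name: string
        hz: number
        mean: number
        min: number
        max: number
        p75: number
        p99: number
        rme: number
        sampleCount: number     // instead of persisting the full samples array
      }>
    }>
  }>
}
```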

@hi-ogawa
Contributor

If I remember correctly, the reporter already gets the result sorted - so we can't do this in our own reporter because custom reporters would have to reimplement it.

What do you mean by "sorted"? I don't know how the current default TTY reporter works, but if the comparison summary is only required at the end, like in my prototype, then the reporter has complete information via onFinished(files: File[]), where Test.result.benchmark holds the full tinybench TaskResult output.

@sheremet-va
Member Author

sheremet-va commented Mar 18, 2024

What do you mean by "sorted"?

The result in suite.result.benchmark is sorted based on the median result (fastest first). I expect previous benchmarks to appear in the same table, sorted and marked like ${name} (previous) or something.

This is just my expectation based on what we already do. If other frameworks do it differently and more ergonomically, then we can follow them, but I would expect to see the difference in all table columns, no?

@hi-ogawa
Contributor

hi-ogawa commented Mar 18, 2024

I expect previous benchmarks to appear in the same table, sorted and marked like ${name} (previous) or something.

I might still be missing something, but let me confirm in my PR. My proof of concept currently shows the comparison separately at the end, but I think it's possible to show it together in the current table, as you expect, without changing anything in tinybench. Let me try that in the PR.

@hi-ogawa
Contributor

It's a rough mockup, but it's possible to create a table like this:

[image: mockup of the comparison table]

@sheremet-va Is the direction okay, or did you have something else in mind?

@sheremet-va
Member Author

sheremet-va commented Mar 26, 2024

Is the direction okay, or did you have something else in mind?

Yes, this is how I imagined it. I am not qualified enough to tell if it's better this way, so I would wait for more feedback.

What we can also do is reverse the table and duplicate the column instead:

https://github.com/google/benchmark/blob/main/docs/tools.md#modes-of-operation

Also, maybe we should print two tables: the first the same as we print now, and the second showing the difference in each field:

[screenshot: mockup of a second table showing the per-field differences]

Side note: it would be awesome if we could create a graph 👀
https://pytest-benchmark.readthedocs.io/en/latest/comparing.html
