Benchmark stats v2 #542
base: main
Conversation
Codecov Report
@@           Coverage Diff           @@
##             main     #542   +/-  ##
=======================================
  Coverage   91.76%   91.76%
=======================================
  Files           6        6
  Lines        1821     1821
=======================================
  Hits         1671     1671
  Misses        150      150
=======================================
Continue to review full report at Codecov.
Force-pushed from de98a60 to 4d0f705.
for more information, see https://pre-commit.ci
My goodness, you really are going for it.
No problem (as long as you're not put off by our inconsistent punctuality w.r.t. reviewing). Just some remarks based on experience trying to benchmark something (namely this thing) with pretty much exactly the same criteria of comparing the speed of different tasks of varying scale with different libraries and the previous version of the same library; take them or leave them:

I generally found that taking the mean/standard deviation/significance testing or any kind of stats doesn't work for speed benchmarking. The collected times would be thousands of tiny times plus a very small number of enormous times where system interrupts or garbage collection happened, so a mean of several thousand samples really ends up measuring whether said interrupts happened 0, 1, or 2 times, which gives very poor reproducibility.

The obvious fix for that is to brute-force run so many samples that even with the interrupts it all averages out nicely, but... if you leave a CPU-intensive process running long enough you start to discover that the kernel keeps the CPU running slow by default to save power, then progressively steps it up when it becomes evident that it's got a lot of processing to do, so when you look at the raw times you see several discrete steps. If you don't look at the raw times and only consider means/standard deviations, then you will observe that whichever test you run first is statistically significantly slower than anything run immediately afterwards, even if test 2 is just a rerun of test 1.

What I have found works surprisingly well is to throw away statistics entirely (along with the rest of my maths degree) and just do a very dense scatter plot of every single run, i.e. x axis is

(P.S. In terms of prettiness, plotly eats matplotlib+seaborn for breakfast.)
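To illustrate the outlier sensitivity described above, here is a tiny sketch (the timing values are made up for illustration): a single interrupt-sized spike moves the mean by 50% while the median doesn't budge.

```python
import statistics

# 999 "clean" runs of 1.0 ms plus one 500 ms spike (e.g. a GC pause or interrupt).
times = [1.0] * 999 + [500.0]

mean = statistics.mean(times)      # 1.499: one sample inflates the mean by ~50%
median = statistics.median(times)  # 1.0: unaffected by the spike
```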
For numerical comparisons (e.g. automatic regression tests?), quantiles could be used, which would be insensitive to outliers, e.g. something like the 50th (= median) and 95th quantiles. But plotting the full distribution of run times seems fine to me as well and might expose regressions in weird edge cases.
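That idea needs nothing beyond the standard library; a minimal sketch with `statistics.quantiles` (the data here is synthetic stand-in values, not real timings):

```python
import statistics

# Synthetic "run times"; in practice these would be the collected samples.
times = list(range(1, 101))

# n=100 yields 99 cut points: index 49 is the median, index 94 the 95th percentile.
qs = statistics.quantiles(times, n=100, method="inclusive")
p50, p95 = qs[49], qs[94]
```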
@bwoodsend wrt other widely used foss projects, 6 days is a pretty fast latency. 😄 This PR (astanin/python-tabulate#152) is the wait time I'm used to.

The comments are helpful. It will take me time to digest some of the stats implications, but I'll add a bit to the motivation. I think it makes sense to have multiple ways of viewing the analysis, which is why I've chosen t-tests, openskill, and line/scatterplot visualization. When they all agree you can increase your confidence you are doing the stats right.

One more comment on the stats: I think having access to multiple runs, by serializing the results of each run and then combining the stats, will show a clearer picture. It's possible to combine the means / stds of multiple runs grouped by some criteria. If the stats don't show what we see when we look at the graphs, then I'm doing the stats wrong. It's tricky stuff, so I suppose that's to be expected. I feel like I have a good handle on the stats tools I've chosen to use. I think I need some more work on the paired t-tests, but that is more important to my other projects than this.

I think adding the scatter plot on top of the line plots makes sense. WRT plotly, it may be nicer, but seaborn is 100% free, so... tradeoffs.

Lastly, recent updates: I've reproduced this table:
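On combining means / stds of multiple runs: assuming each serialized run stores its sample count, mean, and (population) standard deviation per group, the summaries can be merged exactly via the second moment. A sketch (the function name and tuple layout are my own, not this PR's API):

```python
def combine_stats(groups):
    """Merge per-run (n, mean, std) summaries into one (n, mean, std).

    Uses population std: E[x^2] = std**2 + mean**2 per group, which is
    additive when weighted by sample count, so no raw samples are needed.
    """
    n_total = sum(n for n, _, _ in groups)
    mean = sum(n * m for n, m, _ in groups) / n_total
    second_moment = sum(n * (s**2 + m**2) for n, m, s in groups) / n_total
    std = (second_moment - mean**2) ** 0.5
    return n_total, mean, std

# Merging two runs gives the stats of the concatenated raw data:
# [1, 2] and [3, 4] -> mean 2.5, std sqrt(1.25).
merged = combine_stats([(2, 1.5, 0.5), (2, 3.5, 0.5)])
```

This is the distinction raised later about "the actual std, and not the std of the means": averaging the per-run stds would understate spread, whereas the pooled second moment recovers it exactly.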
And I've added a speedup table:
To give a sense of how the scatter plot looks for one machine (note I ran the benchmarks under different conditions, which is something I do want to allow for):

I think it will be useful to overlay the scatterplot on top of the line plot. Unfortunately seaborn does not have an option for this. I also have to rework the error intervals around each line. I think seaborn assumes the rows are all pointwise measures, but in this case it's useful to compress multiple stats into one row, so I need to tweak the plots a bit to reflect the actual std, and not the std of the means (which it currently does).
Notes to self:
Is this ready for review then?
I've lost the thread on what I was doing here. I have a copy of result_analysis.py in a different project, noticed a bug, and then fixed it in this place as well. There were also uncommitted changes that were in the last commit. Looking at my original post, my goals for this are fairly ambitious, and I'm not sure I'll have the time to polish the full "run many times on different hardware and aggregate results" feature set, but if I restrict focus to just producing a single set of benchmarks (i.e. suitable for graphs that you might show in a README), I might have some time to ensure that has a nice UX.

So, no. It's not ready for review. But if you're still interested in the feature, I can try to find time to pick this PR back up.
This PR supersedes #532 and is based off of main, so it does not include extra PRs.
This is still a work in progress, but it is coming along and starting to produce results. It is significantly more complex than the previous PR; however, that complexity is justified by several design goals, which are as follows:
To this end, I've started work on an experimental module I'm currently calling "benchmarker", which I've added to the test subdirectory. There is still a lot to clean up here, but the general structure is in place.
The benchmarker.py file contains a wrapper around timerit that makes it simpler to express benchmarks over a grid of varied parameters. The process_context collects information about the machine hardware / software so each benchmark knows the context in which it was run.
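To sketch the "grid of varied parameters" idea, here is a stdlib-only illustration using timeit rather than timerit; the function name and row format are invented for illustration and are not the benchmarker API:

```python
import itertools
import statistics
import timeit

def benchmark_grid(make_func, grid, repeat=5, number=100):
    """Time make_func(params)() over the cartesian product of grid values.

    grid maps parameter names to lists of values; one result row is
    produced per parameter combination.
    """
    rows = []
    for values in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        func = make_func(params)
        times = timeit.repeat(func, repeat=repeat, number=number)
        rows.append({**params, "min": min(times), "mean": statistics.mean(times)})
    return rows

# Example: vary input size for a trivial workload.
rows = benchmark_grid(
    lambda p: (lambda: sorted(range(p["size"]), reverse=True)),
    {"size": [10, 100, 1000]},
)
```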
The result_analysis.py file is ported from another project I'm working on that runs stats over a table of results. I was originally using this to compare hyperparameters wrt machine learning performance metrics, but it also applies when that performance metric is "time" and the hyperparameters are different libraries / inputs / settings. It's highly general.
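As one example of the kind of test involved: for matched samples (the same input timed under two library versions), the paired t statistic mentioned elsewhere in the thread reduces to the mean difference over its standard error. A hypothetical stdlib-only helper (real code would likely use scipy.stats.ttest_rel, which also returns a p-value):

```python
import math
import statistics

def paired_t(xs, ys):
    """t statistic for paired samples: mean of differences / its standard error."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / se

t = paired_t([1.0, 2.0, 4.0], [2.0, 3.0, 4.0])  # -> -2.0
```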
The util_json script holds json utilities I need to ensure the benchmarks are properly serialized. It may become removable as this PR matures and focuses on this use case.
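A common wrinkle such utilities address is types that json.dumps rejects by default; a sketch of the usual pattern (the types handled here are just examples, not necessarily what util_json covers):

```python
import json
import pathlib

def json_default(obj):
    """Fallback encoder for types the json module can't serialize natively."""
    if isinstance(obj, pathlib.PurePath):
        return str(obj)
    if isinstance(obj, set):
        return sorted(obj)
    raise TypeError(f"not JSON serializable: {type(obj)!r}")

record = {"out": pathlib.PurePosixPath("results/run1.json"), "tags": {"cpu", "ci"}}
text = json.dumps(record, default=json_default)
```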
The aggregate and visualize scripts will probably go away. I'm keeping them in for now as I continue development.
The script that uses "benchmarker" is currently called "benchmark3.py", and it will be a superset of what "benchmark.py" currently does.
Here is the current state of the visualization:
Current state of the statistics (currently marginalized over all sizes / inputs, but that can be refined):
And the OpenSkill analysis (which can be interpreted as the probability the chosen implementation / version will be fastest):
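OpenSkill fits a rating model to reach that probability, but the headline number can also be sanity-checked empirically from the raw runs. A simple cross-comparison sketch (this is not the OpenSkill algorithm, just an empirical win-rate estimate):

```python
import itertools

def pairwise_win_rates(times_by_impl):
    """Estimate P(a beats b) by comparing every run of a against every run of b."""
    rates = {}
    for a, b in itertools.combinations(times_by_impl, 2):
        wins = sum(
            ta < tb
            for ta in times_by_impl[a]
            for tb in times_by_impl[b]
        )
        rates[(a, b)] = wins / (len(times_by_impl[a]) * len(times_by_impl[b]))
    return rates

# Every v2 run beat every v1 run, so the estimated win probability is 1.0.
rates = pairwise_win_rates({"v2": [1.0, 1.1, 0.9], "v1": [1.2, 1.3, 1.4]})
```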
I still need to:
Submitting this now as it is starting to come together, and I'd be interested in feedback on adding what is effectively a benchmarking system to the repo. I'm thinking "benchmarker" can eventually become a standalone repo that is included as a benchmark dependency, but setting up and maintaining a separate repo is an endeavor, so if possible, I'd like to "pre-vendor" it here as a staging area where it can (1) be immediately useful and (2) prove itself / work out the kinks.