Benchmark stats v2 #542

Open · wants to merge 27 commits into main

Conversation

@Erotemic (Contributor) commented on May 26, 2022

This PR supersedes #532 and is based off of main, so it does not include extra PRs.

This is still a work in progress, but it is coming along and starting to produce results. It is significantly more complex than the previous PR, but the added complexity is justified by several design goals:

  1. I want to show pretty graphs
  2. I want to output pretty tables
  3. I want to vary the size of the inputs of the current benchmark to show scalability
  4. I want repeated runs of the benchmarks to serialize their results and accumulate them, so statistics from multiple runs (over multiple machines) can be combined and compared over time.
  5. I want to be able to run t-tests to quantify the probability that a performance regression has been introduced across different versions of ujson (or other JSON libraries, for that matter).

To this end, I've started work on an experimental module I'm currently calling "benchmarker", which I've added to the test subdirectory. There is still a lot to clean up here, but the general structure is in place.

  • The benchmarker.py file contains a wrapper around timerit that makes it simpler to express benchmarks over a grid of varied parameters.

  • The process_context collects information about the machine hardware / software so each benchmark knows the context in which it was run.

  • The result_analysis.py file is ported from another project I'm working on that runs stats over a table of results. I was originally using this to compare hyperparameters wrt machine learning performance metrics, but it also applies when that performance metric is "time" and the hyperparameters are different libraries / inputs / settings. It's highly general.

  • The util_json script contains JSON utilities I need to ensure the benchmark results are properly serialized. It may become removable as this PR matures and focuses on this use case.

  • The aggregate and visualize scripts will probably go away. I'm keeping them in for now as I continue development.

The script that uses "benchmarker" is currently called "benchmark3.py", and it will be a superset of what "benchmark.py" currently does.
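
To make the idea concrete, here is a rough sketch (not the actual benchmarker.py code) of the kind of grid loop it wraps, written directly against timerit; the libraries, sizes, and input data below are placeholders:

```python
import itertools
import json

import timerit
import ujson

LIBRARIES = {'json': json, 'ujson': ujson}   # placeholder grid axis
SIZES = [10, 100, 1000]                      # placeholder grid axis

rows = []
for (impl_name, impl), size in itertools.product(LIBRARIES.items(), SIZES):
    data = [{'key': i, 'value': str(i)} for i in range(size)]
    ti = timerit.Timerit(num=100, bestof=10, verbose=0)
    for timer in ti:
        with timer:
            impl.dumps(data)
    # One result row per cell of the (implementation x size) grid.
    rows.append({'impl': impl_name, 'size': size,
                 'min': ti.min(), 'mean': ti.mean()})

for row in rows:
    print(row)
```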

Here is the current state of the visualization:
[screenshot of the current visualization]

Current state of the statistics (marginalized over all sizes / inputs for now, but that can be refined):

PARAMETER: impl - METRIC: mean_time
===================================
mean_time   count      mean       std           min       25%       50%       75%       max
impl                                                                                       
orjson       20.0  0.000043  0.000074  4.760000e-07  0.000002  0.000009  0.000049  0.000300
ujson        20.0  0.000182  0.000338  6.901000e-07  0.000004  0.000032  0.000185  0.001391
nujson       20.0  0.000304  0.000518  6.606000e-07  0.000005  0.000044  0.000255  0.001733
json         20.0  0.000361  0.000597  2.347300e-06  0.000010  0.000053  0.000451  0.002308
simplejson   20.0  0.000452  0.000710  3.685600e-06  0.000011  0.000056  0.000671  0.002632

ANOVA: If p is low, the param 'impl' might have an effect
  Rank-ANOVA: p=0.04903523
  Mean-ANOVA: p=0.09580734

Pairwise T-Tests
  If p is low, impl=orjson may outperform impl=ujson.
    ttest_ind:  p=0.04442984
  If p is low, impl=ujson may outperform impl=nujson.
    ttest_ind:  p=0.19194550
  If p is low, impl=nujson may outperform impl=json.
    ttest_ind:  p=0.37413442
  If p is low, impl=json may outperform impl=simplejson.
    ttest_ind:  p=0.33158304
  param_name     metric  anova_rank_H  anova_rank_p  anova_mean_F  anova_mean_p
0       impl  mean_time      9.534891      0.049035      2.033863      0.095807
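
For reference, a minimal sketch of how numbers like these could be computed with scipy.stats; this is not the actual result_analysis.py code, and the file name and column names ('impl', 'mean_time') are assumptions:

```python
import pandas as pd
from scipy import stats

# One row per benchmark measurement, e.g. columns: impl, size, mean_time.
df = pd.read_json('benchmark_results.json')   # hypothetical results file

groups = [g['mean_time'].values for _, g in df.groupby('impl')]
rank_H, rank_p = stats.kruskal(*groups)    # "Rank-ANOVA" (Kruskal-Wallis H)
mean_F, mean_p = stats.f_oneway(*groups)   # "Mean-ANOVA" (one-way F-test)
print(f'Rank-ANOVA: H={rank_H:.4f} p={rank_p:.8f}')
print(f'Mean-ANOVA: F={mean_F:.4f} p={mean_p:.8f}')

# Pairwise comparison of two implementations.
a = df.loc[df['impl'] == 'orjson', 'mean_time']
b = df.loc[df['impl'] == 'ujson', 'mean_time']
print(f'ttest_ind: p={stats.ttest_ind(a, b).pvalue:.8f}')
```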

And the OpenSkill analysis (which can be interpreted as the probability the chosen implementation / version will be fastest):

skillboard.ratings = {
    ('json', '2.0.9')       : Rating(mu=17.131378529535343, sigma=5.9439028153361395),
    ('nujson', '1.35.2')    : Rating(mu=11.38071601274192, sigma=6.063976195156715),
    ('orjson', '3.6.8')     : Rating(mu=60.1867466539136, sigma=7.080517291679502),
    ('simplejson', '3.17.6'): Rating(mu=3.5265870256595813, sigma=6.2092484551816725),
    ('ujson', '5.2.1.dev28'): Rating(mu=38.31167062257107, sigma=6.320048263068908),
}
win_probs = {
    ('json', '2.0.9'): 0.17368695029693812,
    ('nujson', '1.35.2'): 0.15447983652936728,
    ('orjson', '3.6.8'): 0.2981884925614975,
    ('simplejson', '3.17.6'): 0.12972249212852724,
    ('ujson', '5.2.1.dev28'): 0.2439222284836699,
}

I still need to:

  • Get the plots/analysis working with aggregated benchmark results
  • Reproduce the existing tables for the README (with percentage speedup / slowdown relative to ujson)
  • Clean up the plots, save them to disk, and make them work over multiple ujson versions.

Submitting this now as it is starting to come together, and I'd be interested in feedback on adding what effectively is a benchmarking system to the repo. I'm thinking "benchmarker" can eventually become a standalone repo that is included as a benchmark dependency, but setting up and maintaining a separate repo is an endeavor, so if possible, I'd like to "pre-vendor" it here as a staging area where it can (1) be immediately useful and (2) prove itself / work out the kinks.

@codecov-commenter commented on May 26, 2022

Codecov Report

Merging #542 (fd951a3) into main (ebdb150) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##             main     #542   +/-   ##
=======================================
  Coverage   91.76%   91.76%           
=======================================
  Files           6        6           
  Lines        1821     1821           
=======================================
  Hits         1671     1671           
  Misses        150      150           

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ebdb150...fd951a3.

@bwoodsend (Collaborator) commented:

My goodness, you really are going for it.

> I'm thinking "benchmarker" can eventually become a standalone repo that is included as a benchmark dependency, but setting up and maintaining a separate repo is an endeavor, so if possible, I'd like to "pre-vendor" it here as a staging area where it can (1) be immediately useful and (2) prove itself / work out the kinks.

No problem (as long as you're not put off by our inconsistent punctuality w.r.t reviewing).

Just some remarks based on experience trying to benchmark something (namely this thing) with pretty much exactly the same criteria of comparing the speed of different tasks at varying scale with different libraries and the previous version of the same library - take them or leave them.

I generally found that taking the mean / standard deviation / significance testing, or any kind of stats, doesn't work for speed benchmarking. The collected times would be thousands of tiny times plus a very small number of enormous times where system interrupts or garbage collection happened, so a mean of several thousand samples really ends up measuring whether said interrupts happened 0, 1 or 2 times, which gives very poor reproducibility. The obvious fix for that is to brute-force run so many samples that even with the interrupts it all averages out nicely, but... if you leave a CPU-intensive process running long enough, you start to discover that the kernel keeps the CPU running slow by default to save power and then progressively steps it up when it becomes evident that there's a lot of processing to do, so when you look at the raw times you see several discrete steps. If you don't look at the raw times and only consider means / standard deviations, then you will observe that whichever test you run first is statistically significantly slower than anything run immediately afterwards - even if test 2 is just a rerun of test 1.

What I have found works surprisingly well is to throw away statistics entirely (along with the rest of my maths degree) and just do a very dense scatter plot of every single run: the x axis is size, the y axis is time, use different colors for different libraries, and set the point size to something tiny so that you can see the distribution of point densities.
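
Something along these lines (the DataFrame and column names are just illustrative, not tied to this PR's code):

```python
import matplotlib.pyplot as plt
import seaborn as sns

def plot_every_run(df):
    """df: one row per individual run, with columns 'size', 'time', 'impl'."""
    ax = sns.scatterplot(data=df, x='size', y='time', hue='impl',
                         s=2, alpha=0.4, linewidth=0)
    ax.set_xscale('log')
    ax.set_yscale('log')  # log-log keeps tiny and huge inputs both readable
    plt.show()
```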

(P.S. In terms of prettiness, plotly eats matplotlib+seaborn for breakfast.)

@JustAnotherArchivist (Collaborator) commented:

For numerical comparisons (e.g. automatic regression tests?), quantiles could be used, which would be insensitive to outliers. E.g. something like the 50th (= median) and 95th quantiles. But plotting the full distribution of run times seems fine to me as well and might expose regressions in weird edge cases.
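
A sketch of what such a check might look like (the 10% threshold here is arbitrary and purely illustrative):

```python
import numpy as np

def regressed(old_times, new_times, q=0.95, tolerance=1.10):
    """Flag a regression if the new q-quantile time is >10% worse than the old."""
    return np.quantile(new_times, q) > tolerance * np.quantile(old_times, q)
```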

@Erotemic (Contributor, Author) commented:

@bwoodsend wrt other widely used FOSS projects, 6 days is a pretty fast latency. 😄 This PR (astanin/python-tabulate#152) is the kind of wait time I'm used to.

The comments are helpful. It will take me time to digest some of the stats implications, but I'll add a bit to the motivation. I think it makes sense to have multiple ways of viewing the analysis, which is why I've chosen t-tests, OpenSkill, and line / scatter plot visualization. When they all agree, you can be more confident that the stats are being done right.

One more comment on the stats: I think that having access to multiple runs (by serializing the results of each run and then combining the stats) will show a clearer picture. It's possible to combine the means / stds of multiple runs grouped by some criteria. If the stats don't show what we see when we look at the graphs, then I'm doing the stats wrong; it's tricky stuff, so I suppose that's to be expected. I feel like I have a good handle on the stats tools I've chosen to use. I think I need some more work on the paired t-tests, but that is more important to my other projects than to this one.
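
For example, per-run summary stats can be pooled without keeping the raw times around; a minimal sketch (treating each run's std as a population std for simplicity, which is an assumption about how the per-run stats are stored):

```python
import math

def pool(runs):
    """runs: iterable of (n, mean, std) triples from separate benchmark runs."""
    total = sum(n for n, _, _ in runs)
    mean = sum(n * m for n, m, _ in runs) / total
    # Pool the second moments, then variance = E[x^2] - mean^2.
    ex2 = sum(n * (s ** 2 + m ** 2) for n, m, s in runs) / total
    return mean, math.sqrt(max(ex2 - mean ** 2, 0.0))

# e.g. pool([(100, 1.2e-4, 3.0e-5), (100, 1.3e-4, 4.0e-5)])
```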

I think adding the scatter plot on top of the line plots makes sense.

WRT plotly, it may be nicer, but seaborn is 100% free, so... tradeoffs.

Lastly, recent updates:

I've reproduced this table:

|                                       |   ('json', '2.0.9') |   ('nujson', '1.35.2') |   ('orjson', '3.6.8') |   ('simplejson', '3.17.6') |   ('ujson', '5.2.1.dev14') |
|:--------------------------------------|--------------------:|-----------------------:|----------------------:|---------------------------:|---------------------------:|
| ('Array of Dict[str, int]', 'dumps')  |           13,982.71 |              29,995.06 |            100,326.48 |                   6,950.44 |                  24,839.14 |
| ('Array of Dict[str, int]', 'loads')  |           18,991.77 |              21,741.87 |             26,222.97 |                  14,171.84 |                  24,309.13 |
| ('Array with True values', 'dumps')   |          144,493.71 |             208,263.19 |            674,555.22 |                 134,410.41 |                 242,835.84 |
| ('Array with True values', 'loads')   |          266,209.97 |             296,160.28 |            419,554.01 |                 247,701.54 |                 376,051.77 |
| ('Array with UTF-8 strings', 'dumps') |            6,114.59 |               9,556.98 |             30,493.62 |                   5,782.16 |                   9,659.15 |
| ('Array with UTF-8 strings', 'loads') |            3,568.77 |               3,754.10 |              2,491.08 |                     809.93 |                   3,835.32 |
| ('Array with doubles', 'dumps')       |            8,085.31 |              11,804.69 |            138,655.55 |                   7,818.05 |                  41,561.26 |
| ('Array with doubles', 'loads')       |           22,700.43 |              83,248.75 |             90,922.52 |                  20,647.65 |                  56,982.90 |
| ('Complex object', 'dumps')           |              800.34 |               1,091.65 |                nan    |                     795.30 |                   1,154.98 |
| ('Complex object', 'loads')           |              558.49 |                 804.03 |                nan    |                     302.57 |                     817.55 |

And I've added a speedup table:

|                                       |   ('json', '2.0.9') |   ('nujson', '1.35.2') |   ('orjson', '3.6.8') |   ('simplejson', '3.17.6') |   ('ujson', '5.2.1.dev14') |
|:--------------------------------------|--------------------:|-----------------------:|----------------------:|---------------------------:|---------------------------:|
| ('Array of Dict[str, int]', 'dumps')  |                   1 |                    2.1 |                   7.2 |                       0.5  |                        1.8 |
| ('Array of Dict[str, int]', 'loads')  |                   1 |                    1.2 |                   1.4 |                       0.75 |                        1.3 |
| ('Array with True values', 'dumps')   |                   1 |                    1.4 |                   4.8 |                       0.93 |                        1.7 |
| ('Array with True values', 'loads')   |                   1 |                    1.1 |                   1.6 |                       0.93 |                        1.4 |
| ('Array with UTF-8 strings', 'dumps') |                   1 |                    1.6 |                   5   |                       0.96 |                        1.6 |
| ('Array with UTF-8 strings', 'loads') |                   1 |                    1.1 |                   0.7 |                       0.23 |                        1.1 |
| ('Array with doubles', 'dumps')       |                   1 |                    1.5 |                  17   |                       0.97 |                        5.2 |
| ('Array with doubles', 'loads')       |                   1 |                    3.7 |                   4   |                       0.92 |                        2.5 |
| ('Complex object', 'dumps')           |                   1 |                    1.4 |                 nan   |                       0.99 |                        1.4 |
| ('Complex object', 'loads')           |                   1 |                    1.4 |                 nan   |                       0.54 |                        1.5 |
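
The speedup table is just the calls-per-second table divided column-wise by the json baseline; a small pandas sketch using a hypothetical slice of the numbers above:

```python
import pandas as pd

# A slice of the calls-per-second table above (tuple keys become MultiIndex columns).
calls_per_sec = pd.DataFrame(
    {('json', '2.0.9'):        [13982.71, 18991.77],
     ('orjson', '3.6.8'):      [100326.48, 26222.97],
     ('ujson', '5.2.1.dev14'): [24839.14, 24309.13]},
    index=[('Array of Dict[str, int]', 'dumps'),
           ('Array of Dict[str, int]', 'loads')],
)

# Divide every column by the stdlib-json baseline column, row by row.
baseline = ('json', '2.0.9')
speedup = calls_per_sec.div(calls_per_sec[baseline], axis=0).round(2)
print(speedup)
```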

To give a sense of how the scatter plot looks for one machine (note I ran the benchmarks under different conditions, which is something I do want to allow for):

[scatter plot screenshots]

I think it will be useful to overlay the scatter plot on top of the line plot. Unfortunately, seaborn does not have an option for this. I also have to rework the error intervals around each line. I think seaborn assumes the rows are all pointwise measures, but in this case it's useful to compress multiple stats into one row, so I need to tweak the plots a bit to reflect the actual std, and not the std of the means (which is what it currently shows).
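
One way to do the overlay manually is to point both seaborn calls at the same Axes; a sketch, with hypothetical summary / raw tables and column names rather than this PR's actual data model:

```python
import matplotlib.pyplot as plt
import seaborn as sns

def overlay(df_summary, df_raw):
    """df_summary: one row per (impl, size) with a 'mean_time' column;
    df_raw: one row per individual run with a 'time' column."""
    fig, ax = plt.subplots()
    sns.lineplot(data=df_summary, x='size', y='mean_time', hue='impl', ax=ax)
    sns.scatterplot(data=df_raw, x='size', y='time', hue='impl',
                    s=4, alpha=0.3, linewidth=0, legend=False, ax=ax)
    ax.set_xscale('log')
    ax.set_yscale('log')
    return fig
```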

@Erotemic (Contributor, Author) commented:

Notes to self:

@hugovk added the "changelog: Changed" label on Jan 17, 2024
@bwoodsend (Collaborator) commented:

Is this ready for review then?

@Erotemic (Contributor, Author) commented:

I've lost the thread on what I was doing here. I have a copy of result_analysis.py in a different project, noticed a bug there, and then fixed it in this copy as well. There were also some uncommitted changes that went into the last commit.

Looking at my original post, my goals for this are fairly ambitious, and I'm not sure I'll have the time to polish the full "run many times on different hardware and aggregate results" feature set. But if I restrict focus to just producing a single set of benchmarks (i.e. suitable for graphs that you might show in a README), I might have some time to ensure that has a nice UX.

So, no. It's not ready for review. But if you're still interested in the feature, I can try to find time to pick this PR back up.
