Benchmark stats v2 #542
base: main
Conversation
Codecov Report
@@           Coverage Diff           @@
##             main     #542   +/-  ##
=======================================
  Coverage   91.76%   91.76%
=======================================
  Files           6        6
  Lines        1821     1821
=======================================
  Hits         1671     1671
  Misses        150      150
=======================================
Continue to review full report at Codecov.
Force-pushed from de98a60 to 4d0f705.
for more information, see https://pre-commit.ci
My goodness, you really are going for it.
No problem (as long as you're not put off by our inconsistent punctuality w.r.t. reviewing). Just some remarks based on experience trying to benchmark something (namely this thing) with pretty much exactly the same criteria of comparing the speed of different tasks of varying scale with different libraries and the previous version of the same library; take them or leave them:

I generally found that taking the mean/standard deviation/significance testing or any kind of stats doesn't work for speed benchmarking. The collected times would be thousands of tiny times plus a very small number of enormous times where system interrupts or garbage collection happened, so a mean of several thousand samples really ends up measuring whether said interrupts happened 0, 1, or 2 times, which gives very poor reproducibility.

The obvious fix for that is to brute-force run so many samples that even with the interrupts it all averages out nicely, but... if you leave a CPU-intensive process running long enough you start to discover that the kernel keeps the CPU running slow by default to save power, then progressively steps it up when it becomes evident that it's got a lot of processing to do, so when you look at the raw times you see several discrete steps. If you don't look at the raw times and only consider means/standard deviations, then you will observe that whichever test you run first is statistically significantly slower than anything run immediately afterwards, even if test 2 is just a rerun of test 1.

What I have found works surprisingly well is to throw away statistics entirely (along with the rest of my maths degree) and just do a very dense scatter plot of every single run, i.e. x axis is

(P.S. In terms of prettiness, plotly eats matplotlib+seaborn for breakfast.)
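To illustrate the outlier sensitivity described above, here is a tiny sketch (the timing values are made up for illustration): a single interrupt-sized spike moves the mean by 50% while the median doesn't budge.

```python
import statistics

# 999 "clean" runs of 1.0 ms plus one 500 ms spike (e.g. a GC pause or interrupt).
times = [1.0] * 999 + [500.0]

mean = statistics.mean(times)      # 1.499: one sample inflates the mean by ~50%
median = statistics.median(times)  # 1.0: unaffected by the spike
```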
For numerical comparisons (e.g. automatic regression tests?), quantiles could be used, which would be insensitive to outliers, e.g. something like the 50th (= median) and 95th quantiles. But plotting the full distribution of run times seems fine to me as well and might expose regressions in weird edge cases.
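That idea needs nothing beyond the standard library; a minimal sketch with `statistics.quantiles` (the data here is synthetic stand-in values, not real timings):

```python
import statistics

# Synthetic "run times"; in practice these would be the collected samples.
times = list(range(1, 101))

# n=100 yields 99 cut points: index 49 is the median, index 94 the 95th percentile.
qs = statistics.quantiles(times, n=100, method="inclusive")
p50, p95 = qs[49], qs[94]
```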
@bwoodsend wrt other widely used foss projects, 6 days is a pretty fast latency. 😄 This PR (astanin/python-tabulate#152) is the wait time I'm used to.

The comments are helpful. It will take me time to digest some of the stats implications, but I'll add a bit to the motivation. I think it makes sense to have multiple ways of viewing the analysis, which is why I've chosen t-tests, openskill, and line/scatterplot visualization. When they all agree you can increase your confidence you are doing the stats right.

One more comment on the stats: I think having access to multiple runs, by serializing the results of each run and then combining the stats, will show a clearer picture. It's possible to combine the means / stds of multiple runs grouped by some criteria. If the stats don't show what we see when we look at the graphs, then I'm doing the stats wrong. It's tricky stuff, so I suppose that's to be expected. I feel like I have a good handle on the stats tools I've chosen to use. I think I need some more work on the paired t-tests, but that is more important to my other projects than this.

I think adding the scatter plot on top of the line plots makes sense. WRT plotly, it may be nicer, but seaborn is 100% free, so... tradeoffs.

Lastly, recent updates: I've reproduced this table:
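On combining means / stds of multiple runs: assuming each serialized run stores its sample count, mean, and (population) standard deviation per group, the summaries can be merged exactly via the second moment. A sketch (the function name and tuple layout are my own, not this PR's API):

```python
def combine_stats(groups):
    """Merge per-run (n, mean, std) summaries into one (n, mean, std).

    Uses population std: E[x^2] = std**2 + mean**2 per group, which is
    additive when weighted by sample count, so no raw samples are needed.
    """
    n_total = sum(n for n, _, _ in groups)
    mean = sum(n * m for n, m, _ in groups) / n_total
    second_moment = sum(n * (s**2 + m**2) for n, m, s in groups) / n_total
    std = (second_moment - mean**2) ** 0.5
    return n_total, mean, std

# Merging two runs gives the stats of the concatenated raw data:
# [1, 2] and [3, 4] -> mean 2.5, std sqrt(1.25).
merged = combine_stats([(2, 1.5, 0.5), (2, 3.5, 0.5)])
```

This is the distinction raised later about "the actual std, and not the std of the means": averaging the per-run stds would understate spread, whereas the pooled second moment recovers it exactly.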
And I've added a speedup table:
To give a sense of how the scatter plot looks for one machine (note I ran the benchmarks under different conditions, which is something I do want to allow for):

I think it will be useful to overlay the scatterplot on top of the line plot. Unfortunately seaborn does not have an option for this. I also have to rework the error intervals around each line. I think seaborn assumes the rows are all pointwise measures, but in this case it's useful to compress multiple stats into one row, so I need to tweak the plots a bit to reflect the actual std, and not the std of the means (which it currently does).
Notes to self:
Is this ready for review then?
I've lost the thread on what I was doing here. I have a copy of result_analysis.py in a different project, noticed a bug, and then fixed it in this place as well. There were also uncommitted changes that were in the last commit. Looking at my original post, my goals for this are fairly ambitious, and I'm not sure I'll have the time to polish the full "run many times on different hardware and aggregate results" feature set, but if I restrict focus to just producing a single set of benchmarks (i.e. suitable for graphs that you might show in a README), I might have some time to ensure that has a nice UX.

So, no. It's not ready for review. But if you're still interested in the feature, I can try to find time to pick this PR back up.
This PR supersedes #532 and is based off of main, so it does not include extra PRs.
This is still a work in progress, but it is coming along and starting to produce results. It is significantly more complex than the previous PR; however, that complexity is justified by several design goals, which are as follows:
To this end, I've started work on an experimental module I'm currently calling "benchmarker", which I've added to the test subdirectory. There is still a lot to clean up here, but the general structure is in place.
The benchmarker.py file contains a wrapper around timerit that makes it simpler to express benchmarks over a grid of varied parameters. The process_context collects information about the machine hardware / software so each benchmark knows the context in which it was run.
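To sketch the "grid of varied parameters" idea, here is a stdlib-only illustration using timeit rather than timerit; the function name and row format are invented for illustration and are not the benchmarker API:

```python
import itertools
import statistics
import timeit

def benchmark_grid(make_func, grid, repeat=5, number=100):
    """Time make_func(params)() over the cartesian product of grid values.

    grid maps parameter names to lists of values; one result row is
    produced per parameter combination.
    """
    rows = []
    for values in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        func = make_func(params)
        times = timeit.repeat(func, repeat=repeat, number=number)
        rows.append({**params, "min": min(times), "mean": statistics.mean(times)})
    return rows

# Example: vary input size for a trivial workload.
rows = benchmark_grid(
    lambda p: (lambda: sorted(range(p["size"]), reverse=True)),
    {"size": [10, 100, 1000]},
)
```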
The result_analysis.py file is ported from another project I'm working on that runs stats over a table of results. I was originally using this to compare hyperparameters wrt machine learning performance metrics, but it also applies when that performance metric is "time" and the hyperparameters are different libraries / inputs / settings. It's highly general.
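As one example of the kind of test involved: for matched samples (the same input timed under two library versions), the paired t statistic mentioned elsewhere in the thread reduces to the mean difference over its standard error. A hypothetical stdlib-only helper (real code would likely use scipy.stats.ttest_rel, which also returns a p-value):

```python
import math
import statistics

def paired_t(xs, ys):
    """t statistic for paired samples: mean of differences / its standard error."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / se

t = paired_t([1.0, 2.0, 4.0], [2.0, 3.0, 4.0])  # -> -2.0
```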
The util_json script holds json utilities I need to ensure the benchmarks are properly serialized. It may become removable as this PR matures and focuses on this use case.
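A common wrinkle such utilities address is types that json.dumps rejects by default; a sketch of the usual pattern (the types handled here are just examples, not necessarily what util_json covers):

```python
import json
import pathlib

def json_default(obj):
    """Fallback encoder for types the json module can't serialize natively."""
    if isinstance(obj, pathlib.PurePath):
        return str(obj)
    if isinstance(obj, set):
        return sorted(obj)
    raise TypeError(f"not JSON serializable: {type(obj)!r}")

record = {"out": pathlib.PurePosixPath("results/run1.json"), "tags": {"cpu", "ci"}}
text = json.dumps(record, default=json_default)
```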
The aggregate and visualize scripts will probably go away. I'm keeping them in for now as I continue development.
The script that uses "benchmarker" is currently called "benchmark3.py", and it will be a superset of what "benchmark.py" currently does.
Here is the current state of the visualization:
Current state of the statistics (currently marginalized over all sizes / inputs, but that can be refined):
And the OpenSkill analysis (which can be interpreted as the probability the chosen implementation / version will be fastest):
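OpenSkill fits a rating model to reach that probability, but the headline number can also be sanity-checked empirically from the raw runs. A simple cross-comparison sketch (this is not the OpenSkill algorithm, just an empirical win-rate estimate):

```python
import itertools

def pairwise_win_rates(times_by_impl):
    """Estimate P(a beats b) by comparing every run of a against every run of b."""
    rates = {}
    for a, b in itertools.combinations(times_by_impl, 2):
        wins = sum(
            ta < tb
            for ta in times_by_impl[a]
            for tb in times_by_impl[b]
        )
        rates[(a, b)] = wins / (len(times_by_impl[a]) * len(times_by_impl[b]))
    return rates

# Every v2 run beat every v1 run, so the estimated win probability is 1.0.
rates = pairwise_win_rates({"v2": [1.0, 1.1, 0.9], "v1": [1.2, 1.3, 1.4]})
```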
I still need to:
Submitting this now as it is starting to come together, and I'd be interested in feedback on adding what is effectively a benchmarking system to the repo. I'm thinking "benchmarker" can eventually become a standalone repo that is included as a benchmark dependency, but setting up and maintaining a separate repo is an endeavor, so if possible, I'd like to "pre-vendor" it here as a staging area where it can (1) be immediately useful and (2) prove itself / work out the kinks.