
Benchmark for JUnitTestCaseSorter::uniqueByTestFile #1177

Merged
merged 38 commits into infection:0.16 on Mar 24, 2020

Conversation

Member

@sanmai sanmai commented Mar 17, 2020

This PR:

  • Adds a benchmark for JUnitTestCaseSorter, based on live data
  • Adds optional bucket sort
  • Covered by tests
  • Is isset() required for PHP 7.4?
  • Standard benchmark profile

Looking at the Big O notation alone, bucket sort seems to be more effective in general, but for a smaller number of items and/or a larger number of buckets Quicksort is more effective, again judging by the notation only.

Here's a Big O graph for 23 buckets:

[Big O graph for 23 buckets]

Blue is for quicksort. Orange is for bucket sort.

For the 25 buckets we're aiming at, bucket sort becomes theoretically more effective after 15 elements. Considering that we don't do the second stage of intra-bucket sorting, this effectiveness boundary might be even lower for us, yet it is still some way off.
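As a rough illustration of that crossover, here is a minimal sketch (in Python purely for illustration; the project itself is PHP) comparing two simplified cost models, ignoring constant factors: roughly n·log2(n) comparisons for Quicksort versus roughly n + k operations for a bucket sort with k buckets and no intra-bucket sorting. The function names and the exact model are assumptions, not the project's code.

```python
import math

def quicksort_cost(n: int) -> float:
    # Simplified comparison-sort cost model: n * log2(n), constants ignored
    return n * math.log2(n)

def bucket_sort_cost(n: int, k: int) -> float:
    # Simplified single-pass bucket sort with k buckets and
    # no second-stage intra-bucket sorting: n + k, constants ignored
    return n + k

# Under this toy model with k = 25 buckets, Quicksort is cheaper for
# very small n, while bucket sort wins as n grows.
for n in (5, 15, 30):
    print(n, quicksort_cost(n), bucket_sort_cost(n, 25))
```

The exact crossover point depends heavily on the ignored constant factors, so this only illustrates the shape of the comparison, not the precise boundary.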

If we consider the distribution of overly tested mutations, we can see that our bucket sort will be applied in roughly 40% of cases for Psalm.

Bucket selection algorithm

Since the effectiveness of the algorithm is directly related to the number of buckets, and since our data is heavily skewed towards a mode at around 1/3 of a second, using a fixed number of buckets does not help.

[Test time distribution histogram]

Instead, we gradually lower the precision of our timings, starting at 1/8 of a second and later coarsening to 4 seconds. We also pre-sort the bucket array for the first second's worth of buckets. By computing bucket numbers directly we try to avoid looping over candidate bucket numbers.
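For illustration, the bucket numbers in the distribution table are consistent with a direct calculation like the following (a Python sketch, not the project's actual PHP implementation; `bucket_index` is a hypothetical name):

```python
def bucket_index(seconds: float) -> int:
    """Variable-precision bucketing: 1/8-second granularity below
    4 seconds, then 4-second granularity. Coarse bucket numbers are
    multiples of 32, so the two ranges never collide."""
    if seconds < 4.0:
        return int(seconds * 8)   # fine buckets 0..31
    return 32 * int(seconds / 4)  # coarse buckets 32, 64, 96, ...
```

For example, a 0.26-second test lands in bucket 2, a 1.06-second test in bucket 8, and a 45.48-second test in bucket 352, matching the table below.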

Example of a bucket distribution:

| Bucket | Test times |
|-------:|------------|
| 0 | 0.01, 0.08, 0.06 ... (everything under 1/8 of a second goes here) |
| 1 | (nothing fell here; skipping such buckets hereafter) |
| 2 | 0.26, 0.31, 0.28, 0.36, 0.35 |
| 4 | 0.56, 0.53 |
| 5 | 0.75 |
| 7 | 0.99, 0.95, 0.94 |
| 8 | 1.06 |
| 10 | 1.25 |
| 11 | 1.45, 1.44 |
| 14 | 1.8, 1.86 |
| 16 | 2.11, 2.05 |
| 18 | 2.28 |
| 19 | 2.45 |
| 23 | 2.93 |
| 24 | 3.03 |
| 25 | 3.17, 3.21 |
| 28 | 3.62, 3.53 |
| 30 | 3.75 |
| 32 | 7.81, 5.93, 4.42, 4.37, 4.22, 6.09, 4.27, 4.4, 6.75, 7.11, 4.93, 6.79, 6.1, 4.95, 6.66, 7.88, 7.69, 5.24, 5.8, 5.58, 5.99 |
| 64 | 8.57, 9.34, 8.92, 8.97, 8.22, 11.77, 8.61, 9.98, 9.13 |
| 96 | 15.25, 12.75, 12.48, 12.9, 13.7, 15.06, 14.85, 14.22, 15.08, 12.22 |
| 128 | 19.57, 17.9, 17.94, 18.89 |
| 160 | 20.73, 23.94, 22.72, 20.55, 20.31 |
| 192 | 26.77 |
| 224 | 29.58 |
| 256 | 32.26, 32.39 |
| 352 | 45.48, 46.91 |
Further optimization begs the following questions:

  • Should we care about tests longer than one second? E.g. if not, we could keep only a handful of buckets under one second and leave everything above one second unsorted.
  • Aren't we overfitting our optimization procedure to Psalm alone? It might be better to verify our approach against a larger dataset.
  • Is this really the bottleneck we should care about? (No doubt this was a fun exercise.)

Member

@theofidry theofidry left a comment


A few nitpicks; looks good to me otherwise.

@theofidry
Member

I would also love us to add a benchmark to profile the mutations -> mutant process, but I don't think it should hold up this PR either.

@sanmai
Member Author

sanmai commented Mar 20, 2020

I'd love to test it with the new benchmarks, but this isn't particularly easy because we test only so many mutations, not the whole range. I'll look into it a bit later anyway.

@sanmai sanmai marked this pull request as ready for review March 20, 2020 03:05
@maks-rafalko maks-rafalko added this to the 0.16.0 milestone Mar 20, 2020
Member

@maks-rafalko maks-rafalko left a comment


Interesting PR. I can't say I understood every single line here (which makes me think it will probably be hard for the average contributor to support this code without digging into the theory), but it seems like a very smart improvement.

  1. Does Psalm have many mutations with >15 tests that cover the mutated line?
  2. Is this diagram for Psalm?
  3. What is the performance win in % for Psalm?

@sanmai
Member Author

sanmai commented Mar 23, 2020

  1. 40% of Psalm's mutations have more than 15 tests. This is the graph.
  2. This graph is for Big O for quicksort and bucket sort.
  3. I cannot run a direct benchmark on Psalm because Blackfire requires an impossible amount of RAM that I don't have. Our standard benchmarks show no particular improvement, but no regression either. This might be because of insufficient depth.

The third point is the major one for me. Even if local benchmarks show an improvement, the global improvement might be below the noise threshold. I'd like to see the numbers, and to show them, but it appears to be hard for one reason or another.

sanmai and others added 3 commits March 23, 2020 17:47
@sanmai sanmai changed the base branch from master to 0.16 March 23, 2020 09:13
@sanmai sanmai merged commit c538af7 into infection:0.16 Mar 24, 2020
@sanmai sanmai deleted the pr/2020-03/test-timings branch March 24, 2020 00:13