Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement MergeBy::fold #920

Merged
merged 4 commits into from Apr 24, 2024

Conversation

kinto-b
Copy link
Contributor

@kinto-b kinto-b commented Apr 18, 2024

Relates to #755

cargo bench --bench specializations merge_by/fold

merge_by/fold           time:   [741.63 ns 743.35 ns 745.12 ns]
                        change: [-75.583% -75.488% -75.382%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

@kinto-b kinto-b marked this pull request as draft April 18, 2024 04:35
@kinto-b kinto-b marked this pull request as ready for review April 18, 2024 04:44
Copy link

codecov bot commented Apr 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.42%. Comparing base (6814180) to head (ca57605).
Report is 55 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #920      +/-   ##
==========================================
+ Coverage   94.38%   94.42%   +0.03%     
==========================================
  Files          48       48              
  Lines        6665     6872     +207     
==========================================
+ Hits         6291     6489     +198     
- Misses        374      383       +9     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@Philippe-Cholet Philippe-Cholet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-75% is a nice win!
I doubt my change below speed things even more but could you re-run the benchmark?


Side note: For Combinations::nth, I read part of criterion doc for ploting purposes (curiousity) but I also read that:

  • To save a baseline, use cargo bench -- --save-baseline <name>. To compare against an existing baseline, use cargo bench -- --baseline <name>.

I did not use this yet, but I guess we can:
Once feature branch created, cargo bench --bench specializations merge_by/fold -- --save-baseline mergeby
Then we can do this as many times as we want without overwriting the baseline: cargo bench --bench specializations merge_by/fold -- --baseline mergeby
Just a trick.

src/merge_join.rs Outdated Show resolved Hide resolved
@Philippe-Cholet Philippe-Cholet added this to the next milestone Apr 18, 2024
@kinto-b
Copy link
Contributor Author

kinto-b commented Apr 18, 2024

Oh saving the baseline is so useful, thanks for pointing that out! Compared to master with the change above it actually gets a little slower

$ cargo bench --bench specializations merge_by/fold -- --baseline merge_by

merge_by/fold           time:   [1.0928 µs 1.1066 µs 1.1247 µs]
                        change: [-65.005% -64.657% -64.310%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe

I've noticed in the past that match can often be much faster than comparable expressions, so perhaps the compiler has a better time with the more verbose code.

Let me know if you want me to push the patch or leave as is.

@Philippe-Cholet
Copy link
Member

I wondered, I checked out your branch, experimented, benchmarked but still wonder.


I eventually noticed that the benchmark I wrote only has zeros. Apart from being unrealistic which can be fine, I guess it repeatedly uses only one arm of the big match statement which is more problematic.
I changed vec![0; X]s to (0..X).collect_vec()s in the four related benchmarks to have more realistic use of the different code branches.
Please update this.

merge {
{
let v1 = black_box(vec![0; 1024]);
let v2 = black_box(vec![0; 768]);
}
v1.iter().merge(&v2)
}
merge_by {
{
let v1 = black_box(vec![0; 1024]);
let v2 = black_box(vec![0; 768]);
}
v1.iter().merge_by(&v2, PartialOrd::ge)
}
merge_join_by_ordering {
{
let v1 = black_box(vec![0; 1024]);
let v2 = black_box(vec![0; 768]);
}
v1.iter().merge_join_by(&v2, Ord::cmp)
}
merge_join_by_bool {
{
let v1 = black_box(vec![0; 1024]);
let v2 = black_box(vec![0; 768]);
}
v1.iter().merge_join_by(&v2, PartialOrd::ge)
}


cargo bench --bench specializations "^merge.*/fold" -- --baseline merge
# Yes, this iterator implementation is used by 4 itertools adaptors.

New baseline (with `(0..X).collect_vec()`s):
merge/fold                   [3.2654 µs 3.2811 µs 3.3028 µs]
merge_by/fold                [4.6043 µs 4.6183 µs 4.6351 µs]
merge_join_by_ordering/fold  [2.4700 µs 2.4762 µs 2.4842 µs]
merge_join_by_bool/fold      [3.3230 µs 3.3309 µs 3.3393 µs]

Your big match statement:
merge/fold                   [1.2520 µs1 1.2536 µs 1.2553 µs] [-61.495% -60.671% -59.636%]
merge_by/fold                [867.37 ns 868.89 ns 871.06 ns] [-81.308% -81.225% -81.142%]
merge_join_by_ordering/fold  [1.3648 µs 1.3709 µs 1.3816 µs] [-45.770% -45.250% -44.809%]
merge_join_by_bool/fold      [1.3348 µs 1.3394 µs 1.3443 µs] [-60.262% -60.076% -59.903%]

My shorter code:
merge/fold                   [1.6499 µs 1.6590 µs 1.6703 µs] [-49.632% -49.376% -49.111%]
merge_by/fold                [1.0078 µs 1.0100 µs 1.0125 µs] [-78.204% -78.098% -77.993%]
merge_join_by_ordering/fold  [1.5552 µs 1.5570 µs 1.5588 µs] [-38.101% -37.507% -37.015%]
merge_join_by_bool/fold      [1.3369 µs 1.3386 µs 1.3406 µs] [-60.141% -59.983% -59.840%]

The thing that bothers me is that our benchmark is still quite basic: adapted iterators are iterated slices, function f simple integer comparison. I did tend to think that avoid duplication of f, left.next, right.next calls might be beneficial in the general case, but maybe I'm just plain wrong though.

I would like a more experienced opinion here.
@phimuemue What do you think?

Copy link
Member

@phimuemue phimuemue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have this, couldn't this eliminate other specializations (e.g. count)?

The thing that bothers me is that our benchmark is still quite basic: adapted iterators are iterated slices, function f simple integer comparison. [...]

I think you're right. It's probably hard to get a really good estimate of a function's performance if you do not know the exact use.

But I am not sure how to reliably improve this. Should we construct benchmarks such that they generally execute many different code paths (possibly even so (almost) each iteration chooses another branch)? Have them operate on structs whose methods are inline(never) instead of on integers? Maybe look for usage in the wild, and boil down these real-world cases to their benchmarkable essence?

@kinto-b
Copy link
Contributor Author

kinto-b commented Apr 23, 2024

@phimuemue Yep, good call. The speed-up is pretty large (n.b. timings are using @Philippe-Cholet's modification). Happy to submit a separate PR removing these specialisations or add it to this one. Whatever y'all prefer

.count()

$ cargo bench --bench specializations "^merge.*/count"

merge/count             time:   [468.60 ns 474.97 ns 483.54 ns]
                        change: [-72.222% -71.218% -70.192%] (p = 0.00 < 0.05)
                        Performance has improved.

merge_by/count          time:   [687.13 ns 691.60 ns 697.88 ns]
                        change: [-59.070% -58.742% -58.424%] (p = 0.00 < 0.05)
                        Performance has improved.

merge_join_by_ordering/count
                        time:   [1.2102 µs 1.2263 µs 1.2487 µs]
                        change: [+6.6034% +8.7574% +11.219%] (p = 0.00 < 0.05)
                        Performance has regressed.

merge_join_by_bool/count
                        time:   [696.11 ns 710.89 ns 729.69 ns]
                        change: [-59.560% -58.013% -56.655%] (p = 0.00 < 0.05)
                        Performance has improved.

.last()

$ cargo bench --bench specializations "^merge.*/last"

merge/last              time:   [298.07 ns 304.83 ns 312.70 ns]
                        change: [-87.829% -87.624% -87.408%] (p = 0.00 < 0.05)
                        Performance has improved.

merge_by/last           time:   [291.02 ns 292.75 ns 294.93 ns]
                        change: [-87.791% -87.594% -87.380%] (p = 0.00 < 0.05)
                        Performance has improved.

merge_join_by_ordering/last
                        time:   [1.1109 µs 1.1148 µs 1.1199 µs]
                        change: [-30.084% -28.883% -27.828%] (p = 0.00 < 0.05)
                        Performance has improved.

merge_join_by_bool/last time:   [573.23 ns 576.44 ns 580.60 ns]
                        change: [-70.422% -70.058% -69.739%] (p = 0.00 < 0.05)
                        Performance has improved.

@Philippe-Cholet
Copy link
Member

Philippe-Cholet commented Apr 23, 2024

Honestly, I'm not sure what's the point of getting the last or count of "merge two iterators": the count is basically iter1.count() + iter2.count() and last is either iter1.last() or iter2.last() based on how it's merged... except such thing change the consumption order which matters (only) for iterators with side-effects. (EDIT: No I'm dumb, if items are EitherOrBoth<L, R>, then the count can be less than count1+count2.)
It pains me but the some_iter.into_parts().1.count() part can be faster than with fold if the underlying iterators specialize count/last. But I'm tempted anyway to remove these two specializations with fold being specialized. It should be good enough.

I imagine this PR in 4 commits:

  1. Update MergeBy-related benchmarks
    I think (0..X).collect_vec()s are good enough because merge two of them interleave them and not just join them. (Then the benchmark baseline is here.)
  2. Specialize MergeBy::fold
    We need to decide so let's arbitrarly go with the verbose big match with 4 arms (one unreachable) and without wildcards. It can be changed later if there is evidence my version is better in more complicated situations. Problem with .or_else is that it's tied to (Option, Option, ..) returned by OrderingOrBool::merge.
  3. Remove MergeBy::{count,last} or not?! @phimuemue (EDIT: yes remove)
  4. Try to make "OrderingOrBool::merge returns (Option<Either<L, R>>, ..)" to discard the unreachable case.
    If you struggle too much with this, I might do it in a next PR.

@phimuemue
Copy link
Member

Sounds like a plan, @Philippe-Cholet.

Remove MergeBy::{count,last} or not?! @phimuemue

Let's remove it and comment that we could further specialize count for the respective variants.

@kinto-b
Copy link
Contributor Author

kinto-b commented Apr 23, 2024

Easy, I'll sort it out tomorrow. Thanks both :)

It is faster to rely on new `MergeBy::fold` implementation
Return `(Option<Either<L, R>>, MergeResult)` instead of `(Option<L>, Option<R>, MergeResult)`
@kinto-b
Copy link
Contributor Author

kinto-b commented Apr 24, 2024

Ok sorted. Simplifying OrderingBool::merge didn't impact fold performance

Copy link
Member

@Philippe-Cholet Philippe-Cholet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!
I'm still surprised that last and count are that faster as I expected their (old) specializations to be roughly as fast as "their default with fold specialized".

EDIT: Could not resist running benchmarks. Well, there is no doubt.

                             with last/count specialized     with only fold specialized
merge/count                  [1.7842 µs 1.7871 µs 1.7902 µs] [1.1981 µs 1.2033 µs 1.2103 µs] [-34.336% -32.623% -30.803%]
merge/last                   [2.1839 µs 2.1916 µs 2.2010 µs] [600.16 ns 602.19 ns 604.29 ns] [-73.024% -72.604% -72.156%]
merge_by/count               [1.7115 µs 1.7157 µs 1.7208 µs] [808.16 ns 812.90 ns 818.16 ns] [-53.581% -53.304% -53.010%]
merge_by/last                [1.5753 µs 1.5793 µs 1.5843 µs] [277.42 ns 278.06 ns 278.84 ns] [-82.877% -82.533% -82.194%]
merge_join_by_ordering/count [1.3395 µs 1.3439 µs 1.3485 µs] [1.0581 µs 1.0762 µs 1.0997 µs] [-21.419% -20.487% -19.520%]
merge_join_by_ordering/last  [1.6069 µs 1.6134 µs 1.6200 µs] [1.0075 µs 1.0103 µs 1.0135 µs] [-37.745% -37.080% -36.476%]
merge_join_by_bool/count     [1.7556 µs 1.7595 µs 1.7634 µs] [805.21 ns 807.31 ns 809.63 ns] [-54.132% -53.966% -53.797%]
merge_join_by_bool/last      [1.6153 µs 1.6207 µs 1.6266 µs] [543.29 ns 544.50 ns 545.80 ns] [-66.382% -66.234% -66.101%]

@Philippe-Cholet Philippe-Cholet added this pull request to the merge queue Apr 24, 2024
Merged via the queue into rust-itertools:master with commit 2f53aea Apr 24, 2024
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants