Implement `MergeBy::fold` #920

kinto-b · 2024-04-18T04:29:57Z

Relates to #755

cargo bench --bench specializations merge_by/fold

merge_by/fold           time:   [741.63 ns 743.35 ns 745.12 ns]
                        change: [-75.583% -75.488% -75.382%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

codecov · 2024-04-18T04:44:57Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.42%. Comparing base (6814180) to head (ca57605).
Report is 55 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #920      +/-   ##
==========================================
+ Coverage   94.38%   94.42%   +0.03%     
==========================================
  Files          48       48              
  Lines        6665     6872     +207     
==========================================
+ Hits         6291     6489     +198     
- Misses        374      383       +9

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Philippe-Cholet

-75% is a nice win!
I doubt my change below speed things even more but could you re-run the benchmark?

Side note: For Combinations::nth, I read part of criterion doc for ploting purposes (curiousity) but I also read that:

To save a baseline, use cargo bench -- --save-baseline <name>. To compare against an existing baseline, use cargo bench -- --baseline <name>.

I did not use this yet, but I guess we can:
Once feature branch created, cargo bench --bench specializations merge_by/fold -- --save-baseline mergeby
Then we can do this as many times as we want without overwriting the baseline: cargo bench --bench specializations merge_by/fold -- --baseline mergeby
Just a trick.

src/merge_join.rs

kinto-b · 2024-04-18T07:38:10Z

Oh saving the baseline is so useful, thanks for pointing that out! Compared to master with the change above it actually gets a little slower

$ cargo bench --bench specializations merge_by/fold -- --baseline merge_by

merge_by/fold           time:   [1.0928 µs 1.1066 µs 1.1247 µs]
                        change: [-65.005% -64.657% -64.310%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe

I've noticed in the past that match can often be much faster than comparable expressions, so perhaps the compiler has a better time with the more verbose code.

Let me know if you want me to push the patch or leave as is.

Philippe-Cholet · 2024-04-18T10:39:03Z

I wondered, I checked out your branch, experimented, benchmarked but still wonder.

I eventually noticed that the benchmark I wrote only has zeros. Apart from being unrealistic which can be fine, I guess it repeatedly uses only one arm of the big match statement which is more problematic.
I changed vec![0; X]s to (0..X).collect_vec()s in the four related benchmarks to have more realistic use of the different code branches.
Please update this.

itertools/benches/specializations.rs

Lines 583 to 610 in f7d557d

    
           merge { 
        
               { 
        
                   let v1 = black_box(vec![0; 1024]); 
        
                   let v2 = black_box(vec![0; 768]); 
        
               } 
        
               v1.iter().merge(&v2) 
        
           } 
        
           merge_by { 
        
               { 
        
                   let v1 = black_box(vec![0; 1024]); 
        
                   let v2 = black_box(vec![0; 768]); 
        
               } 
        
               v1.iter().merge_by(&v2, PartialOrd::ge) 
        
           } 
        
           merge_join_by_ordering { 
        
               { 
        
                   let v1 = black_box(vec![0; 1024]); 
        
                   let v2 = black_box(vec![0; 768]); 
        
               } 
        
               v1.iter().merge_join_by(&v2, Ord::cmp) 
        
           } 
        
           merge_join_by_bool { 
        
               { 
        
                   let v1 = black_box(vec![0; 1024]); 
        
                   let v2 = black_box(vec![0; 768]); 
        
               } 
        
               v1.iter().merge_join_by(&v2, PartialOrd::ge) 
        
           }

cargo bench --bench specializations "^merge.*/fold" -- --baseline merge
# Yes, this iterator implementation is used by 4 itertools adaptors.

New baseline (with `(0..X).collect_vec()`s):
merge/fold                   [3.2654 µs 3.2811 µs 3.3028 µs]
merge_by/fold                [4.6043 µs 4.6183 µs 4.6351 µs]
merge_join_by_ordering/fold  [2.4700 µs 2.4762 µs 2.4842 µs]
merge_join_by_bool/fold      [3.3230 µs 3.3309 µs 3.3393 µs]

Your big match statement:
merge/fold                   [1.2520 µs1 1.2536 µs 1.2553 µs] [-61.495% -60.671% -59.636%]
merge_by/fold                [867.37 ns 868.89 ns 871.06 ns] [-81.308% -81.225% -81.142%]
merge_join_by_ordering/fold  [1.3648 µs 1.3709 µs 1.3816 µs] [-45.770% -45.250% -44.809%]
merge_join_by_bool/fold      [1.3348 µs 1.3394 µs 1.3443 µs] [-60.262% -60.076% -59.903%]

My shorter code:
merge/fold                   [1.6499 µs 1.6590 µs 1.6703 µs] [-49.632% -49.376% -49.111%]
merge_by/fold                [1.0078 µs 1.0100 µs 1.0125 µs] [-78.204% -78.098% -77.993%]
merge_join_by_ordering/fold  [1.5552 µs 1.5570 µs 1.5588 µs] [-38.101% -37.507% -37.015%]
merge_join_by_bool/fold      [1.3369 µs 1.3386 µs 1.3406 µs] [-60.141% -59.983% -59.840%]

The thing that bothers me is that our benchmark is still quite basic: adapted iterators are iterated slices, function f simple integer comparison. I did tend to think that avoid duplication of f, left.next, right.next calls might be beneficial in the general case, but maybe I'm just plain wrong though.

I would like a more experienced opinion here.
@phimuemue What do you think?

phimuemue

If we have this, couldn't this eliminate other specializations (e.g. count)?

The thing that bothers me is that our benchmark is still quite basic: adapted iterators are iterated slices, function f simple integer comparison. [...]

I think you're right. It's probably hard to get a really good estimate of a function's performance if you do not know the exact use.

But I am not sure how to reliably improve this. Should we construct benchmarks such that they generally execute many different code paths (possibly even so (almost) each iteration chooses another branch)? Have them operate on structs whose methods are inline(never) instead of on integers? Maybe look for usage in the wild, and boil down these real-world cases to their benchmarkable essence?

kinto-b · 2024-04-23T00:56:48Z

@phimuemue Yep, good call. The speed-up is pretty large (n.b. timings are using @Philippe-Cholet's modification). Happy to submit a separate PR removing these specialisations or add it to this one. Whatever y'all prefer

`.count()`

$ cargo bench --bench specializations "^merge.*/count"

merge/count             time:   [468.60 ns 474.97 ns 483.54 ns]
                        change: [-72.222% -71.218% -70.192%] (p = 0.00 < 0.05)
                        Performance has improved.

merge_by/count          time:   [687.13 ns 691.60 ns 697.88 ns]
                        change: [-59.070% -58.742% -58.424%] (p = 0.00 < 0.05)
                        Performance has improved.

merge_join_by_ordering/count
                        time:   [1.2102 µs 1.2263 µs 1.2487 µs]
                        change: [+6.6034% +8.7574% +11.219%] (p = 0.00 < 0.05)
                        Performance has regressed.

merge_join_by_bool/count
                        time:   [696.11 ns 710.89 ns 729.69 ns]
                        change: [-59.560% -58.013% -56.655%] (p = 0.00 < 0.05)
                        Performance has improved.

`.last()`

$ cargo bench --bench specializations "^merge.*/last"

merge/last              time:   [298.07 ns 304.83 ns 312.70 ns]
                        change: [-87.829% -87.624% -87.408%] (p = 0.00 < 0.05)
                        Performance has improved.

merge_by/last           time:   [291.02 ns 292.75 ns 294.93 ns]
                        change: [-87.791% -87.594% -87.380%] (p = 0.00 < 0.05)
                        Performance has improved.

merge_join_by_ordering/last
                        time:   [1.1109 µs 1.1148 µs 1.1199 µs]
                        change: [-30.084% -28.883% -27.828%] (p = 0.00 < 0.05)
                        Performance has improved.

merge_join_by_bool/last time:   [573.23 ns 576.44 ns 580.60 ns]
                        change: [-70.422% -70.058% -69.739%] (p = 0.00 < 0.05)
                        Performance has improved.

Philippe-Cholet · 2024-04-23T08:08:02Z

Honestly, I'm not sure what's the point of getting the last or count of "merge two iterators": the count is basically iter1.count() + iter2.count() and last is either iter1.last() or iter2.last() based on how it's merged... except such thing change the consumption order which matters (only) for iterators with side-effects. (EDIT: No I'm dumb, if items are EitherOrBoth<L, R>, then the count can be less than count1+count2.)
It pains me but the some_iter.into_parts().1.count() part can be faster than with fold if the underlying iterators specialize count/last. But I'm tempted anyway to remove these two specializations with fold being specialized. It should be good enough.

I imagine this PR in 4 commits:

Update MergeBy-related benchmarks
I think (0..X).collect_vec()s are good enough because merge two of them interleave them and not just join them. (Then the benchmark baseline is here.)
Specialize MergeBy::fold
We need to decide so let's arbitrarly go with the verbose big match with 4 arms (one unreachable) and without wildcards. It can be changed later if there is evidence my version is better in more complicated situations. Problem with .or_else is that it's tied to (Option, Option, ..) returned by OrderingOrBool::merge.
Remove MergeBy::{count,last} or not?! @phimuemue (EDIT: yes remove)
Try to make "OrderingOrBool::merge returns (Option<Either<L, R>>, ..)" to discard the unreachable case.
If you struggle too much with this, I might do it in a next PR.

phimuemue · 2024-04-23T08:38:21Z

Sounds like a plan, @Philippe-Cholet.

Remove MergeBy::{count,last} or not?! @phimuemue

Let's remove it and comment that we could further specialize count for the respective variants.

kinto-b · 2024-04-23T09:15:28Z

Easy, I'll sort it out tomorrow. Thanks both :)

It is faster to rely on new `MergeBy::fold` implementation

Return `(Option<Either<L, R>>, MergeResult)` instead of `(Option<L>, Option<R>, MergeResult)`

kinto-b · 2024-04-24T00:23:36Z

Ok sorted. Simplifying OrderingBool::merge didn't impact fold performance

Philippe-Cholet

Nice work!
I'm still surprised that last and count are that faster as I expected their (old) specializations to be roughly as fast as "their default with fold specialized".

EDIT: Could not resist running benchmarks. Well, there is no doubt.

                             with last/count specialized     with only fold specialized
merge/count                  [1.7842 µs 1.7871 µs 1.7902 µs] [1.1981 µs 1.2033 µs 1.2103 µs] [-34.336% -32.623% -30.803%]
merge/last                   [2.1839 µs 2.1916 µs 2.2010 µs] [600.16 ns 602.19 ns 604.29 ns] [-73.024% -72.604% -72.156%]
merge_by/count               [1.7115 µs 1.7157 µs 1.7208 µs] [808.16 ns 812.90 ns 818.16 ns] [-53.581% -53.304% -53.010%]
merge_by/last                [1.5753 µs 1.5793 µs 1.5843 µs] [277.42 ns 278.06 ns 278.84 ns] [-82.877% -82.533% -82.194%]
merge_join_by_ordering/count [1.3395 µs 1.3439 µs 1.3485 µs] [1.0581 µs 1.0762 µs 1.0997 µs] [-21.419% -20.487% -19.520%]
merge_join_by_ordering/last  [1.6069 µs 1.6134 µs 1.6200 µs] [1.0075 µs 1.0103 µs 1.0135 µs] [-37.745% -37.080% -36.476%]
merge_join_by_bool/count     [1.7556 µs 1.7595 µs 1.7634 µs] [805.21 ns 807.31 ns 809.63 ns] [-54.132% -53.966% -53.797%]
merge_join_by_bool/last      [1.6153 µs 1.6207 µs 1.6266 µs] [543.29 ns 544.50 ns 545.80 ns] [-66.382% -66.234% -66.101%]

kinto-b marked this pull request as draft April 18, 2024 04:35

kinto-b force-pushed the feature/MergeBy-fold branch from ecb2cd2 to 0dfd502 Compare April 18, 2024 04:42

kinto-b marked this pull request as ready for review April 18, 2024 04:44

Philippe-Cholet requested changes Apr 18, 2024

View reviewed changes

src/merge_join.rs Outdated Show resolved Hide resolved

Philippe-Cholet added this to the next milestone Apr 18, 2024

Philippe-Cholet requested a review from phimuemue April 22, 2024 09:40

phimuemue reviewed Apr 22, 2024

View reviewed changes

Implement MergeBy::fold

e8b0daa

kinto-b force-pushed the feature/MergeBy-fold branch from 0dfd502 to e8b0daa Compare April 23, 2024 23:51

kinto-b added 3 commits April 24, 2024 09:59

Make merge specialization benchmark more realistic

1955923

Remove MergeBy::last and MergeBy::count

752f021

It is faster to rely on new `MergeBy::fold` implementation

Simplify OrderingOrBool::merge

ca57605

Return `(Option<Either<L, R>>, MergeResult)` instead of `(Option<L>, Option<R>, MergeResult)`

Philippe-Cholet approved these changes Apr 24, 2024

View reviewed changes

Philippe-Cholet added this pull request to the merge queue Apr 24, 2024

Merged via the queue into rust-itertools:master with commit 2f53aea Apr 24, 2024
13 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement `MergeBy::fold` #920

Implement `MergeBy::fold` #920

kinto-b commented Apr 18, 2024

codecov bot commented Apr 18, 2024 •

edited

Philippe-Cholet left a comment

kinto-b commented Apr 18, 2024

Philippe-Cholet commented Apr 18, 2024

phimuemue left a comment

kinto-b commented Apr 23, 2024 •

edited

Philippe-Cholet commented Apr 23, 2024 •

edited

phimuemue commented Apr 23, 2024

kinto-b commented Apr 23, 2024

kinto-b commented Apr 24, 2024

Philippe-Cholet left a comment •

edited

Implement MergeBy::fold #920

Implement MergeBy::fold #920

Conversation

kinto-b commented Apr 18, 2024

codecov bot commented Apr 18, 2024 • edited

Codecov Report

Philippe-Cholet left a comment

Choose a reason for hiding this comment

kinto-b commented Apr 18, 2024

Philippe-Cholet commented Apr 18, 2024

phimuemue left a comment

Choose a reason for hiding this comment

kinto-b commented Apr 23, 2024 • edited

.count()

.last()

Philippe-Cholet commented Apr 23, 2024 • edited

phimuemue commented Apr 23, 2024

kinto-b commented Apr 23, 2024

kinto-b commented Apr 24, 2024

Philippe-Cholet left a comment • edited

Choose a reason for hiding this comment

Implement `MergeBy::fold` #920

Implement `MergeBy::fold` #920

codecov bot commented Apr 18, 2024 •

edited

kinto-b commented Apr 23, 2024 •

edited

`.count()`

`.last()`

Philippe-Cholet commented Apr 23, 2024 •

edited

Philippe-Cholet left a comment •

edited