Simplify the iterator adaptive splitting strategy #857

Open · wants to merge 1 commit into base: main

Conversation

@cuviper (Member) commented May 13, 2021

Before, when an iterator job was stolen, we would reset the split count
all the way back to `current_num_threads` to adaptively split jobs more
aggressively when threads seem to need more work. This ends up splitting
far more than many people expect, especially in the tail end of
a computation when threads are fighting over what's left. Excess
splitting can also be harmful for things like `fold` or `map_with` that
want to share state as much as possible.

We can get a much lazier "adaptive" effect by simply not updating the
split count when we split a stolen job, effectively giving it only one
extra boost of splitting.
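
For readers unfamiliar with the mechanism, here is a rough sketch of the split-budget logic being discussed. The names and structure are illustrative only, not the actual rayon source: each job carries a budget that is halved on every ordinary split, and the proposal only changes what happens in the stolen branch.

    struct Splitter {
        splits: usize, // remaining split budget for this job
    }

    impl Splitter {
        fn new(num_threads: usize) -> Self {
            Splitter { splits: num_threads }
        }

        // `stolen` is true when this job was taken from another thread's deque.
        fn try_split(&mut self, stolen: bool, num_threads: usize) -> bool {
            if stolen {
                // Old strategy: reset the budget so the stolen job splits
                // aggressively all over again:
                //     self.splits = num_threads;
                // Proposed strategy: split this one time, but leave the budget
                // untouched so the boost does not compound.
                let _ = num_threads;
                true
            } else if self.splits > 0 {
                self.splits /= 2;
                true
            } else {
                false
            }
        }
    }

    fn main() {
        let mut s = Splitter::new(8);
        // An un-stolen job halves its budget on each split: 8 -> 4 -> 2 -> 1 -> 0.
        while s.try_split(false, 8) {}
        // A stolen job still gets one more split, but no reset of the budget.
        assert!(s.try_split(true, 8));
        assert!(!s.try_split(false, 8));
    }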

@cuviper (Member, Author) commented May 13, 2021

Here are my benchmark results with a 5% threshold. It's rather mixed at the extremes, so I'm not sure how to evaluate this...

 name                                                                split-num-threads ns/iter  split-skip ns/iter     diff ns/iter   diff %  speedup
 factorial::factorial_par_iter                                       785,857                    744,186                     -41,671   -5.30%   x 1.06
 factorial::factorial_recursion                                      1,312,877                  1,393,096                    80,219    6.11%   x 0.94
 fibonacci::fibonacci_split_iterative                                79,043                     66,834                      -12,209  -15.45%   x 1.18
 fibonacci::fibonacci_split_recursive                                597,931                    536,831                     -61,100  -10.22%   x 1.11
 join_microbench::increment_all                                      62,094                     51,761                      -10,333  -16.64%   x 1.20
 join_microbench::increment_all_max                                  73,677                     181,079                     107,402  145.77%   x 0.41
 life::bench::par_iter_generations                                   11,345,270                 9,244,708                -2,100,562  -18.51%   x 1.23
 map_collect::i_mod_10_to_i::with_collect                            4,321,277                  3,916,068                  -405,209   -9.38%   x 1.10
 map_collect::i_mod_10_to_i::with_fold                               683,736                    506,218                    -177,518  -25.96%   x 1.35
 map_collect::i_mod_10_to_i::with_fold_vec                           819,674                    557,608                    -262,066  -31.97%   x 1.47
 map_collect::i_mod_10_to_i::with_linked_list_collect_vec            4,286,532                  3,917,595                  -368,937   -8.61%   x 1.09
 map_collect::i_mod_10_to_i::with_linked_list_collect_vec_sized      4,354,821                  3,908,441                  -446,380  -10.25%   x 1.11
 map_collect::i_mod_10_to_i::with_linked_list_map_reduce_vec_sized   4,404,012                  3,931,468                  -472,544  -10.73%   x 1.12
 map_collect::i_mod_10_to_i::with_mutex_vec                          11,913,779                 9,188,517                -2,725,262  -22.87%   x 1.30
 map_collect::i_mod_10_to_i::with_vec_vec_sized                      4,601,016                  4,026,035                  -574,981  -12.50%   x 1.14
 map_collect::i_to_i::with_collect                                   6,957,727                  6,354,939                  -602,788   -8.66%   x 1.09
 map_collect::i_to_i::with_fold_vec                                  38,067,101                 35,491,125               -2,575,976   -6.77%   x 1.07
 map_collect::i_to_i::with_linked_list_collect_vec_sized             6,840,195                  6,383,791                  -456,404   -6.67%   x 1.07
 map_collect::i_to_i::with_linked_list_map_reduce_vec_sized          6,901,355                  6,313,163                  -588,192   -8.52%   x 1.09
 map_collect::i_to_i::with_mutex_vec                                 36,533,047                 33,202,140               -3,330,907   -9.12%   x 1.10
 map_collect::i_to_i::with_vec_vec_sized                             7,247,123                  6,418,205                  -828,918  -11.44%   x 1.13
 nbody::bench::nbody_parreduce                                       9,854,831                  11,379,041                1,524,210   15.47%   x 0.87
 pythagoras::euclid_parallel_one                                     2,877,928                  3,144,155                   266,227    9.25%   x 0.92
 pythagoras::euclid_parallel_weightless                              2,878,436                  3,131,320                   252,884    8.79%   x 0.92
 quicksort::bench::quick_sort_splitter                               5,588,636                  13,225,091                7,636,455  136.64%   x 0.42
 sort::demo_merge_sort_ascending                                     102,956 (3885 MB/s)        114,411 (3496 MB/s)          11,455   11.13%   x 0.90
 sort::demo_merge_sort_big                                           6,817,448 (938 MB/s)       6,112,125 (1047 MB/s)      -705,323  -10.35%   x 1.12
 sort::demo_merge_sort_descending                                    110,011 (3636 MB/s)        201,164 (1988 MB/s)          91,153   82.86%   x 0.55
 sort::demo_merge_sort_mostly_ascending                              258,798 (1545 MB/s)        462,874 (864 MB/s)          204,076   78.86%   x 0.56
 sort::demo_merge_sort_mostly_descending                             255,883 (1563 MB/s)        481,346 (831 MB/s)          225,463   88.11%   x 0.53
 sort::demo_quick_sort_big                                           3,858,490 (1658 MB/s)      2,992,156 (2138 MB/s)      -866,334  -22.45%   x 1.29
 sort::demo_quick_sort_strings                                       3,346,042 (239 MB/s)       3,122,390 (256 MB/s)       -223,652   -6.68%   x 1.07
 sort::par_sort_unstable_big                                         1,830,812 (3495 MB/s)      2,530,672 (2528 MB/s)       699,860   38.23%   x 0.72
 sort::par_sort_unstable_mostly_descending                           184,598 (2166 MB/s)        171,825 (2327 MB/s)         -12,773   -6.92%   x 1.07
 str_split::parallel_space_char                                      264,467                    213,826                     -50,641  -19.15%   x 1.24
 str_split::parallel_space_fn                                        215,124                    182,302                     -32,822  -15.26%   x 1.18
 vec_collect::vec_i::with_collect_into_vec_reused                    419,200                    381,671                     -37,529   -8.95%   x 1.10
 vec_collect::vec_i::with_fold                                       8,026,621                  7,100,183                  -926,438  -11.54%   x 1.13
 vec_collect::vec_i::with_linked_list_collect_vec                    4,623,693                  3,617,714                -1,005,979  -21.76%   x 1.28
 vec_collect::vec_i::with_linked_list_collect_vec_sized              4,765,766                  3,618,636                -1,147,130  -24.07%   x 1.32
 vec_collect::vec_i::with_linked_list_map_reduce_vec_sized           4,659,306                  3,644,773                -1,014,533  -21.77%   x 1.28
 vec_collect::vec_i::with_vec_vec_sized                              4,454,536                  3,543,003                  -911,533  -20.46%   x 1.26
 vec_collect::vec_i_filtered::with_collect                           5,133,122                  4,143,791                  -989,331  -19.27%   x 1.24
 vec_collect::vec_i_filtered::with_fold                              10,158,530                 9,013,442                -1,145,088  -11.27%   x 1.13
 vec_collect::vec_i_filtered::with_linked_list_collect_vec           6,698,931                  5,980,331                  -718,600  -10.73%   x 1.12
 vec_collect::vec_i_filtered::with_linked_list_collect_vec_sized     6,810,821                  6,010,995                  -799,826  -11.74%   x 1.13
 vec_collect::vec_i_filtered::with_linked_list_map_reduce_vec_sized  5,073,883                  4,195,550                  -878,333  -17.31%   x 1.21
 vec_collect::vec_i_filtered::with_vec_vec_sized                     4,915,209                  4,116,080                  -799,129  -16.26%   x 1.19

@nikomatsakis (Member) commented

cc @wagnerf42

@wagnerf42 (Contributor) commented

Hi, I'd like to take some time looking at the benches.

Off the top of my head, here are some raw comments:

  • It is impossible to always win (I have one example where the current algorithm is already not splitting enough). The idea for me would be to have good defaults and to allow users to tune the policy when they really need it (see the sketch after this list). It is actually possible to move the scheduling policy out of the bridge and into an adaptor, so it would be OK to sacrifice a few examples.
  • I don't like micro-benches so much because they need to run on very small inputs. They tell us something, but a 20% speed increase gives the impression of being 20% faster, while if you double the input size you are maybe only 10% faster. I'd rather count the number of tasks and draw some logs to figure out what is going on.
  • I'd like to specifically target examples which might suffer from this change (they need some uneven load distribution).
  • That being said, the current policy is cutting way too much for high thread counts.
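
As a concrete illustration of the tuning point in the first bullet (the workload below is invented; `with_min_len` / `with_max_len` are rayon's existing knobs on `IndexedParallelIterator` for bounding how far a job may be split):

    use rayon::prelude::*;

    fn main() {
        // Regardless of the adaptive strategy, users can already cap splitting
        // on indexed iterators: no piece will go below 4096 items here.
        let total: u64 = (0..1_000_000u32)
            .into_par_iter()
            .with_min_len(4_096)
            .map(|x| u64::from(x % 7))
            .sum();
        println!("{}", total);
    }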

If you can give me a bit more time (1 week?), I will take a closer look.

@cuviper (Member, Author) commented May 17, 2021

It is impossible to always win, [...] the idea for me would be to have good defaults and to allow users to tune the policy when they really need it.

Yes, I agree that needs will vary, but I hope this change makes a better default.

I don't like micro-benches so much because they need to run on very small inputs.

Yeah, it's hard. Even with the larger benchmarks (cargo run in rayon-demo rather than cargo bench), I'm not really confident that we're representing realistic workloads, or enough diversity.

If you can give me a bit more time (1 week?), I will take a closer look.

I'm not in a rush, so I'm happy to let you dig further -- thanks!

@wagnerf42 (Contributor) commented

Hi, so here is one example I had in mind:

    // Assumes `size: &u64` (hence the deref). Each item `e` costs O(e) work,
    // so the load is heavily skewed toward the high end of the range.
    use rayon::prelude::*;
    let r = (0..*size)
        .into_par_iter()
        .map(|e| (0..e).map(|x| x % 2).sum::<u64>())
        .sum::<u64>();

So, why this example?

If everything is nicely balanced, then I don't see how the proposed modification could be bad. This example makes sense to me because it is what people would do for parallel combinations (all combinations of 2 out of n), and it is not balanced.

In this example, when the parallel range gets divided in two, the left part contains 1/4 of the total work and the right part 3/4: item e costs about e units of work, so the left half 0..n/2 holds roughly (n/2)^2 / 2 ≈ n^2/8 of the ≈ n^2/2 total. For performance, the scheduler will need to keep dividing regularly throughout the execution (log(n) times).
Note that even with two threads this is not handled nicely by the current rayon, and there is high randomness in the execution times.

I wonder about the following: if you need to keep generating tasks on one branch during the execution, then I think at some point only one single stealable task may remain under the new mechanism. What does that mean? Well, it is still enough to distribute work to all threads, but now every stealer must try around p times unsuccessfully before achieving a successful steal.
Is that a big deal? Maybe, maybe not, because the number of steals is related to the depth, which is logarithmic here.

I did a run of this code on a 32-core machine using 64 threads, and the new code was about 40% slower there (sizes 100k to 300k).

I'd still need to take a look inside the run to see what is really going on.

I'll also try some more benches next week.

@cuviper (Member, Author) commented May 21, 2021

Thank you for your investigation! The slowdown is surprising and disappointing, but I hope you'll be able to get a complete picture of why that is. We should also encode that knowledge in source comments at least, because this is all currently very opaque.

@adamreichold (Collaborator) commented Oct 5, 2021

To add another data point to the discussion, I have an example at GeoStat-Framework/GSTools-Core#6 where this change makes the difference between a slowdown (serial versus parallel) and a speedup: without it, around 500 large temporary arrays are needed for a fold, and with it, this drops to 150 (against a theoretical maximum of 1000). This is especially important in that case because I cannot use with_min_len, since ndarray's Zip does not provide an IndexedParallelIterator implementation.
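
A minimal sketch of the effect described above (the numbers and workload are made up; the real case is in the linked GSTools-Core PR): every split of a `fold` gets its own accumulator, so counting the accumulators shows directly how many large temporaries the splitting strategy caused.

    use rayon::prelude::*;

    fn main() {
        // Every split of a `fold` creates one accumulator, so the length of
        // `partials` equals the number of pieces the scheduler split into --
        // with large buffers, that count is exactly the memory cost at stake.
        let partials: Vec<Vec<f64>> = (0..1_000u32)
            .into_par_iter()
            .fold(
                || vec![0.0f64; 1 << 16], // one large temporary per split
                |mut acc, i| {
                    let idx = i as usize % acc.len();
                    acc[idx] += f64::from(i);
                    acc
                },
            )
            .collect();
        println!("temporary buffers allocated: {}", partials.len());
    }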
