
Rayon slower than single threaded performance when mixing iter and par_iter #1136

bergkvist opened this issue Feb 26, 2024 · 9 comments

@bergkvist

Consider the following "mixed iterator" example:

use rayon::prelude::*;
use std::hint::black_box;

fn main() {
    let t0 = std::time::Instant::now();
    let result = (0..512).map(|x| {
        (0..512).map(|y| {
            (0..16).into_par_iter().map(|z| black_box(x*y*z)).sum::<usize>()
        }).sum::<usize>()
    }).sum::<usize>();
    println!("elapsed: {} ms", t0.elapsed().as_millis());
    println!("result: {result}");
}

Timings:

$ cargo run --release
elapsed: 4173 ms
result: 2053539102720

# Notice how this single-threaded run is significantly faster than the multithreaded run above:
$ RAYON_NUM_THREADS=1 cargo run --release
elapsed: 1275 ms
result: 2053539102720

Using perf on Linux:

$ cargo build --release
$ sudo perf record -g ./target/release/rayon-perf-example
$ sudo perf report

+   43.80%    16.74%  rayon-perf-exam  rayon-perf-example    [.] crossbeam_epoch::default::with_handle
+   38.15%     1.91%  rayon-perf-exam  [kernel.vmlinux]      [k] entry_SYSCALL_64
+   34.90%     0.94%  rayon-perf-exam  [kernel.vmlinux]      [k] do_syscall_64
+   34.52%     0.96%  rayon-perf-exam  libc.so.6             [.] __sched_yield
+   29.09%    19.91%  rayon-perf-exam  rayon-perf-example    [.] crossbeam_deque::deque::Stealer<T>::steal
+   27.71%     2.37%  rayon-perf-exam  [kernel.vmlinux]      [k] __schedule
+   27.12%     0.57%  rayon-perf-exam  [kernel.vmlinux]      [k] __ia32_sys_sched_yield
+   26.31%     0.40%  rayon-perf-exam  [kernel.vmlinux]      [k] schedule
+   18.91%     4.69%  rayon-perf-exam  [kernel.vmlinux]      [k] pick_next_task_fair
+   11.99%     4.58%  rayon-perf-exam  rayon-perf-example    [.] _ZN15crossbeam_epoch8internal6Global11try_advance17h69e4b2d1793d9e17E.llvm.1688863770226311152
+   10.11%     4.04%  rayon-perf-exam  [kernel.vmlinux]      [k] update_curr
+    7.90%     5.06%  rayon-perf-exam  rayon-perf-example    [.] <core::iter::adapters::chain::Chain<A,B> as core::iter::traits::iterator::Iterator>::try_fold
+    6.96%     0.83%  rayon-perf-exam  [kernel.vmlinux]      [k] update_rq_clock
+    6.68%     2.61%  rayon-perf-exam  [kernel.vmlinux]      [k] psi_account_irqtime
...

Using into_par_iter on all 3 nesting levels (400x faster)

use rayon::prelude::*;
use std::hint::black_box;

fn main() {
    let t0 = std::time::Instant::now();
    let result = (0..512).into_par_iter().map(|x| {
        (0..512).into_par_iter().map(|y| {
            (0..16).into_par_iter().map(|z| black_box(x*y*z)).sum::<usize>()
        }).sum::<usize>()
    }).sum::<usize>();
    println!("elapsed: {} ms", t0.elapsed().as_millis());
    println!("result: {result}");
}

Timings:

$ cargo run --release
elapsed: 8 ms
result: 2053539102720

$ RAYON_NUM_THREADS=1 cargo run --release
elapsed: 12 ms
result: 2053539102720

Using only serial iterators

This actually turns out to be the fastest in this case, so using rayon at all was not advantageous for this particular problem.

use std::hint::black_box;

fn main() {
    let t0 = std::time::Instant::now();
    let result = (0..512).map(|x| {
        (0..512).map(|y| {
            (0..16).map(|z| black_box(x*y*z)).sum::<usize>()
        }).sum::<usize>()
    }).sum::<usize>();
    println!("elapsed: {} ms", t0.elapsed().as_millis());
    println!("result: {result}");
}

Timings:

$ cargo run --release
elapsed: 2 ms
result: 2053539102720
@adamreichold
Collaborator

Generally speaking, parallelization does have non-negligible overhead, and hence whether it actually results in speed-ups depends on the amount of work done per item. In nested loops, it is often the case that not parallelizing all levels is optimal. From past experience, I would recommend starting by parallelizing the outermost loop and then moving inwards, i.e. start with the largest work items and repeatedly reduce their size while increasing their count, as long as that pays off.

In your case, I see a speed-up from parallelizing the two outer levels, but it is minimal compared to just the outermost level, which I would probably stick to here:

> hyperfine ./serial ./parallel-outermost ./parallel-two-levels 
Benchmark 1: ./serial
  Time (mean ± σ):       2.6 ms ±   0.2 ms    [User: 1.7 ms, System: 1.3 ms]
  Range (min … max):     2.3 ms …   3.6 ms    773 runs
 
  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
 
Benchmark 2: ./parallel-outermost
  Time (mean ± σ):       1.4 ms ±   0.1 ms    [User: 3.1 ms, System: 4.6 ms]
  Range (min … max):     1.2 ms …   2.1 ms    919 runs
 
  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
 
Benchmark 3: ./parallel-two-levels
  Time (mean ± σ):       1.3 ms ±   0.1 ms    [User: 3.2 ms, System: 4.3 ms]
  Range (min … max):     1.1 ms …   1.9 ms    1093 runs
 
  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
 
Summary
  ./parallel-two-levels ran
    1.01 ± 0.10 times faster than ./parallel-outermost
    1.91 ± 0.19 times faster than ./serial

i.e.

use std::hint::black_box;

use rayon::prelude::*;

fn main() {
    let result = (0..512).into_par_iter().map(|x| {
        (0..512).map(|y| {
            (0..16).map(|z| black_box(x*y*z)).sum::<usize>()
        }).sum::<usize>()
    }).sum::<usize>();

    black_box(result);
}

Finally, maybe you could expand on what you would consider a helpful change here? Did you find the documentation lacking somewhere?

P.S.: As an aside, as hyperfine warns, it would probably be nicer to time multiple iterations of the above, as the runtime of a single one is below 5 ms.
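
For instance, a minimal sketch (my own illustration, not from the thread) that loops the serial workload many times inside the timer and reports a per-iteration average, so the measurement stays well above that 5 ms limit:

use std::hint::black_box;

fn main() {
    // Run the workload many times and report the average per iteration,
    // which keeps the measured time well above the ~5 ms granularity
    // hyperfine warns about. ITERS is an arbitrary illustrative value.
    const ITERS: u32 = 1_000;
    let t0 = std::time::Instant::now();
    for _ in 0..ITERS {
        let result = (0..512).map(|x| {
            (0..512).map(|y| {
                (0..16).map(|z| black_box(x * y * z)).sum::<usize>()
            }).sum::<usize>()
        }).sum::<usize>();
        black_box(result);
    }
    println!("average: {} us", t0.elapsed().as_micros() / ITERS as u128);
}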

@bergkvist
Author

bergkvist commented Feb 26, 2024

My point is that it is surprising that iter(iter(par_iter)) has ~400x more overhead than par_iter(par_iter(par_iter)) in the example, and that setting RAYON_NUM_THREADS=1 is not enough to make it behave like the sequential version (it is still ~100x slower).

Imagine that the inner scope is within another function used in a lot of places in a code-base, initially sequential, but later made parallel.

As you slowly start introducing rayon into the code-base, you might see big performance improvements in some places, while unexpectedly also getting 100x slowdowns in other parts of your code.

@cuviper
Member

cuviper commented Feb 26, 2024

The problem is that the work you're starting in parallel is tiny, not that there are iterators on the outside.

(0..16).into_par_iter().map(|z| black_box(x*y*z)).sum::<usize>()

Even when forced to a single thread, it's still sending that to the thread pool and waiting for the result, and I'm not surprised at all if that thread synchronization is more costly than the actual work.

@bergkvist
Author

Yes, I intentionally crafted the example to have very little compute to make the overhead as visible as possible. Why is it that parallel iterators in the outer scopes remove the overhead of the par_iter in the inner scope?

@adamreichold
Collaborator

adamreichold commented Feb 26, 2024

Why is it that parallel iterators in the outer scopes remove the overhead of the par_iter in the inner scope?

I guess the outer parallel iterators produce tasks which keep the thread pool occupied and prevent the inner tiny iterators from being split at all, hence getting closer to the sequential version.

In your example of some utility function encapsulating some parallel processing, you would probably need to avoid excessive splitting using something like with_min_len if you know that your work items might be too small to benefit from parallelization on their own.
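
For example, a minimal sketch of such a utility function (my own illustration, not from the thread; the helper name and the threshold of 512 are arbitrary):

use rayon::prelude::*;

// Hypothetical helper: sums x*x over a slice, but never splits the input
// into chunks of fewer than 512 items, so small inputs effectively run
// sequentially on whichever thread calls this.
fn sum_of_squares(items: &[usize]) -> usize {
    items
        .par_iter()
        .with_min_len(512)
        .map(|&x| x * x)
        .sum()
}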

@cuviper
Member

cuviper commented Feb 26, 2024

Why is it that parallel iterators in the outer scopes remove the overhead of the par_iter in the inner scope?

I guess the outer parallel iterators produce tasks which keep the thread pool occupied and prevent the inner tiny iterators from being split at all hence getting closer to the sequential version.

Also, the inner iterators will hit the fast path of already being within the pool, so they can start executing directly.
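
A minimal way to see that distinction (my sketch, not from the thread) is rayon::current_thread_index(), which returns None outside the pool and Some(index) on a worker thread:

use rayon::prelude::*;

fn main() {
    // On the main thread we are outside the global pool, so a par_iter
    // started here must be injected into the pool and waited on.
    assert_eq!(rayon::current_thread_index(), None);

    (0..4).into_par_iter().for_each(|_| {
        // Inside a worker thread, a nested par_iter can start executing
        // directly on the current thread (the fast path).
        assert!(rayon::current_thread_index().is_some());
    });
}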

@bergkvist
Author

bergkvist commented Feb 26, 2024

Oh, I didn't know about with_min_len - this seems like a great way to build parallelized functions with less surprising performance behaviour when used in various contexts:

use rayon::prelude::*;
use std::hint::black_box;

fn main() {
    let t0 = std::time::Instant::now();
    let result = (0..512).map(|x| {
        (0..512).map(|y| {
            my_sum_function(x, y, 16)
        }).sum::<usize>()
    }).sum::<usize>();
    println!("elapsed: {} us", t0.elapsed().as_micros());
    println!("result: {result}");
}

fn my_sum_function(x: usize, y: usize, n: usize) -> usize {
    (0..n).into_par_iter().with_min_len(512).map(|z| black_box(x*y*z)).sum::<usize>()
}

Timings (notice I'm using microseconds, not milliseconds, here):

$ cargo run --release
elapsed: 6562 us
result: 2053539102720

That said, it seems like the following still has less overhead:

use rayon::prelude::*;
use std::hint::black_box;

fn main() {
    let t0 = std::time::Instant::now();
    let result = (0..512).map(|x| {
        (0..512).map(|y| {
            my_sum_function(x, y, 16)
        }).sum::<usize>()
    }).sum::<usize>();
    println!("elapsed: {} us", t0.elapsed().as_micros());
    println!("result: {result}");
}

fn my_sum_function(x: usize, y: usize, n: usize) -> usize {
    if n < 512 {
        (0..n).map(|z| black_box(x*y*z)).sum::<usize>()
    } else {
        (0..n).into_par_iter().map(|z| black_box(x*y*z)).sum::<usize>()
    }
}

Timings:

$ cargo run --release
elapsed: 1890 us
result: 2053539102720

@bergkvist
Author

Also, the inner iterators will hit the fast path of already being within the pool, so they can start executing directly.

So if I understand correctly, this seems to be right, judging by the following code (where we simply wrap the entire iter(iter(par_iter)) in a rayon::join):

use rayon::prelude::*;
use std::hint::black_box;

fn main() {
    let t0 = std::time::Instant::now();
    let (result, _) = rayon::join(|| {
        (0..512).map(|x| {
            (0..512).map(|y| {
                (0..16).into_par_iter().map(|z| black_box(x*y*z)).sum::<usize>()
            }).sum::<usize>()
        }).sum::<usize>()
    }, || {

    });
    
    println!("elapsed: {} ms", t0.elapsed().as_millis());
    println!("result: {result}");
}

And here we can see that RAYON_NUM_THREADS=1 is roughly 135x faster than the multithreaded run (13 ms vs 1764 ms), presumably because the tasks don't need to be sent around and all run within the single-threaded pool entered by the rayon::join:

$ cargo run --release
elapsed: 1764 ms
result: 2053539102720

$ RAYON_NUM_THREADS=1 cargo run --release
elapsed: 13 ms
result: 2053539102720

I'm guessing the reason the multi-threaded example is still somewhat slow is that the work is moved around between threads in the pool a lot.

@ar37-rs

ar37-rs commented Mar 6, 2024

Probably it's like the difference between sending in a bulldozer and a scoop. Of course, the scoop is much faster for tiny tasks.
