Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Poor performance with rayon ParallelIterator #170

Open
pkolaczk opened this issue Apr 21, 2020 · 17 comments
Open

Poor performance with rayon ParallelIterator #170

pkolaczk opened this issue Apr 21, 2020 · 17 comments

Comments

@pkolaczk
Copy link

pkolaczk commented Apr 21, 2020

When processing several hundreds of thousands of items per second with a ParallelIterator, ProgressBar becomes a bottleneck because it grabs a global shared lock whenever the state needs to be updated. Instead, it would be better to store internal progress for each rayon thread separately and sum these counters on read.

Having said that as a Rust beginner, I'm not sure how hard such implementation could be in Rust and whether Rust has a similar structure like Java LongAdder (https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/atomic/LongAdder.html).

RelaxedCounter from https://docs.rs/atomic-counter/1.0.1/atomic_counter/ maybe?

@mibac138
Copy link
Contributor

By global shared lock do you mean the stderr lock? If writing to the console is your bottleneck, you can use ProgressBar::set_draw_delta to limit the number of updates. The bigger the number you provide the less frequent the updates are going to be and thus the less locking will occur.

@pkolaczk
Copy link
Author

pkolaczk commented Apr 21, 2020

No, stderr is unlikely a problem. I already tried reducing the rendering refresh rate to one per second and it didn't help at all.

The problem is the fact that the whole ProgressBar is wrapped in an RwLock :

See this in rayon_support:

    pub struct ParProgressBarIter<T> {
        it: T,
        progress: Arc<Mutex<ProgressBar>>,
    }

And also updating the internal state requires locking:

    fn update_and_draw<F: FnOnce(&mut ProgressState)>(&self, f: F) {
        let mut draw = false;
        {
            let mut state = self.state.write().unwrap();

@pkolaczk
Copy link
Author

pkolaczk commented Apr 21, 2020

As a quick check I tried using RelaxedCounter and it seems to work much better on my iterator that delivers about ~1 million items / second.

Here a fragment of my original code:

    let file_scan_pb = ProgressBar::new_spinner();
    file_scan_pb.set_draw_target(ProgressDrawTarget::stdout_with_hz(1));
    file_scan_pb.set_style(ProgressStyle::default_spinner().template("Scanning files: {pos}"));

    let files = ... // obtain a parallel iterator
    files
        .progress_with(file_scan_pb)
        .foreach(|item| ...);  

And after changing to the counter:

    let counter = RelaxedCounter::new(0);
    let files = ... // obtain a parallel iterator   
    files
        .inspect(|_| { counter.inc(); () })
        .foreach(|item| ...);  
    println!("{}", counter.get());  // to avoid smart compiler optimizing our counter out

This one is about 3x faster, just as fast as without the inspect() line at all and now the bottleneck is my iterator.

@mibac138
Copy link
Contributor

Oh, I see the problem now. Could you try and use an experimental branch to see if it helps?
To do that you will need to modify your Cargo.toml. Change indicatif = "0.14.0" to indicatif = { git="https://github.com/mibac138/indicatif", branch="rayon-rate-limit" }. Then try it out like so:

let pb = ...
let iter = pb.wrap_par_iter(files);
iter.set_update_delta(1000);
iter.foreach(|item| ...);

@pkolaczk
Copy link
Author

error[E0599]: no method named `wrap_par_iter` found for struct `indicatif::progress::ProgressBar` in the current scope
  --> src/main.rs:62:36
   |
62 |     let size_groups = file_scan_pb.wrap_par_iter(files)
   |                                    ^^^^^^^^^^^^^ help: there is a method with a similar name: `wrap_iter`

@mibac138
Copy link
Contributor

Oops, please make sure you also enable with_rayon feature like so indicatif = { git="https://github.com/mibac138/indicatif", branch="rayon-rate-limit, features = ["with_rayon"] }

@pkolaczk
Copy link
Author

Does not compile :(

   Compiling indicatif v0.14.0 (https://github.com/mibac138/indicatif?branch=rayon-rate-limit#555b0eb6)
error[E0425]: cannot find value `update_delta` in this scope
  --> /home/pkolaczk/.cargo/git/checkouts/indicatif-4be6e86cdcc437b6/555b0eb/src/par_iter.rs:70:36
   |
70 |         self.data.set_update_delta(update_delta);
   |                                    ^^^^^^^^^^^^ not found in this scope

error[E0609]: no field `progress` on type `&par_iter::ProgressConsumer<C>`
   --> /home/pkolaczk/.cargo/git/checkouts/indicatif-4be6e86cdcc437b6/555b0eb/src/par_iter.rs:119:64
    |
119 |         ProgressConsumer::new(self.base.split_off_left(), self.progress.clone())
    |                                                                ^^^^^^^^ unknown field
    |
    = note: available fields are: `base`, `data`

error[E0615]: attempted to take value of method `progress` on type `par_iter::ParProgressBarIter<T>`
   --> /home/pkolaczk/.cargo/git/checkouts/indicatif-4be6e86cdcc437b6/555b0eb/src/par_iter.rs:156:62
    |
156 |         let consumer1 = ProgressConsumer::new(consumer, self.progress);
    |                                                              ^^^^^^^^ help: use parentheses to call the method: `progress()`

error: aborting due to 3 previous errors

@mibac138
Copy link
Contributor

It looks like the errors are in indicatif, however I have already fixed them. Running cargo update should make it work.

@pkolaczk
Copy link
Author

pkolaczk commented Apr 21, 2020

Looks like it works, however I had to disable the refresh rate, because otherwise it didn't write anything at all.

update_delta = 1:

real    0m9,160s
user    0m11,640s
sys     0m12,154s

update_delta = 10:

real    0m3,252s
user    0m7,111s
sys     0m8,080s

update_delta=100:

real    0m2,612s
user    0m6,521s
sys     0m8,098s

No progressbar at all:

real    0m2,535s
user    0m6,023s
sys     0m7,870s

I'm not entirely happy with this workaround, though, because the "speed" at which my iterator is able to deliver items is heavily dependent on many circumstances, and hardcoding a magic number doesn't feel right to me. I'd rather progress bar recorded all updates fast like in the example with the RelaxedCounter.

Anyway, it is still better than nothing!
Thank you! :)

BTW: Is it possible to do a pull-based progress bar? Like - I could provide a RelaxedCounter and PB would only fetch its value and display repeatedly?

@mibac138
Copy link
Contributor

BTW: Is it possible to do a pull-based progress bar? Like - I could provide a RelaxedCounter and PB would only fetch its value and display repeatedly?

You could write your own enable_steady_tick implementation and it should work. Something like this:

let counter = Arc::new(counter); // use this in inspect(counter.inc())

let counter2 = Arc::clone(&counter);
let pb2 = pb.clone();
thread::spawn(move || {
    if Arc::strong_count(&counter2) == 1 {
        // intensive counting is over, stop updating in a separate thread
        break;
    }
    thread::sleep(interval);
    pb2.set_position(counter2.get());
});

@pkolaczk
Copy link
Author

pkolaczk commented Apr 21, 2020

Thank you!

This is nice:

if Arc::strong_count(&counter2) == 1 {
        // intensive counting is over, stop updating in a separate thread
        break;
    }

I believe this hides the progress bar once the counter is dropped? Is it right?

@mibac138
Copy link
Contributor

mibac138 commented Apr 21, 2020

I believe this hides the progress bar once the counter is dropped? Is it right?

TLDR: Yes if all instances of the same progress bar are dropped (pb.clone will refer to the same instance)

When a ProgressBar is dropped it is hidden unless you have called finish or finish_with_message. The problem is that for the progress bar to be dropped all Arcs holding it must be dropped too. This check makes sure that the Arc owned by the spawned thread is dropped so that the ProgressBar can be dropped too. It could be made simpler but it requires #161 which isn't yet merged.

@thomwiggers
Copy link

thomwiggers commented Jan 22, 2021

Using @mibac138 's fork gives me a lot of

thread '<unnamed>' panicked at 'attempt to divide by zero', /home/thom/.cargo/git/checkouts/indicatif-4be6e86cdcc437b6/c716edb/src/par_iter.rs:47:26

I'm iterating over quite large values.

:(

@sstadick
Copy link

sstadick commented Feb 9, 2021

Is this a use case that indicatif is intending to support as a first class option? I also ran into this issue (indicatif being the bottleneck in a hot par_iter loop and am looking for fixes.

I'm also willing to making a PR to fix/enable this with something along the enable_steady_tick route if you are open to that

@Ekleog
Copy link

Ekleog commented Mar 25, 2021

FWIW, this problem is also very visible outside of rayon: I'm parsing gzipped log files that have a few million lines, and depending on the option I pick I get vastly different speeds:

  • If I never inc the progress bar, it takes 34 seconds
  • If I .wrap_read() the file (before giving it to libflate::gzip::Decoder because otherwise it's going way over 100%), it takes about 20 minutes (never actually tried waiting that long, but it's the indicatif-provided estimate)

And when parsing another, non-gzipped log file:

  • If I never inc the bar, it completes in 2 seconds
  • If I inc the bar by blocks of file size / 2048, it completes in 2 seconds
  • If I .wrap_read() the file, it completes in 4 seconds (note that ideally I'd be incing the bar only when actually having handled the line, not merely when having read it, so this is fast-ish [just twice as slow as alternatives] but imprecise, though it's probably not that big a deal)
  • If I inc the bar on each line read, it completes in 1 min 5 sec

Overall, I'd think that the best option might be to “just” make ProgressBar's internal counter an AtomicU64, and then instead of precomputing all the metadata on each inc() (which naturally requires a mutex), to re-compute it on each tick. This being said, I'm far from familiar with the internals of indicatif, so… this may be impossible for some reason I don't see? But if it is possible, I think it'd solve all the issues at the same time, while also providing the best user interface.

WDYT?

@untitaker
Copy link

untitaker commented Feb 4, 2022

If you're reading this thread because the progressbar is slowing down your application and don't yet know exactly why, try using set_draw_rate() and check if it still does. In my usage of indicatif I found that the application is primarily bottlenecked on terminal output, not any of what OP is talking about, even when processing 10k items per second via rayon. See also #363

That is not to say that OP's problem isn't with the locks, it depends a lot on what kind of terminal you use AFAICT, but it just really wasn't that for me.

@djc
Copy link
Collaborator

djc commented Feb 22, 2022

If someone can retest this against current main and still sees bad performance, I'd like to hear about it. The internal locking is still pretty much the same, but a bunch of other improvements should have improved performance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants