[WIP] Create Scan parallel iterator #1036

arieluy · 2023-03-16T04:41:08Z

Hi, here's a draft of a parallel scan function based on discussion in #329 and #393.

General Approach

1. Use a consumer to split the iterator into pieces and run a sequential scan on each, accumulating into a Vec<T>. The reduce step combines the pieces into a LinkedList<Vec<T>>.
2. Sequentially compute the "offset" to add to each piece by taking the last element of each Vec and scanning over those. Also collect the linked list into a Vec<Vec<T>> so it can be processed in parallel again.
3. Create an unindexed producer that adds the offsets to each Vec in parallel.
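
For readers who want to see the shape of the three phases described above, here is a minimal sketch. It is not the code in this PR: it uses plain `std::thread::scope` with one thread per piece instead of rayon's consumers and producers, it assumes an inclusive scan (the output has the same length as the input), and the names `chunked_scan` and `chunk_len` are made up for illustration.

```rust
use std::thread;

/// Inclusive scan in three phases (hypothetical helper, not the PR's code).
/// `op` is assumed to be associative, with `id` as its identity.
fn chunked_scan<T, F>(input: &[T], chunk_len: usize, op: F, id: T) -> Vec<T>
where
    T: Clone + Send + Sync,
    F: Fn(&T, &T) -> T + Sync,
{
    let chunk_len = chunk_len.max(1);
    // Shared references are Copy, so every thread can take its own copy.
    let op = &op;
    let id = &id;

    // Phase 1: scan each piece independently, in parallel.
    let mut pieces: Vec<Vec<T>> = thread::scope(|s| {
        let handles: Vec<_> = input
            .chunks(chunk_len)
            .map(|chunk| {
                s.spawn(move || {
                    let mut acc = id.clone();
                    chunk
                        .iter()
                        .map(|x| {
                            acc = op(&acc, x);
                            acc.clone()
                        })
                        .collect::<Vec<T>>()
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    // Phase 2: sequential scan over the last element of each piece.
    // offsets[i] is the combined value of everything before piece i.
    let mut offsets = Vec::with_capacity(pieces.len());
    let mut running = id.clone();
    for piece in &pieces {
        offsets.push(running.clone());
        if let Some(last) = piece.last() {
            running = op(&running, last);
        }
    }

    // Phase 3: apply each offset to its piece, again in parallel.
    // The offset goes on the left so non-commutative operators stay correct.
    thread::scope(|s| {
        for (piece, offset) in pieces.iter_mut().zip(&offsets) {
            s.spawn(move || {
                for x in piece.iter_mut() {
                    *x = op(offset, &*x);
                }
            });
        }
    });

    pieces.into_iter().flatten().collect()
}
```

Calling `chunked_scan(&v, 3, |a, b| a + b, 0)` on the integers 1 through 10 yields their prefix sums. Phase 2 does one `op` call per piece, which is the sequential cost that grows as the input is split more finely.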

To maximize parallelism, we don't want to split too finely, since every extra piece adds work to the sequential offset computation. I've been using .with_min_len() to prevent that.

Happy to take feedback on better ways to do any of these steps.

Benchmarks

I also tried writing some benchmarks. I ran these on my M2 Mac with 12 threads.

For a regular prefix sum or product on ints, the parallel overhead is too much to see any improvement with the parallel version.
Only sufficiently slow operations will benefit, but they can see a fairly large improvement. I tested this by writing an addition function that waits a certain amount of time before returning. With an input of 100,000 elements:
- With a delay of 100 ns, the parallel speedup is 2.99x
- With a delay of 10,000 ns, the parallel speedup is 5.27x
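
The delayed addition function is not shown in this description, so the following is only a guess at what such a helper might look like. The name `slow_add` is hypothetical, and the busy-wait is an assumption: sleeping is typically far too coarse for delays around 100 ns.

```rust
use std::time::{Duration, Instant};

/// Adds two numbers, but returns only after at least `delay` has elapsed.
/// Hypothetical benchmark helper; spins instead of sleeping so that
/// sub-microsecond delays are honoured as closely as the clock allows.
fn slow_add(a: u64, b: u64, delay: Duration) -> u64 {
    let start = Instant::now();
    while start.elapsed() < delay {
        std::hint::spin_loop();
    }
    a + b
}
```

A scan operator built on it would look like `|a, b| slow_add(*a, *b, Duration::from_nanos(100))`.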
A slightly more realistic example is matrix multiplication. Here's the execution time on various numbers of threads, compared to the sequential baseline.

Potential Improvements

- Write a specialized version for prefix sums/products
- Output an indexed producer instead of an unindexed one. This is definitely harder, but I have a few ideas for how it might be doable. Is this important?
- Test on machines with more cores
- General cleanup + docs

Write a parallel scan function that consumes a parallel iterator, then produces a new one. Add some benchmarking and testing.

cuviper · 2023-03-24T20:40:22Z

src/iter/mod.rs

@@ -1384,6 +1387,14 @@ pub trait ParallelIterator: Sized + Send {
        sum::sum(self)
    }

+    fn scan<F>(self, scan_op: F, id: Self::Item) -> Scan<Self::Item, F>


Let's start here, documenting at the API level, especially for a user who is looking for an equivalent to Iterator::scan. How would you describe what this does, and importantly how is it different than the sequential version? The signature of F is quite different, looking more like some flavor of reduce.

arieluy · 2023-03-29

Good question. I wasn't sure what the signature should be, and it could still change, but it relates to Iterator::scan the way reduce relates to fold. It should produce the same result as a sequential scan when the operation is associative and has no internal state. If the operation doesn't meet those conditions, it won't make sense to run it in parallel.
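
To make the associativity requirement concrete, here is a small sequential sketch. It is an illustration only, not code from this PR: the helper names are hypothetical, and it assumes an inclusive scan. `scan_two_halves` mimics what a parallel split does by scanning two halves separately and stitching them together.

```rust
/// Prefix sums with the standard library's stateful `Iterator::scan`, for contrast.
fn prefix_sums_std(items: &[i64]) -> Vec<i64> {
    items
        .iter()
        .scan(0i64, |acc, &x| {
            *acc += x;
            Some(*acc)
        })
        .collect()
}

/// Sequential inclusive scan with a reduce-style operator and an identity.
fn scan_seq<T: Clone>(items: &[T], op: impl Fn(&T, &T) -> T, id: T) -> Vec<T> {
    let mut acc = id;
    items
        .iter()
        .map(|x| {
            acc = op(&acc, x);
            acc.clone()
        })
        .collect()
}

/// The same scan done as two independent halves that are stitched together.
/// This matches `scan_seq` only when `op` is associative and `id` is its identity.
fn scan_two_halves<T: Clone>(items: &[T], op: impl Fn(&T, &T) -> T, id: T) -> Vec<T> {
    let (left, right) = items.split_at(items.len() / 2);
    let mut out = scan_seq(left, &op, id.clone());
    // Everything in the right half must be combined with the left half's total.
    let offset = out.last().cloned().unwrap_or_else(|| id.clone());
    let right_scanned = scan_seq(right, &op, id);
    out.extend(right_scanned.iter().map(|x| op(&offset, x)));
    out
}
```

With addition, all three functions agree on `[1, 2, 3, 4, 5]`. With subtraction, which is not associative, `scan_seq` gives `[-1, -3, -6, -10, -15]` while `scan_two_halves` gives `[-1, -3, 0, 4, 9]`, so the split version no longer matches the sequential one.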

arieluy · 2023-04-11

Hi @cuviper, could you take another look at this?

cuviper · 2023-04-11

I don't feel you addressed my first comment? The signature is part of it, but please make an attempt at adding documentation describing what this does. And frankly, I don't think many people even use Iterator::scan, so it's not sufficient to just reference that. Suppose this existed independently - describe what it does, and what points are important for the user to think about.

arieluy · 2023-04-18

Sorry, I misunderstood you last time. I've added documentation.

I was hoping to get some feedback on the signature and general approach. These are the differences from the other ParallelIterator functions, and my rationale for them:

The scan operation currently has type Fn(&Item, &Item) -> Item. Each intermediate result is both an output element and the input to the next call, so the operation needs to borrow it rather than consume it.

Since scan_op takes references, we don't need multiple copies of the identity, so I made identity a plain Item rather than Fn() -> Item, which is simpler. The other functions use Fn() -> Item, though, and it would be easy to change.
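
The trade-off between the two signatures can be sketched on a single piece of input. Both helpers below are hypothetical illustrations rather than the PR's implementation, and they assume an inclusive scan. The by-reference form borrows the previous output and a shared identity; the by-value form, in the style of fold and reduce, needs an identity factory and a clone per element.

```rust
/// By-reference operator: the previous output element is borrowed directly,
/// and one identity value can be shared by every piece. No `Clone` bound needed.
fn scan_piece_ref<T>(piece: &[T], op: impl Fn(&T, &T) -> T, id: &T) -> Vec<T> {
    let mut out: Vec<T> = Vec::with_capacity(piece.len());
    for x in piece {
        let next = op(out.last().unwrap_or(id), x);
        out.push(next);
    }
    out
}

/// By-value operator: the accumulator is consumed by each call, so every
/// output must be cloned, and each piece needs a fresh identity from a factory.
fn scan_piece_owned<T: Clone>(
    piece: Vec<T>,
    op: impl Fn(T, T) -> T,
    identity: impl Fn() -> T,
) -> Vec<T> {
    let mut acc = identity();
    let mut out = Vec::with_capacity(piece.len());
    for x in piece {
        acc = op(acc, x);
        out.push(acc.clone());
    }
    out
}
```

For `String` concatenation, `scan_piece_ref(&words, |a, b| format!("{a}{b}"), &String::new())` and `scan_piece_owned(words, |a, b| a + &b, String::new)` produce the same output; the difference is in who pays for the copies.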

Do you have any preferences on those?

Create Scan unindexed parallel iterator

ff947e3

cuviper reviewed Mar 24, 2023

Add docs for scan

ad1b436

mratsim mentioned this pull request Feb 2, 2024

Implement parallel prefix sum / parallel scan privacy-scaling-explorations/halo2#262

Open

arieluy commented Mar 16, 2023

cuviper Mar 24, 2023

arieluy Mar 29, 2023

arieluy Apr 11, 2023

cuviper Apr 11, 2023

arieluy Apr 18, 2023
