[WIP] Create Scan parallel iterator #1036

Draft
wants to merge 2 commits into
base: main

@arieluy arieluy commented Mar 16, 2023

Hi, here's a draft of a parallel scan function based on discussion in #329 and #393.

General Approach

  1. Use a consumer to split the iterator into pieces, and perform a sequential scan on each piece, accumulating into a Vec<T>. The reduce step combines them into a LinkedList<Vec<T>>.
  2. Sequentially, compute the "offset" we need to add to each section, by looking at the last element of each Vec (and doing another scan). Also, collect the linked list into a Vec<Vec<T>> so we can operate on it in parallel again.
  3. Create an unindexed producer which adds the offsets to each Vec in parallel.
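
The three steps above can be sketched with only the standard library. This is not the PR's implementation: it uses scoped threads and fixed-size chunks in place of rayon's consumer and producer machinery, and the name chunked_scan, the chunk_len parameter, and the choice of an inclusive scan are assumptions made for illustration.

```rust
use std::thread;

/// Inclusive scan of `items` under `op`, computed in three phases.
/// Assumes `op` is associative and `identity` is neutral for it.
/// (Hypothetical helper, not part of rayon or of this PR.)
fn chunked_scan<T, F>(items: &[T], op: F, identity: T, chunk_len: usize) -> Vec<T>
where
    T: Clone + Send + Sync,
    F: Fn(&T, &T) -> T + Sync,
{
    let chunk_len = chunk_len.max(1);
    let op = &op;
    let identity_ref = &identity;

    // Phase 1 (parallel): scan each chunk independently.
    let mut pieces: Vec<Vec<T>> = thread::scope(|s| {
        let handles: Vec<_> = items
            .chunks(chunk_len)
            .map(|chunk| {
                s.spawn(move || {
                    let mut out: Vec<T> = Vec::with_capacity(chunk.len());
                    for x in chunk {
                        let next = op(out.last().unwrap_or(identity_ref), x);
                        out.push(next);
                    }
                    out
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    // Phase 2 (sequential): offsets[i] combines everything before piece i,
    // built from the last element of each earlier piece.
    let mut offsets: Vec<T> = Vec::with_capacity(pieces.len());
    let mut running = identity.clone();
    for piece in &pieces {
        offsets.push(running.clone());
        if let Some(last) = piece.last() {
            running = op(&running, last);
        }
    }

    // Phase 3 (parallel): fold each piece's offset into its elements.
    // The offset goes on the left so non-commutative operations keep their order.
    thread::scope(|s| {
        for (piece, offset) in pieces.iter_mut().zip(&offsets) {
            s.spawn(move || {
                for x in piece.iter_mut() {
                    *x = op(offset, &*x);
                }
            });
        }
    });

    pieces.into_iter().flatten().collect()
}
```

For instance, chunked_scan(&[1u64, 2, 3, 4, 5], |a, b| a + b, 0, 2) scans the chunks to [1, 3], [3, 7], [5], computes the offsets 0, 3, 10, and returns [1, 3, 6, 10, 15]. Spawning one thread per chunk is only for brevity; a work-stealing pool would decide the splits itself.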

To maximize parallelism, we don't want to split too finely, since every additional piece adds to the sequential work in step 2. I've been using .with_min_len() to prevent that.

Happy to take feedback on better ways to do any of these steps.

Benchmarks

I also tried writing some benchmarks. I ran these on my M2 Mac with 12 threads.

  • For a plain prefix sum or product on integers, the parallel overhead outweighs any gain from the parallel version.
  • Only sufficiently slow operations benefit, but they can see a fairly large improvement. I tested this by writing an addition function that waits a certain amount of time before returning. With an input of 100,000 elements:
    • With a delay of 100 ns, the parallel speedup is 2.99x
    • With a delay of 10,000 ns, the parallel speedup is 5.27x
  • A slightly more realistic example is matrix multiplication. Here's the execution time on various numbers of threads, compared to the sequential baseline.

[Figure: matrix scan, execution time by number of threads compared to the sequential baseline]
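
The artificially slow addition described in the benchmark notes above might look like the following. This is a guess at the shape of such a helper, not the benchmark code from the PR: the names delayed_add and slow_prefix_sums are hypothetical, and busy-waiting is an assumption, chosen because sleeping is usually far too coarse for delays around 100 ns.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the benchmark's slow addition:
/// adds two numbers after busy-waiting for roughly `delay`.
fn delayed_add(a: &u64, b: &u64, delay: Duration) -> u64 {
    let start = Instant::now();
    // Spin instead of sleeping so that very short delays are honoured.
    while start.elapsed() < delay {
        std::hint::spin_loop();
    }
    a.wrapping_add(*b)
}

/// Sequential baseline: inclusive prefix sums using the delayed addition.
fn slow_prefix_sums(items: &[u64], delay: Duration) -> Vec<u64> {
    let mut out = Vec::with_capacity(items.len());
    let mut acc = 0u64;
    for x in items {
        acc = delayed_add(&acc, x, delay);
        out.push(acc);
    }
    out
}
```

With 100,000 elements and a 10,000 ns delay, the sequential baseline spends about one second inside the operation alone, which is the regime where splitting the work across threads can pay off.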

Potential Improvements

  • Write a specialized version for prefix sums/products
  • Output an indexed producer instead of unindexed. This is definitely harder, but I have a few ideas for how it might be doable. Is this important?
  • Test on machines with more cores
  • General cleanup + docs

Write a parallel scan function that consumes a parallel iterator,
then produces a new one. Add some benchmarking and testing.
src/iter/mod.rs Outdated
@@ -1384,6 +1387,14 @@ pub trait ParallelIterator: Sized + Send {
        sum::sum(self)
    }

    fn scan<F>(self, scan_op: F, id: Self::Item) -> Scan<Self::Item, F>
Member

Let's start here, documenting at the API level, especially for a user who is looking for an equivalent to Iterator::scan. How would you describe what this does, and importantly how is it different than the sequential version? The signature of F is quite different, looking more like some flavor of reduce.

Author

Good question, I wasn't exactly sure what the signature should be, and it could change. But it's similar to the way reduce compares to fold. This should have an identical result to sequential scan when the operation doesn't have internal state, and is associative. If not, it won't make sense to run it in parallel.
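
To illustrate the associativity requirement, here is a small sketch that compares a sequential scan with one that scans two halves independently and then combines them, as a parallel version would after a single split. The function names are hypothetical, and the scan is assumed to be inclusive; neither detail is taken from the PR.

```rust
/// Sequential inclusive scan, built on `Iterator::scan`: the reference result.
fn seq_scan<T: Clone, F: Fn(&T, &T) -> T>(items: &[T], op: F, identity: T) -> Vec<T> {
    items
        .iter()
        .scan(identity, |acc, x| {
            *acc = op(&*acc, x);
            Some(acc.clone())
        })
        .collect()
}

/// Scan the two halves independently, then fold the left half's total into
/// every element of the right half.
fn split_scan<T: Clone, F: Fn(&T, &T) -> T>(items: &[T], op: F, identity: T) -> Vec<T> {
    let (left, right) = items.split_at(items.len() / 2);
    let mut out = seq_scan(left, &op, identity.clone());
    let right_scanned = seq_scan(right, &op, identity.clone());
    let offset = out.last().cloned().unwrap_or(identity);
    out.extend(right_scanned.iter().map(|x| op(&offset, x)));
    out
}
```

With addition over [1, 2, 3, 4, 5, 6], both functions return [1, 3, 6, 10, 15, 21]. With subtraction, which is not associative, seq_scan gives [-1, -3, -6, -10, -15, -21] while split_scan gives [-1, -3, -6, -2, 3, 9], so the result would depend on where the input happens to be split.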

Author

Hi @cuviper, could you take another look at this?

Member

I don't feel you addressed my first comment? The signature is part of it, but please make an attempt at adding documentation describing what this does. And frankly, I don't think many people even use Iterator::scan, so it's not sufficient to just reference that. Suppose this existed independently - describe what it does, and what points are important for the user to think about.

Author

Sorry, I misunderstood you last time. I've added documentation.

I was hoping to get some feedback on the signature and general approach. These are the differences from the other ParallelIterator functions, and my rationale for them:

  1. The scan operation currently has type Fn(&Item, &Item) -> Item, since when we call it iteratively, we need to keep each intermediate result and not consume it.
  2. Since scan_op takes in references, we don't need multiple copies of identity. So, I have identity as type Item, rather than Fn() -> Item, since it's simpler. The other functions use Fn() -> Item, though, and it would be easy to change.

Do you have any preferences on those?
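
To make the two options concrete, here is a sequential sketch of both signatures. The function names are hypothetical and the bodies are not the PR's code; they only show how the types fit together. Because the operation borrows its arguments, each intermediate result can be stored in the output and reused as the next accumulator without a Clone bound.

```rust
/// Variant A: identity passed by value, operation takes references.
fn scan_value_identity<T, F>(items: &[T], op: F, identity: T) -> Vec<T>
where
    F: Fn(&T, &T) -> T,
{
    let mut out: Vec<T> = Vec::with_capacity(items.len());
    for x in items {
        // The previous result stays in `out` and is only borrowed here.
        let next = op(out.last().unwrap_or(&identity), x);
        out.push(next);
    }
    out
}

/// Variant B: identity supplied by a closure, in the style of `reduce`.
fn scan_fn_identity<T, ID, F>(items: &[T], identity: ID, op: F) -> Vec<T>
where
    ID: Fn() -> T,
    F: Fn(&T, &T) -> T,
{
    scan_value_identity(items, op, identity())
}
```

Variant B costs the caller one extra closure but matches the argument order of reduce(identity, op), which may matter more for consistency than for function.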
