Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add par_sort_by_cached_key #932

Merged
merged 3 commits into from May 13, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
24 changes: 24 additions & 0 deletions rayon-demo/src/sort.rs
Expand Up @@ -79,6 +79,17 @@ macro_rules! sort {
};
}

macro_rules! sort_keys {
($f:ident, $name:ident, $gen:expr, $len:expr) => {
#[bench]
fn $name(b: &mut Bencher) {
let v = $gen($len);
b.iter(|| v.clone().$f(ToString::to_string));
b.bytes = $len * mem::size_of_val(&$gen(1)[0]) as u64;
}
};
}

macro_rules! sort_strings {
($f:ident, $name:ident, $gen:expr, $len:expr) => {
#[bench]
Expand Down Expand Up @@ -136,6 +147,13 @@ sort!(par_sort, par_sort_random, gen_random, 50_000);
sort!(par_sort, par_sort_big, gen_big_random, 50_000);
sort_strings!(par_sort, par_sort_strings, gen_strings, 50_000);
sort_expensive!(par_sort_by, par_sort_expensive, gen_random, 50_000);
sort_keys!(par_sort_by_key, par_sort_by_key, gen_random, 50_000);
sort_keys!(
par_sort_by_cached_key,
par_sort_by_cached_key,
gen_random,
50_000
);

sort!(
par_sort_unstable,
Expand Down Expand Up @@ -185,6 +203,12 @@ sort_expensive!(
gen_random,
50_000
);
sort_keys!(
par_sort_unstable_by_key,
par_sort_unstable_by_key,
gen_random,
50_000
);

trait MergeSort {
fn demo_merge_sort(&mut self);
Expand Down
206 changes: 172 additions & 34 deletions src/slice/mod.rs
Expand Up @@ -20,6 +20,7 @@ use crate::split_producer::*;
use std::cmp;
use std::cmp::Ordering;
use std::fmt::{self, Debug};
use std::mem;

pub use self::chunks::{Chunks, ChunksExact, ChunksExactMut, ChunksMut};
pub use self::rchunks::{RChunks, RChunksExact, RChunksExactMut, RChunksMut};
Expand Down Expand Up @@ -268,7 +269,7 @@ pub trait ParallelSliceMut<T: Send> {

/// Sorts the slice in parallel.
///
/// This sort is stable (i.e. does not reorder equal elements) and `O(n log n)` worst-case.
/// This sort is stable (i.e., does not reorder equal elements) and *O*(*n* \* log(*n*)) worst-case.
///
/// When applicable, unstable sorting is preferred because it is generally faster than stable
/// sorting and it doesn't allocate auxiliary memory.
Expand Down Expand Up @@ -308,7 +309,25 @@ pub trait ParallelSliceMut<T: Send> {

/// Sorts the slice in parallel with a comparator function.
///
/// This sort is stable (i.e. does not reorder equal elements) and `O(n log n)` worst-case.
/// This sort is stable (i.e., does not reorder equal elements) and *O*(*n* \* log(*n*)) worst-case.
///
/// The comparator function must define a total ordering for the elements in the slice. If
/// the ordering is not total, the order of the elements is unspecified. An order is a
/// total order if it is (for all `a`, `b` and `c`):
///
/// * total and antisymmetric: exactly one of `a < b`, `a == b` or `a > b` is true, and
/// * transitive, `a < b` and `b < c` implies `a < c`. The same must hold for both `==` and `>`.
///
/// For example, while [`f64`] doesn't implement [`Ord`] because `NaN != NaN`, we can use
/// `partial_cmp` as our sort function when we know the slice doesn't contain a `NaN`.
///
/// ```
/// use rayon::prelude::*;
///
/// let mut floats = [5f64, 4.0, 1.0, 3.0, 2.0];
/// floats.par_sort_by(|a, b| a.partial_cmp(b).unwrap());
/// assert_eq!(floats, [1.0, 2.0, 3.0, 4.0, 5.0]);
/// ```
///
/// When applicable, unstable sorting is preferred because it is generally faster than stable
/// sorting and it doesn't allocate auxiliary memory.
Expand Down Expand Up @@ -353,7 +372,12 @@ pub trait ParallelSliceMut<T: Send> {

/// Sorts the slice in parallel with a key extraction function.
///
/// This sort is stable (i.e. does not reorder equal elements) and `O(n log n)` worst-case.
/// This sort is stable (i.e., does not reorder equal elements) and *O*(*m* \* *n* \* log(*n*))
/// worst-case, where the key function is *O*(*m*).
///
/// For expensive key functions (e.g. functions that are not simple property accesses or
/// basic operations), [`par_sort_by_cached_key`](#method.par_sort_by_cached_key) is likely to
/// be significantly faster, as it does not recompute element keys.
///
/// When applicable, unstable sorting is preferred because it is generally faster than stable
/// sorting and it doesn't allocate auxiliary memory.
Expand Down Expand Up @@ -384,27 +408,119 @@ pub trait ParallelSliceMut<T: Send> {
/// v.par_sort_by_key(|k| k.abs());
/// assert_eq!(v, [1, 2, -3, 4, -5]);
/// ```
fn par_sort_by_key<B, F>(&mut self, f: F)
fn par_sort_by_key<K, F>(&mut self, f: F)
where
B: Ord,
F: Fn(&T) -> B + Sync,
K: Ord,
F: Fn(&T) -> K + Sync,
{
par_mergesort(self.as_parallel_slice_mut(), |a, b| f(a).lt(&f(b)));
}

/// Sorts the slice in parallel, but may not preserve the order of equal elements.
/// Sorts the slice in parallel with a key extraction function.
///
/// This sort is unstable (i.e. may reorder equal elements), in-place (i.e. does not allocate),
/// and `O(n log n)` worst-case.
/// During sorting, the key function is called at most once per element, by using
/// temporary storage to remember the results of key evaluation.
/// The key function is called in parallel, so the order of calls is completely unspecified.
///
/// This sort is stable (i.e., does not reorder equal elements) and *O*(*m* \* *n* + *n* \* log(*n*))
/// worst-case, where the key function is *O*(*m*).
///
/// For simple key functions (e.g., functions that are property accesses or
/// basic operations), [`par_sort_by_key`](#method.par_sort_by_key) is likely to be
/// faster.
///
/// # Current implementation
///
/// The current algorithm is based on Orson Peters' [pattern-defeating quicksort][pdqsort],
/// which is a quicksort variant designed to be very fast on certain kinds of patterns,
/// sometimes achieving linear time. It is randomized but deterministic, and falls back to
/// heapsort on degenerate inputs.
/// The current algorithm is based on [pattern-defeating quicksort][pdqsort] by Orson Peters,
/// which combines the fast average case of randomized quicksort with the fast worst case of
/// heapsort, while achieving linear time on slices with certain patterns. It uses some
/// randomization to avoid degenerate cases, but with a fixed seed to always provide
/// deterministic behavior.
///
/// In the worst case, the algorithm allocates temporary storage in a `Vec<(K, usize)>` the
/// length of the slice.
///
/// It is generally faster than stable sorting, except in a few special cases, e.g. when the
/// All quicksorts work in two stages: partitioning into two halves followed by recursive
/// calls. The partitioning phase is sequential, but the two recursive calls are performed in
/// parallel. Finally, after sorting the cached keys, the item positions are updated sequentially.
///
/// [pdqsort]: https://github.com/orlp/pdqsort
///
/// # Examples
///
/// ```
/// use rayon::prelude::*;
///
/// let mut v = [-5i32, 4, 32, -3, 2];
///
/// v.par_sort_by_cached_key(|k| k.to_string());
/// assert!(v == [-3, -5, 2, 32, 4]);
/// ```
fn par_sort_by_cached_key<K, F>(&mut self, f: F)
where
F: Fn(&T) -> K + Sync,
K: Ord + Send,
{
let slice = self.as_parallel_slice_mut();
let len = slice.len();
if len < 2 {
return;
}

// Helper macro for indexing our vector by the smallest possible type, to reduce allocation.
macro_rules! sort_by_key {
($t:ty) => {{
let mut indices: Vec<_> = slice
.par_iter_mut()
.enumerate()
.map(|(i, x)| (f(&*x), i as $t))
.collect();
// The elements of `indices` are unique, as they are indexed, so any sort will be
// stable with respect to the original slice. We use `sort_unstable` here because
// it requires less memory allocation.
indices.par_sort_unstable();
for i in 0..len {
let mut index = indices[i].1;
while (index as usize) < i {
index = indices[index as usize].1;
}
indices[i].1 = index;
slice.swap(i, index as usize);
}
}};
}

let sz_u8 = mem::size_of::<(K, u8)>();
let sz_u16 = mem::size_of::<(K, u16)>();
let sz_u32 = mem::size_of::<(K, u32)>();
let sz_usize = mem::size_of::<(K, usize)>();

if sz_u8 < sz_u16 && len <= (std::u8::MAX as usize) {
return sort_by_key!(u8);
}
if sz_u16 < sz_u32 && len <= (std::u16::MAX as usize) {
return sort_by_key!(u16);
}
if sz_u32 < sz_usize && len <= (std::u32::MAX as usize) {
return sort_by_key!(u32);
}
sort_by_key!(usize)
}

/// Sorts the slice in parallel, but might not preserve the order of equal elements.
///
/// This sort is unstable (i.e., may reorder equal elements), in-place
/// (i.e., does not allocate), and *O*(*n* \* log(*n*)) worst-case.
///
/// # Current implementation
///
/// The current algorithm is based on [pattern-defeating quicksort][pdqsort] by Orson Peters,
/// which combines the fast average case of randomized quicksort with the fast worst case of
/// heapsort, while achieving linear time on slices with certain patterns. It uses some
/// randomization to avoid degenerate cases, but with a fixed seed to always provide
/// deterministic behavior.
///
/// It is typically faster than stable sorting, except in a few special cases, e.g., when the
/// slice consists of several concatenated sorted sequences.
///
/// All quicksorts work in two stages: partitioning into two halves followed by recursive
Expand All @@ -430,20 +546,39 @@ pub trait ParallelSliceMut<T: Send> {
par_quicksort(self.as_parallel_slice_mut(), T::lt);
}

/// Sorts the slice in parallel with a comparator function, but may not preserve the order of
/// Sorts the slice in parallel with a comparator function, but might not preserve the order of
/// equal elements.
///
/// This sort is unstable (i.e. may reorder equal elements), in-place (i.e. does not allocate),
/// and `O(n log n)` worst-case.
/// This sort is unstable (i.e., may reorder equal elements), in-place
/// (i.e., does not allocate), and *O*(*n* \* log(*n*)) worst-case.
///
/// The comparator function must define a total ordering for the elements in the slice. If
/// the ordering is not total, the order of the elements is unspecified. An order is a
/// total order if it is (for all `a`, `b` and `c`):
///
/// * total and antisymmetric: exactly one of `a < b`, `a == b` or `a > b` is true, and
/// * transitive, `a < b` and `b < c` implies `a < c`. The same must hold for both `==` and `>`.
///
/// For example, while [`f64`] doesn't implement [`Ord`] because `NaN != NaN`, we can use
/// `partial_cmp` as our sort function when we know the slice doesn't contain a `NaN`.
///
/// ```
/// use rayon::prelude::*;
///
/// let mut floats = [5f64, 4.0, 1.0, 3.0, 2.0];
/// floats.par_sort_unstable_by(|a, b| a.partial_cmp(b).unwrap());
/// assert_eq!(floats, [1.0, 2.0, 3.0, 4.0, 5.0]);
/// ```
///
/// # Current implementation
///
/// The current algorithm is based on Orson Peters' [pattern-defeating quicksort][pdqsort],
/// which is a quicksort variant designed to be very fast on certain kinds of patterns,
/// sometimes achieving linear time. It is randomized but deterministic, and falls back to
/// heapsort on degenerate inputs.
/// The current algorithm is based on [pattern-defeating quicksort][pdqsort] by Orson Peters,
/// which combines the fast average case of randomized quicksort with the fast worst case of
/// heapsort, while achieving linear time on slices with certain patterns. It uses some
/// randomization to avoid degenerate cases, but with a fixed seed to always provide
/// deterministic behavior.
///
/// It is generally faster than stable sorting, except in a few special cases, e.g. when the
/// It is typically faster than stable sorting, except in a few special cases, e.g., when the
/// slice consists of several concatenated sorted sequences.
///
/// All quicksorts work in two stages: partitioning into two halves followed by recursive
Expand Down Expand Up @@ -474,21 +609,24 @@ pub trait ParallelSliceMut<T: Send> {
});
}

/// Sorts the slice in parallel with a key extraction function, but may not preserve the order
/// Sorts the slice in parallel with a key extraction function, but might not preserve the order
/// of equal elements.
///
/// This sort is unstable (i.e. may reorder equal elements), in-place (i.e. does not allocate),
/// and `O(n log n)` worst-case.
/// This sort is unstable (i.e., may reorder equal elements), in-place
/// (i.e., does not allocate), and *O*(m \* *n* \* log(*n*)) worst-case,
/// where the key function is *O*(*m*).
///
/// # Current implementation
///
/// The current algorithm is based on Orson Peters' [pattern-defeating quicksort][pdqsort],
/// which is a quicksort variant designed to be very fast on certain kinds of patterns,
/// sometimes achieving linear time. It is randomized but deterministic, and falls back to
/// heapsort on degenerate inputs.
/// The current algorithm is based on [pattern-defeating quicksort][pdqsort] by Orson Peters,
/// which combines the fast average case of randomized quicksort with the fast worst case of
/// heapsort, while achieving linear time on slices with certain patterns. It uses some
/// randomization to avoid degenerate cases, but with a fixed seed to always provide
/// deterministic behavior.
///
/// It is generally faster than stable sorting, except in a few special cases, e.g. when the
/// slice consists of several concatenated sorted sequences.
/// Due to its key calling strategy, `par_sort_unstable_by_key` is likely to be slower than
/// [`par_sort_by_cached_key`](#method.par_sort_by_cached_key) in cases where the key function
/// is expensive.
///
/// All quicksorts work in two stages: partitioning into two halves followed by recursive
/// calls. The partitioning phase is sequential, but the two recursive calls are performed in
Expand All @@ -506,10 +644,10 @@ pub trait ParallelSliceMut<T: Send> {
/// v.par_sort_unstable_by_key(|k| k.abs());
/// assert_eq!(v, [1, 2, -3, 4, -5]);
/// ```
fn par_sort_unstable_by_key<B, F>(&mut self, f: F)
fn par_sort_unstable_by_key<K, F>(&mut self, f: F)
where
B: Ord,
F: Fn(&T) -> B + Sync,
K: Ord,
F: Fn(&T) -> K + Sync,
{
par_quicksort(self.as_parallel_slice_mut(), |a, b| f(a).lt(&f(b)));
}
Expand Down