Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add helper method for taking the k smallest elements in an iterator #473

Merged
merged 13 commits into from Dec 14, 2020
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
1 change: 1 addition & 0 deletions Cargo.toml
Expand Up @@ -28,6 +28,7 @@ either = { version = "1.0", default-features = false }
[dev-dependencies]
rand = "0.7"
criterion = "=0" # TODO how could this work with our minimum supported rust version?
paste = "1.0.0" # Used in test_std to instanciate generic tests

[dev-dependencies.quickcheck]
version = "0.9"
Expand Down
20 changes: 20 additions & 0 deletions src/k_smallest.rs
@@ -0,0 +1,20 @@
use alloc::collections::BinaryHeap;
use core::cmp::Ord;

pub(crate) fn k_smallest<T: Ord, I: Iterator<Item = T>>(mut iter: I, k: usize) -> BinaryHeap<T> {
if k == 0 { return BinaryHeap::new(); }

let mut heap = iter.by_ref().take(k).collect::<BinaryHeap<_>>();

for i in iter {
debug_assert_eq!(heap.len(), k);
// Equivalent to heap.push(min(i, heap.pop())) but more efficient.
// This should be done with a single `.peek_mut().unwrap()` but
// `PeekMut` sifts-down unconditionally on Rust 1.46.0 and prior.
if *heap.peek().unwrap() > i {
*heap.peek_mut().unwrap() = i;
}
}

heap
}
39 changes: 39 additions & 0 deletions src/lib.rs
Expand Up @@ -194,6 +194,8 @@ mod group_map;
mod groupbylazy;
mod intersperse;
#[cfg(feature = "use_alloc")]
mod k_smallest;
#[cfg(feature = "use_alloc")]
mod kmerge_impl;
#[cfg(feature = "use_alloc")]
mod lazy_buffer;
Expand Down Expand Up @@ -2284,6 +2286,43 @@ pub trait Itertools : Iterator {
v.into_iter()
}

/// Sort the k smallest elements into a new iterator, in ascending order.
///
/// **Note:** This consumes the entire iterator, and returns the result
/// as a new iterator that owns its elements. If the input contains
/// less than k elements, the result is equivalent to `self.sorted()`.
///
/// This is guaranteed to use `k * sizeof(Self::Item) + O(1)` memory
/// and `O(n log k)` time, with `n` the number of elements in the input.
Copy link
Contributor

@scottmcm scottmcm Sep 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, this is not what I was expecting this to be. Instead, I was expecting a wrapper around partition_at_index so that it'd be O(n) space & time.

What's implemented is interesting, though -- the lower memory use and O(n log k) is an interesting alternative to the O(n + k log n) of BinaryHeap::from_iter(it).into_sorted_iter().take(k) -- but maybe take some time to figure out a way to communicate these differences in the name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@scottmcm O(n) time is impossible in the general case. Otherwise, it.k_smallest(n) would sort a sequence of length n in linear time.

Re: BinaryHeap::from_iter(it).into_sorted_iter().take(k), wouldn't this have strictly-worse performance characteristics, regarding both time and space? I need to think some more about it once caffeinated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's absolutely possible if this only needs to find the k smallest elements and not also sort those elements -- the fact that this is also sorting those elements wasn't obvious from the title. (And I would say, the extra k log k of sorting them is unfortunate if someone just wants the smallest k and doesn't need them sorted. It would be nice to be able to turn them back into a vec::IntoIter for the people who would be fine with heap order, like .k_smallest(10).sum().)

So there are plenty of options here:

Algorithm space time
nth element (unsorted) n n
partial sort n n + k log k
full heap n n + k log n
k-size heap k n log k

(Which does emphasize that BinaryHeap::into_iter_sorted should only be used for not-known-ahead-of-time k.)

Copy link
Contributor Author

@nbraud nbraud Sep 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's absolutely possible if this only needs to find the k smallest elements and not also sort those

Agreed; that just seemed like a worse API to be exposing, as:

  • an undefined output order, which happens to be often sorted or almost-sorted (like heap order), can be confusing for the user (I myself thought for a moment that BinaryHeap::into_vec() sorted the result, and was surprised by that, as the examples I was testing with happened to produce a sorted result) ;
  • sorting the end result doesn't change the asymptotic complexity, for any heap-based implementation.

However, if we expose something like a fixed-size heap — and I agree it's a good idea, if only for the .extend() type of use-cases you suggested — users who do not require a sorted result could use that directly.

So there are plenty of options here:

You seem to be including partition-based algorithms — AFAIK, the only way to get better time complexity than O(n log k) — and those do not work directly for iterators; in principle, it's always possible to first collect everything into a Vec, use a partition-based algorithm, then truncate the Vec down to size k, but in practice I would expect this to be much slower.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should be careful if we really want sorted output, as this forces users to pay runtime for sorting - even if they don't need it.

If API discoverability is an issue, we could still have smallest_k_sorted vs. smallest_k_unsorted as a last resort, clarifying the difference.

///
/// The sorted iterator, if directly collected to a `Vec`, is converted
/// without any extra copying or allocation cost.
///
/// **Note:** This is functionally-equivalent to `self.sorted().take(k)`
/// but much more efficient.
///
/// ```
/// use itertools::Itertools;
///
/// // A random permutation of 0..15
/// let numbers = vec![6, 9, 1, 14, 0, 4, 8, 7, 11, 2, 10, 3, 13, 12, 5];
///
/// let five_smallest = numbers
/// .into_iter()
/// .k_smallest(5);
///
/// itertools::assert_equal(five_smallest, 0..5);
/// ```
#[cfg(feature = "use_alloc")]
fn k_smallest(self, k: usize) -> VecIntoIter<Self::Item>
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what's itertools policy but maybe if the return type was newtype'd as KSmallestIter or something, it would allow changing it into the future to something more efficient?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point, I'll address that tomorrow <3

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot to mention, but that doesn't seem feasible while keeping the guarantee that .collect::<Vec<_>> is “free” (in-place, O(1) time). The stdlib's vec iterator can do it, but does so using specialisation (which isn't stabilised yet)

where Self: Sized,
Self::Item: Ord
{
crate::k_smallest::k_smallest(self, k)
.into_sorted_vec()
.into_iter()
Comment on lines +2322 to +2323
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be cool to use into_iter_sorted, too bad that's unstable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well that sounds like something I can push for: rust-lang/rust#76234

}

/// Collect all iterator elements into one of two
/// partitions. Unlike `Iterator::partition`, each partition may
/// have a distinct type.
Expand Down
88 changes: 87 additions & 1 deletion tests/test_std.rs
@@ -1,4 +1,9 @@
use paste;
use permutohedron;
use quickcheck as qc;
use rand::{distributions::{Distribution, Standard}, Rng, SeedableRng, rngs::StdRng};
use rand::{seq::SliceRandom, thread_rng};
use std::{cmp::min, fmt::Debug, marker::PhantomData};
use itertools as it;
use crate::it::Itertools;
use crate::it::ExactlyOneError;
Expand Down Expand Up @@ -354,6 +359,88 @@ fn sorted_by() {
it::assert_equal(v, vec![4, 3, 2, 1, 0]);
}

qc::quickcheck! {
fn k_smallest_range(n: u64, m: u16, k: u16) -> () {
// u16 is used to constrain k and m to 0..2¹⁶,
// otherwise the test could use too much memory.
let (k, m) = (k as u64, m as u64);

// Generate a random permutation of n..n+m
let i = {
let mut v: Vec<u64> = (n..n.saturating_add(m)).collect();
nbraud marked this conversation as resolved.
Show resolved Hide resolved
v.shuffle(&mut thread_rng());
v.into_iter()
};

// Check that taking the k smallest elements yields n..n+min(k, m)
it::assert_equal(
i.k_smallest(k as usize),
n..n.saturating_add(min(k, m))
);
}
}

#[derive(Clone, Debug)]
struct RandIter<T: 'static + Clone + Send, R: 'static + Clone + Rng + SeedableRng + Send = StdRng> {
idx: usize,
len: usize,
rng: R,
_t: PhantomData<T>
}

impl<T: Clone + Send, R: Clone + Rng + SeedableRng + Send> Iterator for RandIter<T, R>
where Standard: Distribution<T> {
type Item = T;
fn next(&mut self) -> Option<T> {
if self.idx == self.len {
None
} else {
self.idx += 1;
Some(self.rng.gen())
}
}
}

impl<T: Clone + Send, R: Clone + Rng + SeedableRng + Send> qc::Arbitrary for RandIter<T, R> {
fn arbitrary<G: qc::Gen>(g: &mut G) -> Self {
RandIter {
idx: 0,
len: g.size(),
rng: R::seed_from_u64(g.next_u64()),
_t : PhantomData{},
}
}
}

// Check that taking the k smallest is the same as
// sorting then taking the k first elements
fn k_smallest_sort<I>(i: I, k: u16) -> ()
where
I: Iterator + Clone,
I::Item: Ord + Debug,
{
let j = i.clone();
let k = k as usize;
it::assert_equal(
i.k_smallest(k),
j.sorted().take(k)
)
}

macro_rules! generic_test {
($f:ident, $($t:ty),+) => {
$(paste::item! {
qc::quickcheck! {
fn [< $f _ $t >](i: RandIter<$t>, k: u16) -> () {
$f(i, k)
}
}
})+
};
}

generic_test!(k_smallest_sort, u8, u16, u32, u64, i8, i16, i32, i64);

#[test]
fn sorted_by_key() {
let sc = [3, 4, 1, 2].iter().cloned().sorted_by_key(|&x| x);
Expand Down Expand Up @@ -387,7 +474,6 @@ fn test_multipeek() {
assert_eq!(mp.next(), Some(5));
assert_eq!(mp.next(), None);
assert_eq!(mp.peek(), None);

}

#[test]
Expand Down