
Add SliceRandom::choose_multiple_weighted, implementing weighted sampling without replacement #976

Closed · wants to merge 6 commits
Conversation

@zrneely (Contributor) commented May 11, 2020

This introduces two new methods, intended to close #596:

  • SliceRandom::choose_multiple_weighted - similar to SliceRandom::choose_multiple, but without replacement and with weights.
  • index::sample_weighted - an analogue to index::sample, which implements the actual logic.

Notably, this includes two different implementations of sample_weighted: one which uses an unstable, nightly-only feature of the standard library, "slice_partition_at_index", and one which uses only stable functionality. By my measurements, the unstable feature makes sampling from very large slices roughly 40% faster (about 50 ms versus 87 ms per iteration when sampling from one million indices; see below). I'm not sure what the right thing to do here is; the options seem to be:

  • Keep things the way they are, and remove the worse implementation if/when the unstable feature is stabilized. I'm not sure how this interacts with rustc version requirements.
  • Copy the unstable feature's logic into this crate. This has the downside that it's a significant amount of logic that arguably doesn't belong in rand, and it involves heavy use of unsafe.

Both implementations are based on Algorithm A described by Efraimidis and Spirakis in their paper on Weighted Random Sampling (Information Processing Letters, 2006). They differ in how they implement step 2, "select the items with the largest keys".
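For reference, here is a minimal sketch (not the PR's actual code) of Algorithm A and of the two ways step 2 can be implemented: sorting all keys on stable Rust, or a selection algorithm such as the nightly-only partition_at_index (later stabilized as select_nth_unstable_by). Names and structure are illustrative only.

use rand::Rng;

/// Sketch of Efraimidis-Spirakis Algorithm A: assign each index the key
/// u_i^(1/w_i) with u_i ~ U(0, 1), then keep the `amount` largest keys.
fn sample_weighted_sketch<R: Rng>(rng: &mut R, weights: &[f64], amount: usize) -> Vec<usize> {
    // Step 1: generate one key per item.
    let mut keyed: Vec<(f64, usize)> = weights
        .iter()
        .enumerate()
        .map(|(i, &w)| (rng.gen::<f64>().powf(1.0 / w), i))
        .collect();

    // Step 2, stable variant: sort by key (descending) and take the front.
    keyed.sort_unstable_by(|a, b| b.0.partial_cmp(&a.0).unwrap());

    // Step 2, selection variant, roughly what the nightly feature provides
    // (O(n) instead of O(n log n); stabilized later as select_nth_unstable_by):
    // keyed.select_nth_unstable_by(amount - 1, |a, b| b.0.partial_cmp(&a.0).unwrap());

    keyed.iter().take(amount).map(|&(_, i)| i).collect()
}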

One last thing I'm uncertain about is how the algorithm should behave when given invalid weights, i.e. negative values and NaN. The approach I've taken so far is to check that each value returned by the user-specified weight function is valid before computing with it; the alternative is to simply document that the behavior of the algorithm is unspecified if the weights are invalid.


Here are some benchmarking results taken on Windows on a Ryzen 1700X:

> cargo +nightly bench -- sample_weighted

test misc_sample_weighted_indices_100_of_1M  ... bench:  86,739,580 ns/iter (+/- 3,530,996)
test misc_sample_weighted_indices_100_of_1k  ... bench:      92,856 ns/iter (+/- 1,179)
test misc_sample_weighted_indices_10_of_1k   ... bench:      84,986 ns/iter (+/- 1,144)
test misc_sample_weighted_indices_1_of_1k    ... bench:      83,997 ns/iter (+/- 1,771)
test misc_sample_weighted_indices_1k_of_1M   ... bench:  87,341,100 ns/iter (+/- 508,386)
test misc_sample_weighted_indices_200_of_1M  ... bench:  87,002,990 ns/iter (+/- 989,274)
test misc_sample_weighted_indices_400_of_1M  ... bench:  87,762,280 ns/iter (+/- 4,247,746)
test misc_sample_weighted_indices_600_of_1M  ... bench:  87,204,490 ns/iter (+/- 3,406,199)

> cargo +nightly bench --features="nightly" -- sample_weighted

test misc_sample_weighted_indices_100_of_1M  ... bench:  50,854,360 ns/iter (+/- 5,298,528)
test misc_sample_weighted_indices_100_of_1k  ... bench:      41,186 ns/iter (+/- 11,823)
test misc_sample_weighted_indices_10_of_1k   ... bench:      40,682 ns/iter (+/- 21,438)
test misc_sample_weighted_indices_1_of_1k    ... bench:      35,714 ns/iter (+/- 527)
test misc_sample_weighted_indices_1k_of_1M   ... bench:  50,370,600 ns/iter (+/- 4,227,278)
test misc_sample_weighted_indices_200_of_1M  ... bench:  49,976,930 ns/iter (+/- 4,569,493)
test misc_sample_weighted_indices_400_of_1M  ... bench:  49,964,680 ns/iter (+/- 5,318,092)
test misc_sample_weighted_indices_600_of_1M  ... bench:  50,865,190 ns/iter (+/- 4,911,017)

@@ -23,7 +23,7 @@ appveyor = { repository = "rust-random/rand" }
 [features]
 # Meta-features:
 default = ["std", "std_rng"]
-nightly = ["simd_support"] # enables all features requiring nightly rust
+nightly = ["simd_support", "partition_at_index"] # enables all features requiring nightly rust
Collaborator

A simpler approach might be to just use the nightly feature instead of introducing a new one. This would have the advantage that we wouldn't have to worry about how to drop the feature later without breaking code.

Contributor Author

My reasoning here is that this allows leaving the feature flag in place even once "slice_partition_at_index" stabilizes, which would allow supporting older rustc versions. I also see your point about dropping the feature being a breaking change, though...

Collaborator

Fair enough. I think we will likely just raise the minimum Rust version. @dhardy What do you think?

Member
@dhardy May 29, 2020

In general, it is not possible to reliably support old nightlies, and not very useful either. So removing a feature flag only usable on nightly compilers once that feature has stabilised is not an issue.

src/seq/index.rs: two outdated review threads (resolved)
@vks (Collaborator) commented May 11, 2020

The implementation looks good to me. I think the general docs in rand::seq could be updated to mention weighted sampling with and without replacement.

@@ -50,6 +50,7 @@
 #![doc(test(attr(allow(unused_variables), deny(warnings))))]
 #![cfg_attr(not(feature = "std"), no_std)]
 #![cfg_attr(all(feature = "simd_support", feature = "nightly"), feature(stdsimd))]
+#![cfg_attr(all(feature = "partition_at_index", feature = "nightly"), feature(slice_partition_at_index))]
Member

This however makes separate use of partition_at_index pointless — I think unlike simd_support we should not depend on nightly here. As a bonus, this makes it impossible to use partition_at_index on stable compilers.

Collaborator

I still don't understand what the advantage of a separate partition_at_index feature is.

Comment on lines -11 to +22
#[cfg(feature = "alloc")] use core::slice;
#[cfg(feature = "alloc")]
use core::slice;

#[cfg(all(feature = "alloc", not(feature = "std")))]
use crate::alloc::vec::{self, Vec};
#[cfg(feature = "std")] use std::vec;
#[cfg(feature = "std")]
use std::vec;
// BTreeMap is not as fast in tests, but better than nothing.
#[cfg(all(feature = "alloc", not(feature = "std")))]
use crate::alloc::collections::BTreeSet;
#[cfg(feature = "std")] use std::collections::HashSet;
#[cfg(feature = "std")]
use std::collections::HashSet;
Member
@dhardy May 29, 2020

Leave the reformatting out please.

Member
@dhardy left a comment

Thanks for the PR! The paper (the full one, not what you linked) seems a good source (though I don't follow the integral{CDF × PDF} step).

Comment on lines +309 to +311
if length == 0 {
    return Ok(IndexVec::USize(Vec::new()));
}
Member

Better: if amount == 0. And this check should be pushed up to the top of the method.
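A minimal sketch of the suggested change, assuming it sits at the very top of sample_weighted (illustrative only):

// Hypothetical early-out at the top of the method: checking `amount`
// instead of `length` also covers requests for zero samples from a
// non-empty slice.
if amount == 0 {
    return Ok(IndexVec::USize(Vec::new()));
}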

fn partial_cmp(&self, other: &Self) -> Option<core::cmp::Ordering> {
    self.key
        .partial_cmp(&other.key)
        .or(Some(core::cmp::Ordering::Less))
Member

Why include this line? To avoid a panic in Ord::cmp? Does it have some benefit? It's technically wrong, and it means NaN values could end up at an arbitrary sort position (some sorting algorithms might not even terminate). Better to just allow a panic, IMO.
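A sketch of the alternative being suggested, assuming a wrapper struct with an f64 key (the field name `key` is taken from the excerpt above; everything else is illustrative):

use core::cmp::Ordering;

struct Element {
    index: usize,
    key: f64, // u^(1/w); never NaN if weights are validated up front
}

impl PartialEq for Element {
    fn eq(&self, other: &Self) -> bool {
        self.key == other.key
    }
}

impl Eq for Element {}

impl PartialOrd for Element {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        self.key.partial_cmp(&other.key)
    }
}

impl Ord for Element {
    fn cmp(&self, other: &Self) -> Ordering {
        // Panics on NaN instead of silently giving it an arbitrary order.
        self.partial_cmp(other).unwrap()
    }
}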

let mut candidates = Vec::with_capacity(length);
for index in 0..length {
    let weight = weight(index).into();
    if weight < 0.0 || weight.is_nan() {
Member

I prefer !(weight >= 0.0), though it is equivalent.
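(The equivalence holds because every comparison involving NaN is false, so the single test catches both negative weights and NaN; a quick illustration:)

// NaN compares false against everything, so `!(weight >= 0.0)` is true
// for NaN as well as for negative values.
assert!(!(f64::NAN >= 0.0));
assert!(!(-1.0_f64 >= 0.0));
assert!(0.5_f64 >= 0.0);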

Comment on lines +269 to +276
pub fn sample_weighted<R, F, X>(
    rng: &mut R, length: usize, weight: F, amount: usize,
) -> Result<IndexVec, WeightedError>
where
    R: Rng + ?Sized,
    F: Fn(usize) -> X,
    X: Into<f64>,
{
Member

This implementation looks correct, but it may use a lot of memory without good cause. Suggestions:

If we know length <= u32::MAX, then use u32 instead of usize. I guess in this case we can't since a million entries can be evaluated in under a tenth of a second. It may or may not be worth testing and implementing multiple paths. (IndexVec is like it is for performance reasons.)

An implementation of the A-Res variant should compare well on performance I think (though still O(n)). An implementation of A-ExpJ variant should be the best option for length >> amount. It may be nice having both and selecting which to use with a heuristic, or it may be that A-ExpJ is fast enough to always be a good choice.
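For context, a minimal sketch of the A-Res variant mentioned here: it keeps only a size-`amount` reservoir (a min-heap keyed by u^(1/w)) instead of materializing a key for every index, so the working set is O(amount) rather than O(length). The names and the OrdKey wrapper below are illustrative, not rand's API:

use std::cmp::Reverse;
use std::collections::BinaryHeap;

use rand::Rng;

/// f64 wrapper so keys can live in a BinaryHeap; panics on NaN comparisons.
#[derive(PartialEq, PartialOrd)]
struct OrdKey(f64);

impl Eq for OrdKey {}
impl Ord for OrdKey {
    fn cmp(&self, other: &Self) -> std::cmp::Ordering {
        self.partial_cmp(other).unwrap()
    }
}

/// A-Res: maintain the `amount` items with the largest keys u^(1/w).
fn sample_weighted_a_res<R, F>(rng: &mut R, length: usize, weight: F, amount: usize) -> Vec<usize>
where
    R: Rng,
    F: Fn(usize) -> f64,
{
    // Min-heap of (key, index): the smallest kept key sits at the top.
    let mut reservoir: BinaryHeap<Reverse<(OrdKey, usize)>> = BinaryHeap::with_capacity(amount);
    for index in 0..length {
        let key = OrdKey(rng.gen::<f64>().powf(1.0 / weight(index)));
        if reservoir.len() < amount {
            reservoir.push(Reverse((key, index)));
        } else {
            // Replace the current minimum only if this key is larger.
            let replace = match reservoir.peek() {
                Some(Reverse((min_key, _))) => key > *min_key,
                None => false,
            };
            if replace {
                reservoir.pop();
                reservoir.push(Reverse((key, index)));
            }
        }
    }
    reservoir.into_iter().map(|Reverse((_, i))| i).collect()
}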

Successfully merging this pull request may close these issues.

Implement weighted sampling without replacement