Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Weighted choice algorithms #532

Closed
dhardy opened this issue Jun 27, 2018 · 4 comments
Closed

Weighted choice algorithms #532

dhardy opened this issue Jun 27, 2018 · 4 comments
Labels
E-question Participation: opinions wanted X-discussion Type: discussion

Comments

@dhardy
Copy link
Member

dhardy commented Jun 27, 2018

See also: dhardy#82, #518


Fundamentally I see three types of weighted-choice algorithm:

  1. Calculate weight_sum, take sample = rng.gen_range(0, weight_sum), iterate over elements until cumulative weight exceeds sample then take the previous item.
  2. Calculate a CDF of weights (probably just an array of cumulative weights), take sample as above, then find item by binary search; look up element from the index
  3. As follows:
    fn choose_weighted<R, F, I, X>(items: I, weight_fn: F, rng: &mut R) -> Option<T>
    where
        R: Rng + ?Sized,
        I: Iterator<T>,
        F: Fn(&T) -> W,
        X: SampleUniform +
            ::core::ops::AddAssign<X> +
            ::core::cmp::PartialOrd<X>
    {
        let mut result = if let Some(item) = items.next() {
            item
        } else {
            return None;
        };
        let mut sum = weight_fn(&result);
        
        while let Some(item) = items.next() {
            let weight = weight_fn(&item);
            sum += weight;
            if rng.gen_range(0, sum) < weight {
                result = item;
            }
        }
        Some(result)
    }

Where one wants to sample from the same set of weights multiple times, calculating a CDF is the obvious choice since the CDF should require no more memory than the original weights themselves.

Where one wants to sample a single time from a slice, one of the first two choices makes the most sense; since calculating the total weight requires all the work of calculating the CDF except storing the results, using the CDF may often be the best option but this isn't guaranteed.

Where one wants to sample a single time from an iterator, any of the above can be used, but the first two options require either cloning the iterator and iterating twice (not always possible and slightly expensive) or collecting all items into a temporary vector while calculating the sum/CDF, then selecting the required item. In this case the last option may be attractive, though of course sampling the RNG for every item has significant overhead (so probably is only useful for large elements or no allocator).


Which algorithm(s) should we include in Rand?

The method calculating the CDF will often be preferred, so should be included. Unfortunately it requires an allocator (excepting if weighs are provided via mutable reference to a slice), but we should probably not worry about this.

A convenience method to sample from weighted slices would presumably prefer to use the CDF method normally.

For a method to sample from weighted iterators it is less clear which implementation should be used. Although it will not perform well, the last algorithm (i.e. sample code above) may be a nice choice in that it does not require an allocator.


My conclusion: perhaps we should accept #518 in its current form (i.e. WeightedIndex distribution using CDF + binary search, and convenience wrappers for slices), plus consider adding the code here to sample from iterators.

@dhardy dhardy added E-question Participation: opinions wanted X-discussion Type: discussion labels Jun 27, 2018
@sicking
Copy link
Contributor

sicking commented Jun 27, 2018

Oh, neat! I hadn't thought of the third option at all! Very cool!

One thing to also keep in mind is that option 1 comes in two flavors: Either the caller providing the total weight, or where we need to calculate it.

For repeated sampling, a CDF seems indeed like the best option.

For single sampling the equation does seem more complicated. It's a tradeoff between how fast the iteraterator is, how fast the RNG is, and even how fast AddAssign and gen_range for the weight type is (in theory someone could use BigInts as weights). And it depends on constraints like if allocation is available, if the iterator is Cloneable, and if the caller can provide the total weight easily.

For non-weighted sampling we're providing all options, and let the caller worry about picking the one that's faster or more convenient for their case. But I don't think that makes sense for weighted sampling given that it's likely much less commonly used.

My thinking as of late has been to not worry about performance for single sampling. Generally performance of operations done once rarely matters. The only case I could think of where it matters is if someone does repeated sampling, but where the weights can change between each sampling. But this seems even more rare. And even in that case the optimal solution depends on if the caller can easily maintain a list of cumulative weights and if the total weight changes or not.

In short, for single sampling it feels like the design space is huge, and the performance often does not matter.

So my suggestion is to just provide performance-optimized API for repeated sampling. Single sampling can then use that same API and just sample once. If that doesn't provide good enough performance they can implement whatever solution fits their constraints the best.

But then also provide APIs optimized for convenience, since often that's at least as useful to callers as providing perf-optimized solutions.

@dhardy
Copy link
Member Author

dhardy commented Jun 27, 2018

Good reasoning; I agree with you on that. Some single-usage stuff gets used a lot (e.g. gen_range), but it seems less likely with weighted sampling. There are indeed way too many variants to write an optimal general case.

@sicking
Copy link
Contributor

sicking commented Jun 27, 2018

Yeah, I suspect gen_range is used both in cases where the range changes between sampling, and when the simplicity of gen_range is more important than the performance benefit of Uniform.

@sicking
Copy link
Contributor

sicking commented Jul 5, 2018

Should we close this now that #518 has landed?

@dhardy dhardy closed this as completed Jul 5, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
E-question Participation: opinions wanted X-discussion Type: discussion
Projects
None yet
Development

No branches or pull requests

2 participants