Weighted choice algorithms #532

dhardy · 2018-06-27T16:20:59Z

Fundamentally I see three types of weighted-choice algorithm:

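1. Calculate weight_sum, take sample = rng.gen_range(0, weight_sum), then iterate over the elements until the cumulative weight exceeds sample and take the element at which that happens.
2. Calculate a CDF of the weights (probably just an array of cumulative weights), take sample as above, then find the index by binary search and look up the element at that index.
3. Sample in a single pass, keeping a running weight sum and replacing the current choice with probability weight / sum, as follows: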

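use rand::Rng;
use rand::distributions::uniform::SampleUniform;

// Assumes the two-argument gen_range(low, high) used elsewhere in this thread,
// and that X::default() is the zero weight (true for primitive numeric types).
fn choose_weighted<R, F, I, X>(mut items: I, weight_fn: F, rng: &mut R) -> Option<I::Item>
where
    R: Rng + ?Sized,
    I: Iterator,
    F: Fn(&I::Item) -> X,
    X: SampleUniform +
        Default + Clone +
        ::core::ops::AddAssign<X> +
        ::core::cmp::PartialOrd<X>
{
    // The first item is the provisional result; an empty iterator yields None.
    let mut result = if let Some(item) = items.next() {
        item
    } else {
        return None;
    };
    let mut sum = weight_fn(&result);

    // Each later item replaces the result with probability weight / sum.
    // Note: gen_range panics while sum is still zero (empty range).
    while let Some(item) = items.next() {
        let weight = weight_fn(&item);
        sum += weight.clone();
        if rng.gen_range(X::default(), sum.clone()) < weight {
            result = item;
        }
    }
    Some(result)
}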
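
For comparison, here is a rough sketch of the first two options on a slice of u32 weights. The function names are made up for illustration and are not part of Rand. Both take an already-drawn sample in [0, weight_sum), so the lookup step can be read (and checked) without an RNG; overflow of the u32 sum is ignored.

```rust
/// Option 1 (hypothetical helper): linear scan over the weights.
/// `sample` is assumed to be drawn uniformly from [0, total weight).
fn pick_linear(weights: &[u32], sample: u32) -> Option<usize> {
    let mut cumulative = 0u32;
    for (index, &w) in weights.iter().enumerate() {
        cumulative += w;
        if cumulative > sample {
            return Some(index);
        }
    }
    None // empty input, or sample >= total weight
}

/// Option 2, step one (hypothetical helper): build the cumulative weights once.
fn cumulative_weights(weights: &[u32]) -> Vec<u32> {
    let mut total = 0u32;
    weights
        .iter()
        .map(|&w| {
            total += w;
            total
        })
        .collect()
}

/// Option 2, step two: binary-search for the first cumulative weight
/// that exceeds `sample`.
fn pick_cdf(cdf: &[u32], sample: u32) -> Option<usize> {
    let index = cdf.partition_point(|&c| c <= sample);
    if index < cdf.len() {
        Some(index)
    } else {
        None
    }
}
```

Both lookups return the same index for the same sample, and neither can land on a zero-weight element; the difference is only O(n) per sample versus O(log n) after an O(n) setup.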

Where one wants to sample from the same set of weights multiple times, calculating a CDF is the obvious choice since the CDF should require no more memory than the original weights themselves.
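
A minimal sketch of that repeated-sampling case, assuming u32 weights. CumulativeTable is a hypothetical type for illustration, not the WeightedIndex from #518, and SplitMix64 is only a stand-in so the example runs without the Rand crate.

```rust
/// Hypothetical stand-in RNG (SplitMix64); not part of Rand.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Value in [0, bound). Modulo reduction is slightly biased,
    /// which is acceptable for a sketch but not for a library.
    fn below(&mut self, bound: u64) -> u64 {
        self.next_u64() % bound
    }
}

/// Hypothetical type: cumulative weights computed once, sampled many times.
struct CumulativeTable {
    cumulative: Vec<u64>,
}

impl CumulativeTable {
    /// Returns None if there are no weights or they sum to zero.
    fn new(weights: &[u32]) -> Option<Self> {
        let mut total = 0u64;
        let cumulative: Vec<u64> = weights
            .iter()
            .map(|&w| {
                total += u64::from(w);
                total
            })
            .collect();
        if total == 0 {
            None
        } else {
            Some(CumulativeTable { cumulative })
        }
    }

    /// One RNG draw plus one binary search per sample.
    fn sample(&self, rng: &mut SplitMix64) -> usize {
        let total = *self.cumulative.last().expect("non-empty by construction");
        let x = rng.below(total);
        self.cumulative.partition_point(|&c| c <= x)
    }
}
```

The table stores one cumulative value per weight, which matches the point about memory: it is the same length as the input.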

Where one wants to sample a single time from a slice, one of the first two options makes the most sense. Calculating the total weight requires all the work of calculating the CDF except storing the results, so using the CDF may often be the best option, but this isn't guaranteed.

Where one wants to sample a single time from an iterator, any of the above can be used, but the first two options require either cloning the iterator and iterating twice (not always possible and slightly expensive) or collecting all items into a temporary vector while calculating the sum/CDF, then selecting the required item. In this case the last option may be attractive, though of course sampling the RNG for every item has significant overhead (so it is probably only useful for large elements or when no allocator is available).
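
To make the "collect into a temporary vector" variant concrete, here is a sketch assuming u32 weights. The function name and the draw callback are hypothetical: draw(total) stands in for something like rng.gen_range(0, total), and is passed as a closure only so that the example needs no RNG.

```rust
/// Hypothetical helper: buffer every item next to its cumulative weight,
/// draw one sample, then binary-search the buffer.
fn choose_by_collecting<I, F, D>(items: I, weight_fn: F, draw: D) -> Option<I::Item>
where
    I: Iterator,
    F: Fn(&I::Item) -> u32,
    D: FnOnce(u32) -> u32, // must return a value in [0, total)
{
    let mut total = 0u32;
    let mut buffered: Vec<(u32, I::Item)> = Vec::new();
    for item in items {
        total += weight_fn(&item);
        buffered.push((total, item));
    }
    if total == 0 {
        return None; // no items, or all weights are zero
    }
    let sample = draw(total);
    // Index of the first entry whose cumulative weight exceeds the sample.
    let index = buffered.partition_point(|(c, _)| *c <= sample);
    buffered.into_iter().nth(index).map(|(_, item)| item)
}
```

This iterates once and draws once, at the cost of an allocation proportional to the number of items.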

Which algorithm(s) should we include in Rand?

The method calculating the CDF will often be preferred, so it should be included. Unfortunately it requires an allocator (except when the weights are provided via a mutable reference to a slice), but we should probably not worry about this.

A convenience method to sample from weighted slices would presumably prefer to use the CDF method normally.

For a method to sample from weighted iterators it is less clear which implementation should be used. Although it will not perform well, the last algorithm (i.e. sample code above) may be a nice choice in that it does not require an allocator.
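
As a rough illustration of that allocator-free approach, here is a self-contained variant of the sample code with u32 weights. It is a sketch, not a proposed implementation: SplitMix64 is a stand-in RNG so that it runs without the Rand crate, and unlike the code above it skips zero-weight items so the running sum is never zero when drawing.

```rust
/// Hypothetical stand-in RNG (SplitMix64); not part of Rand.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Value in [0, bound); slightly biased modulo reduction, fine for a sketch.
    fn below(&mut self, bound: u64) -> u64 {
        self.next_u64() % bound
    }
}

/// Single pass, no allocation, one RNG draw per non-zero-weight item.
fn choose_single_pass<I, F>(items: I, weight_fn: F, rng: &mut SplitMix64) -> Option<I::Item>
where
    I: Iterator,
    F: Fn(&I::Item) -> u32,
{
    let mut result = None;
    let mut sum = 0u64;
    for item in items {
        let weight = u64::from(weight_fn(&item));
        if weight == 0 {
            continue; // a zero-weight item can never be chosen
        }
        sum += weight;
        // Keep the new item with probability weight / sum.
        if rng.below(sum) < weight {
            result = Some(item);
        }
    }
    result
}
```

The per-item draw is where the overhead comes from: n items cost up to n RNG calls, compared with a single call for the other two options.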

My conclusion: perhaps we should accept #518 in its current form (i.e. WeightedIndex distribution using CDF + binary search, and convenience wrappers for slices), plus consider adding the code here to sample from iterators.


sicking · 2018-06-27T18:32:43Z

Oh, neat! I hadn't thought of the third option at all! Very cool!

One thing to also keep in mind is that option 1 comes in two flavors: either the caller provides the total weight, or we need to calculate it.
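
A sketch of the caller-provides-the-total flavor, assuming u32 weights. The function name and the draw callback are hypothetical; draw(total) stands in for something like rng.gen_range(0, total). With the total known up front, an iterator can be sampled in one pass with one draw and no allocation.

```rust
/// Hypothetical helper: option 1 when the caller already knows the total weight.
fn choose_with_total<I, F, D>(items: I, weight_fn: F, total: u32, draw: D) -> Option<I::Item>
where
    I: Iterator,
    F: Fn(&I::Item) -> u32,
    D: FnOnce(u32) -> u32, // must return a value in [0, total)
{
    if total == 0 {
        return None;
    }
    let sample = draw(total);
    let mut cumulative = 0u32;
    for item in items {
        cumulative += weight_fn(&item);
        if cumulative > sample {
            return Some(item); // stops early; later items are never visited
        }
    }
    None // the supplied total was larger than the actual sum of weights
}
```

The catch is the last line: if the supplied total does not match the real sum, the result is either biased or missing, so this flavor pushes a correctness burden onto the caller.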

For repeated sampling, a CDF seems indeed like the best option.

For single sampling the equation does seem more complicated. It's a tradeoff between how fast the iterator is, how fast the RNG is, and even how fast AddAssign and gen_range are for the weight type (in theory someone could use BigInts as weights). And it depends on constraints like whether allocation is available, whether the iterator is Clone, and whether the caller can provide the total weight easily.

For non-weighted sampling we're providing all options, and letting the caller worry about picking the one that's faster or more convenient for their case. But I don't think that makes sense for weighted sampling, given that it's likely much less commonly used.

My thinking as of late has been to not worry about performance for single sampling. Generally performance of operations done once rarely matters. The only case I could think of where it matters is if someone does repeated sampling, but where the weights can change between each sampling. But this seems even more rare. And even in that case the optimal solution depends on if the caller can easily maintain a list of cumulative weights and if the total weight changes or not.

In short, for single sampling it feels like the design space is huge, and the performance often does not matter.

So my suggestion is to just provide a performance-optimized API for repeated sampling. Single sampling can then use that same API and just sample once. If that doesn't provide good enough performance, callers can implement whatever solution fits their constraints best.

But then also provide APIs optimized for convenience, since often that's at least as useful to callers as providing perf-optimized solutions.

dhardy · 2018-06-27T19:10:08Z

Good reasoning; I agree with you on that. Some single-usage stuff gets used a lot (e.g. gen_range), but it seems less likely with weighted sampling. There are indeed way too many variants to write an optimal general case.

sicking · 2018-06-27T19:29:23Z

Yeah, I suspect gen_range is used both in cases where the range changes between sampling, and when the simplicity of gen_range is more important than the performance benefit of Uniform.

sicking · 2018-07-05T03:14:58Z

Should we close this now that #518 has landed?

dhardy added the E-question (Participation: opinions wanted) and X-discussion (Type: discussion) labels on Jun 27, 2018

dhardy closed this as completed Jul 5, 2018
