
seq: use Floyd's combination algorithm to sample indices #479

Merged: 14 commits into rust-random:master from sample_floyd, Jul 30, 2018

Conversation

@dhardy (Member) commented May 25, 2018

Adapt #144 for sampling indices

This isn't the fully generic implementation that @burdges wrote; we could make this more generic. I'm guessing we could include an arbitrary lower bound without performance overhead (since the compiler can probably eliminate a constant of 0), and could also use generic typing.

I'm not really sure if it should be committed like this. It cleans up the code and improves performance in some cases, but isn't optimal in others. We could keep the old sample_indices_inplace and just replace sample_indices_cache.
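For reference, the core of Floyd's combination algorithm is roughly the following (a sketch only, using the two-argument gen_range of this rand era; the committed code differs in types and details):

    use rand::Rng;

    // Floyd's combination algorithm: sample `amount` distinct indices
    // from 0..length. O(amount) memory; the linear `contains` check
    // makes it O(amount^2) time.
    fn floyd_sketch<R: Rng + ?Sized>(rng: &mut R, length: usize, amount: usize) -> Vec<usize> {
        debug_assert!(amount <= length);
        let mut indices = Vec::with_capacity(amount);
        for j in (length - amount)..length {
            // t is uniform in 0..=j (gen_range's upper bound is exclusive).
            let t = rng.gen_range(0, j + 1);
            // If t was already sampled, j is guaranteed fresh: earlier
            // iterations only produced values up to their own smaller j.
            let x = if indices.contains(&t) { j } else { t };
            indices.push(x);
        }
        indices
    }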

# before
test gen_1k_sample_iter              ... bench:         390 ns/iter (+/- 4) = 2625 MB/s                                                
test misc_sample_indices_100_of_1k   ... bench:         806 ns/iter (+/- 8)                                                            
test misc_sample_indices_10_of_1k    ... bench:         594 ns/iter (+/- 9)                                                            
test misc_sample_indices_50_of_1k    ... bench:         472 ns/iter (+/- 8)                                                            
test misc_sample_iter_10_of_100      ... bench:       1,072 ns/iter (+/- 15)                                                           
test misc_sample_slice_10_of_100     ... bench:         160 ns/iter (+/- 2)                                                            
test misc_sample_slice_ref_10_of_100 ... bench:         160 ns/iter (+/- 2)                                                            
# after                                                                                                       
test gen_1k_sample_iter              ... bench:         391 ns/iter (+/- 27) = 2618 MB/s
test misc_sample_indices_100_of_1k   ... bench:       2,031 ns/iter (+/- 39)
test misc_sample_indices_10_of_1k    ... bench:          90 ns/iter (+/- 1)
test misc_sample_indices_50_of_1k    ... bench:         649 ns/iter (+/- 11)
test misc_sample_iter_10_of_100      ... bench:       1,083 ns/iter (+/- 16)
test misc_sample_slice_10_of_100     ... bench:         157 ns/iter (+/- 1)
test misc_sample_slice_ref_10_of_100 ... bench:         158 ns/iter (+/- 3)       

Edit: now:

test misc_sample_indices_100_of_1k           ... bench:         515 ns/iter (+/- 24)
test misc_sample_indices_10_of_1k            ... bench:          86 ns/iter (+/- 3)
test seq_iter_choose_multiple_10_of_100      ... bench:       1,137 ns/iter (+/- 319)
test seq_iter_choose_multiple_fill_10_of_100 ... bench:       1,088 ns/iter (+/- 35)
test seq_slice_choose_multiple_10_of_100     ... bench:         139 ns/iter (+/- 8)

@dhardy (Member, Author) commented May 25, 2018

CC @vitiral who wrote the old implementation

@dhardy (Member, Author) commented May 26, 2018

I realise that this is O(m^2) due to the use of Vec::contains, so for large m = amount it would make sense to use a tree or hash set instead; for small m, however, a simple Vec works very well.

The partial Fisher-Yates shuffle avoids the need for comparison and should thus be faster when the time/memory needed to allocate the whole sample space is not significant; essentially this algorithm is O(n) for n=length.
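For comparison, the inplace method is essentially this (again a sketch, not the exact committed code):

    use rand::Rng;

    // Partial Fisher-Yates ("inplace") method: initialise the whole
    // index space, randomly swap each of the first `amount` positions
    // with a later one, then truncate. O(length) time and memory,
    // with no membership checks.
    fn inplace_sketch<R: Rng + ?Sized>(rng: &mut R, length: u32, amount: u32) -> Vec<u32> {
        debug_assert!(amount <= length);
        let mut indices: Vec<u32> = (0..length).collect();
        for i in 0..amount {
            let j = rng.gen_range(i, length); // j in i..length
            indices.swap(i as usize, j as usize);
        }
        indices.truncate(amount as usize);
        indices
    }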

Does this mean that for efficiency we should have three backing implementations? Not ideal.


It may be worth pointing out that there is a shuffling variant of Floyd's algorithm (if s contains t, then put j just before t in s); however, it may be difficult to implement efficiently (the buffer must be ordered and should allow efficient insertion anywhere).
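A sketch of that shuffling variant, using a Vec despite the O(amount) cost of each mid-vector insertion (hypothetical, not what this PR commits):

    use rand::Rng;

    // Shuffling variant of Floyd's algorithm: when t is already present,
    // insert j immediately before t, which makes the output a uniformly
    // shuffled sample without extra random draws.
    fn floyd_shuffled_sketch<R: Rng + ?Sized>(rng: &mut R, length: usize, amount: usize) -> Vec<usize> {
        debug_assert!(amount <= length);
        let mut s = Vec::with_capacity(amount);
        for j in (length - amount)..length {
            let t = rng.gen_range(0, j + 1);
            match s.iter().position(|&x| x == t) {
                Some(pos) => s.insert(pos, j), // j goes just before t
                None => s.push(t),
            }
        }
        s
    }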

@vitiral (Contributor) commented May 28, 2018

@dhardy thanks for including me, but unfortunately I won't be able to comment for a while because of other commitments.

@dhardy (Member, Author) commented Jun 1, 2018

Some more benchmarks:

test misc_sample_indices_cache_1000_of_100k   ... bench:      64,554 ns/iter (+/- 6,677)
test misc_sample_indices_cache_1000_of_10k    ... bench:      69,582 ns/iter (+/- 4,090)
test misc_sample_indices_cache_1000_of_1M     ... bench:      61,931 ns/iter (+/- 1,209)
test misc_sample_indices_cache_1000_of_1k     ... bench:      66,852 ns/iter (+/- 1,548)
test misc_sample_indices_cache_100_of_100k    ... bench:       7,502 ns/iter (+/- 487)
test misc_sample_indices_cache_100_of_10k     ... bench:       7,899 ns/iter (+/- 259)
test misc_sample_indices_cache_100_of_1M      ... bench:       7,216 ns/iter (+/- 349)
test misc_sample_indices_cache_100_of_1k      ... bench:       7,218 ns/iter (+/- 211)
test misc_sample_indices_cache_10_of_100k     ... bench:         629 ns/iter (+/- 23)
test misc_sample_indices_cache_10_of_10k      ... bench:         660 ns/iter (+/- 21)
test misc_sample_indices_cache_10_of_1M       ... bench:         594 ns/iter (+/- 27)
test misc_sample_indices_cache_10_of_1k       ... bench:         594 ns/iter (+/- 16)

test misc_sample_indices_floyd_1000_of_100k   ... bench:     118,336 ns/iter (+/- 3,468)
test misc_sample_indices_floyd_1000_of_10k    ... bench:     120,771 ns/iter (+/- 2,713)
test misc_sample_indices_floyd_1000_of_1M     ... bench:     114,288 ns/iter (+/- 4,075)
test misc_sample_indices_floyd_1000_of_1k     ... bench:      69,342 ns/iter (+/- 1,754)
test misc_sample_indices_floyd_100_of_100k    ... bench:       2,376 ns/iter (+/- 95)
test misc_sample_indices_floyd_100_of_10k     ... bench:       2,991 ns/iter (+/- 45)
test misc_sample_indices_floyd_100_of_1M      ... bench:       1,776 ns/iter (+/- 100)
test misc_sample_indices_floyd_100_of_1k      ... bench:       2,011 ns/iter (+/- 69)
test misc_sample_indices_floyd_10_of_100k     ... bench:         127 ns/iter (+/- 4)
test misc_sample_indices_floyd_10_of_10k      ... bench:         166 ns/iter (+/- 10)
test misc_sample_indices_floyd_10_of_1M       ... bench:          96 ns/iter (+/- 4)
test misc_sample_indices_floyd_10_of_1k       ... bench:          95 ns/iter (+/- 4)

test misc_sample_indices_inplace_1000_of_100k ... bench:      37,888 ns/iter (+/- 1,775)
test misc_sample_indices_inplace_1000_of_10k  ... bench:      15,556 ns/iter (+/- 706)
test misc_sample_indices_inplace_1000_of_1M   ... bench:   1,684,064 ns/iter (+/- 32,348)
test misc_sample_indices_inplace_1000_of_1k   ... bench:      10,121 ns/iter (+/- 269)
test misc_sample_indices_inplace_100_of_100k  ... bench:      29,498 ns/iter (+/- 914)
test misc_sample_indices_inplace_100_of_10k   ... bench:       3,368 ns/iter (+/- 141)
test misc_sample_indices_inplace_100_of_1M    ... bench:   1,678,077 ns/iter (+/- 69,038)
test misc_sample_indices_inplace_100_of_1k    ... bench:         813 ns/iter (+/- 33)
test misc_sample_indices_inplace_10_of_100k   ... bench:      28,706 ns/iter (+/- 624)
test misc_sample_indices_inplace_10_of_10k    ... bench:       2,280 ns/iter (+/- 43)
test misc_sample_indices_inplace_10_of_1M     ... bench:   1,661,896 ns/iter (+/- 20,477)
test misc_sample_indices_inplace_10_of_1k     ... bench:         223 ns/iter (+/- 4)

test misc_sample_indices_floyd_hash_1000_of_100k ... bench:      97,342 ns/iter (+/- 1,582)
test misc_sample_indices_floyd_hash_1000_of_10k  ... bench:     102,381 ns/iter (+/- 1,273)
test misc_sample_indices_floyd_hash_1000_of_1M   ... bench:      94,833 ns/iter (+/- 1,037)
test misc_sample_indices_floyd_hash_1000_of_1k   ... bench:     105,910 ns/iter (+/- 1,225)
test misc_sample_indices_floyd_hash_100_of_100k  ... bench:       7,376 ns/iter (+/- 135)
test misc_sample_indices_floyd_hash_100_of_10k   ... bench:       7,726 ns/iter (+/- 281)
test misc_sample_indices_floyd_hash_100_of_1M    ... bench:       7,097 ns/iter (+/- 126)
test misc_sample_indices_floyd_hash_100_of_1k    ... bench:       7,133 ns/iter (+/- 336)
test misc_sample_indices_floyd_hash_10_of_100k   ... bench:         585 ns/iter (+/- 13)
test misc_sample_indices_floyd_hash_10_of_10k    ... bench:         625 ns/iter (+/- 12)
test misc_sample_indices_floyd_hash_10_of_1M     ... bench:         559 ns/iter (+/- 19)
test misc_sample_indices_floyd_hash_10_of_1k     ... bench:         558 ns/iter (+/- 20)

test misc_sample_indices_floyd_tree_1000_of_100k ... bench:     126,655 ns/iter (+/- 2,071)
test misc_sample_indices_floyd_tree_1000_of_10k  ... bench:     130,358 ns/iter (+/- 5,582)
test misc_sample_indices_floyd_tree_1000_of_1M   ... bench:     123,179 ns/iter (+/- 3,541)
test misc_sample_indices_floyd_tree_1000_of_1k   ... bench:     112,819 ns/iter (+/- 3,700)
test misc_sample_indices_floyd_tree_100_of_100k  ... bench:       8,533 ns/iter (+/- 579)
test misc_sample_indices_floyd_tree_100_of_10k   ... bench:       9,081 ns/iter (+/- 542)
test misc_sample_indices_floyd_tree_100_of_1M    ... bench:       8,155 ns/iter (+/- 246)
test misc_sample_indices_floyd_tree_100_of_1k    ... bench:       8,231 ns/iter (+/- 224)
test misc_sample_indices_floyd_tree_10_of_100k   ... bench:         514 ns/iter (+/- 10)
test misc_sample_indices_floyd_tree_10_of_10k    ... bench:         563 ns/iter (+/- 25)
test misc_sample_indices_floyd_tree_10_of_1M     ... bench:         476 ns/iter (+/- 18)
test misc_sample_indices_floyd_tree_10_of_1k     ... bench:         475 ns/iter (+/- 6)

Observations:

  • performance of cache method depends only on amount but has high overhead
  • performance of Floyd's alg should depend only on amount but has some variance (it's actually better for large length)
  • using a HashSet or BTreeSet to speed up the contains check in Floyd's (to beat O(m^2)) only boosts performance from around 1000 elements, and never beats the cache method (see the sketch at the end of this comment)
    // Fastest for various params:
    // amount = 10, length >= 1k: floyd
    // amount = 100, length = 1k: inplace
    // amount = 100, length >= 10k: floyd
    // amount = 1000, length in {1k, 10k, 100k}: inplace
    // amount = 1000, length = 1M: cache

I'll do some more testing to try to find the best algorithm (on my system, at least).
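For reference, the hash-set variant benchmarked above ("floyd_hash") is just Floyd's with an O(1) membership check; a sketch for comparison only:

    use std::collections::HashSet;
    use rand::Rng;

    // Floyd's algorithm with a HashSet so each membership check is
    // O(1) amortised, at the cost of hashing overhead and extra memory.
    fn floyd_hash_sketch<R: Rng + ?Sized>(rng: &mut R, length: usize, amount: usize) -> Vec<usize> {
        debug_assert!(amount <= length);
        let mut set = HashSet::with_capacity(amount);
        let mut indices = Vec::with_capacity(amount);
        for j in (length - amount)..length {
            let t = rng.gen_range(0, j + 1);
            let x = if set.contains(&t) { j } else { t };
            set.insert(x);
            indices.push(x);
        }
        indices
    }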

@dhardy (Member, Author) commented Jun 1, 2018

Note: the above benchmarks use usize in all cases; as mentioned in #202 I decided to make the inplace method use u32 instead. Benches now:

test misc_sample_indices_inplace_1000_of_100k ... bench:      19,660 ns/iter (+/- 1,147)
test misc_sample_indices_inplace_1000_of_10k  ... bench:      11,082 ns/iter (+/- 357)
test misc_sample_indices_inplace_1000_of_1M   ... bench:     818,990 ns/iter (+/- 25,323)
test misc_sample_indices_inplace_1000_of_1k   ... bench:       6,889 ns/iter (+/- 515)
test misc_sample_indices_inplace_1000_of_316k ... bench:      54,135 ns/iter (+/- 2,276)
test misc_sample_indices_inplace_1000_of_32k  ... bench:       7,327 ns/iter (+/- 603)
test misc_sample_indices_inplace_1000_of_3k   ... bench:       8,500 ns/iter (+/- 184)
test misc_sample_indices_inplace_100_of_100k  ... bench:      14,622 ns/iter (+/- 670)
test misc_sample_indices_inplace_100_of_10k   ... bench:       2,061 ns/iter (+/- 124)
test misc_sample_indices_inplace_100_of_1M    ... bench:     811,210 ns/iter (+/- 36,437)
test misc_sample_indices_inplace_100_of_1k    ... bench:         509 ns/iter (+/- 18)
test misc_sample_indices_inplace_100_of_316k  ... bench:      45,695 ns/iter (+/- 1,737)
test misc_sample_indices_inplace_100_of_32k   ... bench:       3,697 ns/iter (+/- 121)
test misc_sample_indices_inplace_100_of_3k    ... bench:         859 ns/iter (+/- 24)
test misc_sample_indices_inplace_10_of_100k   ... bench:      13,532 ns/iter (+/- 598)
test misc_sample_indices_inplace_10_of_10M    ... bench:   8,882,829 ns/iter (+/- 153,151)
test misc_sample_indices_inplace_10_of_10k    ... bench:       1,212 ns/iter (+/- 20)
test misc_sample_indices_inplace_10_of_1M     ... bench:     806,483 ns/iter (+/- 4,755)
test misc_sample_indices_inplace_10_of_1k     ... bench:         130 ns/iter (+/- 5)
test misc_sample_indices_inplace_10_of_30M    ... bench:  28,095,419 ns/iter (+/- 3,229,494)
test misc_sample_indices_inplace_10_of_316k   ... bench:      45,786 ns/iter (+/- 3,537)
test misc_sample_indices_inplace_10_of_32k    ... bench:       3,381 ns/iter (+/- 201)
test misc_sample_indices_inplace_10_of_3M     ... bench:   2,604,510 ns/iter (+/- 92,508)
test misc_sample_indices_inplace_10_of_3k     ... bench:         302 ns/iter (+/- 9)
test misc_sample_indices_inplace_10_of_500k   ... bench:      74,140 ns/iter (+/- 17,188)
test misc_sample_indices_inplace_10_of_600k   ... bench:     478,341 ns/iter (+/- 11,052)
test misc_sample_indices_inplace_10_of_700k   ... bench:     559,066 ns/iter (+/- 12,078)
test misc_sample_indices_inplace_316_of_100k  ... bench:      15,666 ns/iter (+/- 482)
test misc_sample_indices_inplace_316_of_10k   ... bench:       4,107 ns/iter (+/- 135)
test misc_sample_indices_inplace_316_of_1M    ... bench:     805,479 ns/iter (+/- 22,227)
test misc_sample_indices_inplace_316_of_1k    ... bench:       1,795 ns/iter (+/- 90)
test misc_sample_indices_inplace_316_of_316k  ... bench:      47,693 ns/iter (+/- 2,916)
test misc_sample_indices_inplace_316_of_32k   ... bench:       4,532 ns/iter (+/- 101)
test misc_sample_indices_inplace_316_of_3k    ... bench:       2,322 ns/iter (+/- 47)
test misc_sample_indices_inplace_32_of_100k   ... bench:      14,121 ns/iter (+/- 408)
test misc_sample_indices_inplace_32_of_10k    ... bench:       1,403 ns/iter (+/- 52)
test misc_sample_indices_inplace_32_of_1M     ... bench:     805,801 ns/iter (+/- 41,423)
test misc_sample_indices_inplace_32_of_1k     ... bench:         221 ns/iter (+/- 29)
test misc_sample_indices_inplace_32_of_316k   ... bench:      45,390 ns/iter (+/- 1,932)
test misc_sample_indices_inplace_32_of_32k    ... bench:       3,571 ns/iter (+/- 218)
test misc_sample_indices_inplace_32_of_3k     ... bench:         439 ns/iter (+/- 26)

@dhardy (Member, Author) commented Jun 1, 2018

These were the best models of execution time I was able to make. I have a Gnumeric spreadsheet of the models; in text form:

Models:
n = length
m = amount
b1 = 11
b2 = 0.11
c1 = 67.9
d1=6
d2=1
e1=0.025
e2=1.15
f1=500000
f2=0.9

Floyd's alg
f(m) = (b1 + b2*m)*m = b1*m + b2*m^2
Cache
c(m) = c1*m
Inplace
i(m,n) = if n < f1 { d1*m^d2 + e1*n^e2 } else { f2*n }
    = if n < f1 { d1*m + e1*n^e2 } else { f2*n }

Find the boundaries:

f(m) < c(m)
=>  b1*m + b2*m^2 < c1*m
=>  b1 + b2*m < c1
=>  m < (c1 - b1) / b2 ~= 517

i(m,n) < c(m)
=>  if n < f1 { d1*m + e1*n^e2 } else { f2*n } < c1*m
=>  if n < f1 { d1*m + e1*n^e2 < c1*m } else { f2*n < c1*m }
=>  if n < f1 { e1*n^e2 < (c1 - d1)*m } else { f2*n < c1*m }

i(m,n) < f(m)
=>  if n < f1 { d1*m + e1*n^e2 } else { f2*n } < b1*m + b2*m^2
=>  if n < f1 { d1*m + e1*n^e2 < b1*m + b2*m^2 } else { f2*n < b1*m + b2*m^2 }
=>  if n < f1 { e1*n^e2 < (b1 - d1)*m + b2*m^2 } else { f2*n < b1*m + b2*m^2 }

Unfortunately the performance of the "inplace" method is hard to model, and n^e2 requires pow or exp to compute (slow, and std-only). Simplifying the model with e1=0.1, e2=1 is perhaps good enough for our purposes. This lets us simplify the models:

i(m,n) = if n < f1 { d1*m + e1*n } else { f2*n }

i(m,n) < c(m)
=>  if n < f1 { e1*n < (c1 - d1)*m } else { f2*n < c1*m }
~>  if n < 500'000 { 0.025*n < 61.9*m } else { 0.9*n < 67.9*m }

i(m,n) < f(m)
=>  if n < f1 { e1*n < (b1 - d1)*m + b2*m^2 } else { f2*n < b1*m + b2*m^2 }
=>  if n < 500'000 { 0.025*n < 5*m + 0.11*m^2 } else { 0.9*n < 11*m + 0.11*m^2 }

So:
if n < 500'000 {
    if m < 517 {
        if 0.025*n < 5*m + 0.1*m^2 {
            use inplace
        } else {
            use floyd
        }
    } else {
        if 0.025*n < 62*m {
            use inplace
        } else {
            use cached
        }
    }
} else {
    if m < 517 {
        if 0.9*n < 11*m + 0.1*m^2 {
            use inplace
        } else {
            use floyd
        }
    } else {
        if 0.9*n < 68*m {
            use inplace
        } else {
            use cached
        }
    }
}

Of course, this is still complicated. Possibly we can multiply everything by some large-ish value (avoiding overflow) then use integer approximations — at least, where we know n,m are small.
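For example, multiplying the first boundary through by 40 turns 0.025*n < 5*m + 0.1*m^2 into the pure-integer test n < 200*m + 4*m^2; likewise 0.9*n < 11*m + 0.1*m^2 becomes 36*n < 440*m + 4*m^2, and the two large-m boundaries become n < 2480*m and 36*n < 2720*m. These are exactly the constants that appear in the snippet in the next comment.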

A further thing of note is that the inplace method would have memory issues with very large n. The jump at n=500'000 (approx) above is due to a very noticeable slow-down, presumably when paging comes into play (500,000 u32 values is approx. 2 MB). We could simply never use inplace for n>500'000 (though it's still potentially the fastest for large amount).

Another thing to note: Floyd's alg is the only practical option if there is no allocator, and should always use the least memory.

For extremely large m, Floyd's alg is unusably slow (remember, it's O(m^2)), and the cached alg uses a lot of memory.

Thoughts? It does seem silly doing a huge amount of work to find the fastest algorithm.

@dhardy (Member, Author) commented Jun 1, 2018

I reduced the logic slightly (edit: this version has j wrong; committed version is correct):

    if amount < 517 {
        const C: [[usize; 2]; 2] = [[1, 36], [200, 440]];
        let j = (length < 500_000) as usize;
        let m4 = 4 * amount;
        if C[0][j] * length < (C[1][j] + m4) * amount {
            sample_indices_inplace(rng, length as u32, amount as u32)
                .into_iter()
                .map(|x| x as usize)
                .collect()
        } else {
            sample_indices_floyd(rng, length, amount as u32)
        }
    } else {
        const C: [[usize; 2]; 2] = [[1, 36], [62*40, 68*40]];
        let j = (length < 500_000) as usize;
        if C[0][j] * length < C[1][j] * amount {
            sample_indices_inplace(rng, length as u32, amount as u32)
                .into_iter()
                .map(|x| x as usize)
                .collect()
        } else {
            sample_indices_cache(rng, length, amount)
        }
    }

This now comes with a warning, though it's barely necessary:

Panics if amount > length; may panic with extremely large amount or
length (when 36*length or 2720*amount overflows usize).

Although we have to convert the arrays to usize after the inplace algorithm, this doesn't have a big impact in the benchmarks (whereas using u32 internally did), so I guess we just live with it.

I think it may also be worth exposing the various implementations directly. I will also investigate making the Floyd version usable without an allocator.

@dhardy (Member, Author) commented Jun 1, 2018

PR updated. Benchmark summary:

# old:
test misc_sample_indices_100_of_1G   ... bench:       7,221 ns/iter (+/- 190)
test misc_sample_indices_100_of_1M   ... bench:       7,169 ns/iter (+/- 439)
test misc_sample_indices_100_of_1k   ... bench:         805 ns/iter (+/- 44)
test misc_sample_indices_10_of_1k    ... bench:         596 ns/iter (+/- 11)
test misc_sample_indices_1_of_1k     ... bench:          79 ns/iter (+/- 3)
test misc_sample_indices_1k_of_1G    ... bench:      61,763 ns/iter (+/- 2,831)

# new:
test misc_sample_indices_100_of_1G   ... bench:       1,903 ns/iter (+/- 76)
test misc_sample_indices_100_of_1M   ... bench:       1,857 ns/iter (+/- 87)
test misc_sample_indices_100_of_1k   ... bench:         557 ns/iter (+/- 48)
test misc_sample_indices_10_of_1k    ... bench:         167 ns/iter (+/- 9)
test misc_sample_indices_1_of_1k     ... bench:          26 ns/iter (+/- 0)
test misc_sample_indices_1k_of_1G    ... bench:      61,545 ns/iter (+/- 4,593)

Who'd have thought sampling a unique subset of elements would be so complex! I think it's done now.

@dhardy (Member, Author) commented Jun 2, 2018

Interesting. The tests fail on x86 because usize is only 32-bit and I wanted to test with big numbers.

This brings up a small dilemma — do users ever actually want to sample indices in excess of 2^32-1? Regarding actual indices, I doubt it (this would imply a byte-array in excess of 4 GiB or 32-bit word array in excess of 16 GiB). For other possible uses, I don't know (though the functions aren't exactly optimal for all uses anyway — e.g. for sampling from a range starting above 0).

Possible tweaks:

  • Only support u32 indices. This saves us having to expand (u32 → usize) values in sample_indices; users can instead expand when indexing an array.
  • u32 and u64 versions. Probably unnecessary. Note that we wouldn't want u64 impl of the inplace method, so that's only 5 back-ends and 2 front-ends (the sample_indices function). We could potentially use generics.
  • Support non-zero starting point (i.e. low..high instead of 0..length range). All algs could be adapted; not sure how much overhead, or how often low != 0 would be used.
  • Support Floyd's without an allocator (sketched below). Simple to code, but significantly easier to just add a new method for this than to make the current method support both optimally. Given the likely low number of users, I'm not sure there's much point.
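A sketch of what the allocator-free variant might look like, filling a caller-provided buffer (hypothetical, not part of this PR):

    use rand::Rng;

    // Floyd's algorithm writing into a fixed buffer, usable without an
    // allocator; buf.len() plays the role of `amount`.
    fn floyd_into_sketch<R: Rng + ?Sized>(rng: &mut R, length: u32, buf: &mut [u32]) {
        let amount = buf.len() as u32;
        debug_assert!(amount <= length);
        for (i, j) in ((length - amount)..length).enumerate() {
            let t = rng.gen_range(0, j + 1);
            // Only buf[..i] holds samples so far.
            buf[i] = if buf[..i].contains(&t) { j } else { t };
        }
    }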

@burdges (Contributor) commented Jun 2, 2018

Yes, u32 indices sound fine. There's a good chance that anyone who needs more might need custom optimizations anyway.

@dhardy (Member, Author) commented Jun 2, 2018

I chose to support u32 only. This implementation casts down from usize input and panics on overflow (instead of silently dropping the high bits like an `as` cast does). Benchmarks are now a little better:

test misc_sample_indices_100_of_1G   ... bench:       1,604 ns/iter (+/- 46)
test misc_sample_indices_100_of_1M   ... bench:       1,539 ns/iter (+/- 113)
test misc_sample_indices_100_of_1k   ... bench:         513 ns/iter (+/- 20)
test misc_sample_indices_10_of_1k    ... bench:         133 ns/iter (+/- 1)
test misc_sample_indices_1_of_1k     ... bench:          26 ns/iter (+/- 1)
test misc_sample_indices_1k_of_1G    ... bench:      55,675 ns/iter (+/- 1,318)

@dhardy added the T-sequences (Topic: sequences) and D-review (Do: needs review) labels on Jun 2, 2018
@dhardy (Member, Author) commented Jun 2, 2018

I noticed that we lost one feature: the output is now not fully randomised. Floyd's algorithm has a variant with full randomisation (without requiring extra random draws), but it requires insertions at arbitrary positions. Since this is only used with amounts under 517 anyway, the result isn't too bad, but it does still reduce performance:

# before; other tests are above:
test misc_sample_indices_500_of_1G   ... bench:      30,343 ns/iter (+/- 650)
test misc_sample_indices_600_of_1G   ... bench:      35,488 ns/iter (+/- 1,488)
# with full randomisation:
test misc_sample_indices_100_of_1G   ... bench:       2,745 ns/iter (+/- 38)
test misc_sample_indices_100_of_1M   ... bench:       2,732 ns/iter (+/- 100)
test misc_sample_indices_100_of_1k   ... bench:         506 ns/iter (+/- 19)
test misc_sample_indices_10_of_1k    ... bench:         132 ns/iter (+/- 6)
test misc_sample_indices_1_of_1k     ... bench:          26 ns/iter (+/- 1)
test misc_sample_indices_500_of_1G   ... bench:      40,744 ns/iter (+/- 1,988) # uses floyd's alg
test misc_sample_indices_600_of_1G   ... bench:      35,456 ns/iter (+/- 1,232) # uses cached alg

Obviously this implies the selection heuristics need adjustment, but is it worth adding up to 77% overhead for a feature not everyone wants? Probably not. On the other hand, some people will want it, and in many cases the overhead is small or even zero.

It is possible to add a shuffled: bool parameter to sample_indices with very little overhead (except a bit more code in the Floyd impl); is this worthwhile? It also implies that the parameter should be added to sample_slice.

@dhardy (Member, Author) commented Jun 2, 2018

Updated with the mentioned shuffled parameter. I don't exactly like the new API, but regarding performance/capabilities this seems to be a good choice. Comments?

New benchmarks:

# full shuffling:
test misc_sample_indices_100_of_1G   ... bench:       2,699 ns/iter (+/- 31)
test misc_sample_indices_100_of_1M   ... bench:       2,694 ns/iter (+/- 45)
test misc_sample_indices_100_of_1k   ... bench:         508 ns/iter (+/- 1)
test misc_sample_indices_10_of_1k    ... bench:          71 ns/iter (+/- 0)
test misc_sample_indices_1_of_1k     ... bench:          28 ns/iter (+/- 0)
test misc_sample_indices_400_of_1G   ... bench:      27,095 ns/iter (+/- 1,254) # floyd's
test misc_sample_indices_500_of_1G   ... bench:      28,048 ns/iter (+/- 3,225) # cached
# without full shuffling:
test misc_sample_indices_100_of_1G   ... bench:       1,625 ns/iter (+/- 18)
test misc_sample_indices_100_of_1M   ... bench:       1,580 ns/iter (+/- 88)
test misc_sample_indices_100_of_1k   ... bench:         515 ns/iter (+/- 1)
test misc_sample_indices_10_of_1k    ... bench:          75 ns/iter (+/- 0)
test misc_sample_indices_1_of_1k     ... bench:          28 ns/iter (+/- 3)
test misc_sample_indices_400_of_1G   ... bench:      20,399 ns/iter (+/- 116)
test misc_sample_indices_500_of_1G   ... bench:      27,722 ns/iter (+/- 245)

Model parameters are now:

b1 = 7
b2 = 0.12 # compromise; better would be 0.1 without shuffling and 0.14 with
c1 = 60
d1=6
d2=1
e1=0.1
e2=1
f1=500000
f2=0.9

@dhardy added the B-API (Breakage: API) label on Jun 2, 2018
@vitiral (Contributor) commented Jun 2, 2018 via email

@dhardy force-pushed the sample_floyd branch 2 times, most recently from 5c6be22 to 299307d, on June 7, 2018
@dhardy (Member, Author) commented Jun 7, 2018

Rebased for 0.6 master branch. I think this is ready to be merged now, if the API changes are acceptable?

I know it's very clunky having an extra parameter on the methods and two implementations of Floyd's algorithm, but considering the shuffling is sometimes useful and quite a lot cheaper than shuffling later, this seems a good choice?

@vks (Collaborator) commented Jun 7, 2018

The heuristics seem very machine-dependent (or even compiler-flag-dependent). Any way to test them on different setups? Maybe it also makes sense to expose the different algorithms so users can pick their preferred one.

@dhardy (Member, Author) commented Jun 7, 2018

Yes, the heuristics are very tailored to one machine. In theory the work could be extended to other platforms, but that's beyond the effort I want to go to. (I also didn't do anything to automate the model production or even to read values from the benchmarks into the spreadsheet used, so it's a bit of work for each platform. It shouldn't be that difficult to replicate using my spreadsheet and benchmarks, though.)

Exposing the individual algorithms increases the API surface and makes changes harder. It doesn't seem very worth doing.

@pitdicker (Contributor) commented:

I would like to have a good look, as it also is a bit subtle. I hope to get to it tomorrow. Sorry for the delay...

@pitdicker (Contributor) left a review:

Impressive set of benchmarks! And the spreadsheet helps 😄.

I think the way the choice for shuffled results is made looks a bit clunky, but it is a good idea overall.
I don't like the change to use only u32; we can do better with a little effort. While it is quite a reasonable limitation, it also seems like just the thing to work in testing and fail in production.

src/seq.rs Outdated
if C[0][j] * (length as u64) < (C[1][j] + m4) * amount as u64 {
sample_indices_inplace(rng, length, amount)
} else if shuffled {
sample_indices_floyd_shuffled(rng, length, amount)
@pitdicker (Contributor) commented on the diff:

I like the permutation and shuffling algorithm. It really uses the minimum number of values from the RNG possible. But in this case using shuffle is just faster:

let mut indices = sample_indices_floyd(rng, length, amount);
rng.shuffle(&mut indices);
indices

@dhardy (Member, Author) replied:

Hmm, I guess it could be. I should have benchmarked that!

src/seq.rs Outdated
sample_indices_inplace(rng, length, amount)
if length > (::core::u32::MAX as usize) {
panic!("`length` is not representable as `u32`");
}
@pitdicker (Contributor) commented on the diff:

I am really not a fan of this change. All algorithms can work with performance similar to the usize versions, but it would take a little extra complexity, and it would simply make things work without special cases.

I gave it a try for Floyd's algorithm here.

And I played with optimizing a partial shuffle in https://github.com/pitdicker/rand/blob/slice_random/src/seq.rs#L107 (used in sample_indices_inplace). I am sure we can come up with something cleaner than that which also works with sample_indices_cache.

@dhardy (Member, Author) replied Jun 8, 2018:

That will still slow down Floyd's algorithm with small indices since the output vector is larger, hence searching it (contains) is slower when more memory needs to be read. In particular, more results can fit in the cache with smaller indices.

This means that it can be faster to convert the output back to usize later, although that's ugly. Only outputting Vec<u32> was motivated by this problem.

But, yes, I get your point that we shouldn't fail arbitrarily here. I would have accepted u32 inputs only, but thought users may just do x as u32 risking invalid conversion.

@dhardy (Member, Author) commented Jun 15, 2018

So sample_indices returns Vec<T>, where T naturally should be usize but is switched to u32 by this PR for performance gains...

We can just use usize everywhere. If we can somehow make gen_range use the most appropriate size with little overhead, then using the larger size should only increase time by 10-20%, except where paging comes into play (which happens at half the number of elements on x64, compared to u32).

We could do what I initially implemented: sample_indices uses usize, but the inplace implementation (and perhaps Floyd) use the u32 type, then the output vector is expanded to usize. Perhaps surprisingly this can still increase performance.

We could try making sample_indices generic over the index type, so long as it is supported by gen_range and supports From<u8> (or Zero and One, plus some arithmetic traits). More flexible but more complicated, and users don't automatically get the performance boost when sampling indices, although we could make choose_multiple automatically use the appropriate version.

We could try to make sample_indices return an Iterator<Item=usize>. Both the inplace and floyd methods need to produce a Vec anyway, so the only advantage there would be to hide the internal type. The cache method could directly return elements instead of accumulating in a Vec, but still needs its cache (HashMap). The skipping method @pitdicker found could probably also yield an iterator directly.

Any thoughts on this? I guess the iterator method may deserve more investigation (I'm not sure how well it would optimise). The generic method sounds both complex to implement and complex to use. The next best option might be only using u32 for the inplace method.

@pitdicker (Contributor) commented:

Shall I take one last look later today and do the merge?

@vks (Collaborator) commented Jul 16, 2018

The WASM build failed with the following panic:

thread 'main' panicked at 'unsupported custom section: '.debug_info'', src/wasm_context.rs:646:25
note: Run with `RUST_BACKTRACE=1` for a backtrace.

@dhardy (Member, Author) commented Jul 16, 2018

Don't worry about that debug info error: koute/cargo-web#116

Please do @pitdicker!

@vks (Collaborator) commented Jul 16, 2018

Should we document somewhere that we need at least cargo-web 0.6.14 for WASM support?

@pitdicker (Contributor) left a review:

You did great work here.

I feel a bit uneasy about IndexVec, but I suppose it is a useful abstraction. Can you improve the doc comment of it to explain why it exists?

And I think it is better if sample_floyd does nothing more than the named algorithm, with the shuffling choices abstracted out of it.
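For context, IndexVec is (roughly) an enum holding the sampled indices as either u32 or usize, so the u32-based algorithms need not pay for widening until the caller asks for it. A simplified sketch (the committed type has more methods, e.g. for iteration):

    // Simplified sketch of the IndexVec return type.
    pub enum IndexVec {
        U32(Vec<u32>),
        USize(Vec<usize>),
    }

    impl IndexVec {
        // Convert into a Vec<usize>, widening only if needed.
        pub fn into_vec(self) -> Vec<usize> {
            match self {
                IndexVec::U32(v) => v.into_iter().map(|i| i as usize).collect(),
                IndexVec::USize(v) => v,
            }
        }
    }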

CHANGELOG.md Outdated

### Sequences module
- Optimised and changed return type of the `sample_indices` function. (#479)
- Added `shuffled: bool` input parameter to `sample_indices`. (#479)
@pitdicker (Contributor) commented on the diff:

No longer true?

src/seq/index.rs Outdated
/// The output values are fully shuffled. (Overhead is under 50%.)
///
/// This implementation uses `O(amount)` memory and `O(amount^2)` time.
pub fn sample_floyd<R>(rng: &mut R, length: u32, amount: u32) -> IndexVec
@pitdicker (Contributor) commented Jul 16, 2018:

Could you make this function more 'pure'? I.e. add back the shuffled bool, and leave out the later shuffle? Then this function works as the name says, and another function can be convenient...

@dhardy (Member, Author) replied:

But why would anyone else use this directly? The variable shuffling approach is important for performance.

It's not supposed to be pub...

@pitdicker (Contributor) commented:

Should we document somewhere that we need at least cargo-web 0.6.14 for WASM support?

That seems to be more of a problem between nightlies and cargo-web. Not really something caused by rand, so I see no reason to document it.

@sicking (Contributor) commented Jul 17, 2018

I was hoping to do some actual analysis of the performance of the various options. However it doesn't appear that I'll have time to do that anytime soon. I did get to the point where I have graphs showing performance of various algorithms.

Graph for returning a raw Vec<usize> is here:
https://rawgit.com/sicking/rand_choose_multiple_bench/master/results/vec_graph.html
Graph for creating an enum containing either a HashSet<usize>, a BTreeSet<usize>, or a Vec<usize>, is here:
https://rawgit.com/sicking/rand_choose_multiple_bench/master/results/iter_graph.html

In both cases the result is iterated once to make the comparison more fair. I.e. it accounts for the overhead involved in the more complex iterator required in the latter approach.

This doesn't explore gains from using u32 at all since I didn't get that far.

There's one graph for each length sampled from. The x-axis is the amount of items sampled; the y-axis is the time (divided by the amount of items sampled, to make results more comparable).

In both cases I made three categories of algorithms:

  • Ones that generate results in any order
  • Ones that generate results in random order
  • Ones that generate results in sorted order

@sicking (Contributor) commented Jul 17, 2018

Code for generating the data, as well as the raw data is in https://github.com/sicking/rand_choose_multiple_bench

@dhardy (Member, Author) commented Jul 17, 2018

Wow, that's a lot of work you've done on this @sicking; it goes quite a way beyond what I did. This problem seems to have grown from a "small" optimisation into something quite complex.

Result type: my IndexVec and @sicking's SampleResult are quite similar aside from the name. Supporting both into_iter() and into_vec() seems prudent; there are no actual requirements on the performance or mechanics of either. One limitation is that direct iteration still requires an RNG, and without lifetime restrictions on the result type (which can be awkward and are often unnecessary, but potentially the most flexible option), this is only possible by making a sub-generator (really not ideal).

Sorting of elements:

  • many of these algorithms provide fully shuffled output
  • some provide partially shuffled output
  • some algorithms sort the output (btree and floyd_sorted_vec)
  • @pitdicker's skipping method maintains order

Which brings up another point: the skipping method does well in some tests with large input and output. But if the output must be shuffled (as my latest version requires), this will destroy the performance (not tested, but shuffling is at least O(m log m)).

There may be some applications which require stable ordering of samples; perhaps this is best met with a separate sample_indices_stable method.

There are also applications which don't care about ordering at all. I previously argued that we should just give them fully-shuffled samples since it simplifies the API and avoids a few potential bugs; this argument makes sense if the overhead is generally low, but may not if the overhead can be large.


Regarding algorithms: it is interesting that the main benefit of Floyd's is its use of Vec as the backing type. And inplace_rev works pretty well where almost all elements are required, though its output isn't shuffled.

@sicking (Contributor) commented Jul 17, 2018

One limitation is that direct iteration still requires an RNG, and without lifetime restrictions on the result type (which can be awkward and are often unnecessary, but potentially the most flexible option), this is only possible by making a sub-generator (really not ideal).

I don't quite understand what you're saying here. But none of the approaches I measured use truly lazy generation during iteration. In all approaches all the samples are generated before the sampling function returns. The only difference is whether a Vec<usize> or an enum { Vec<usize>, HashSet<usize>, BTreeSet<usize> } was returned. But either way the returned container always had all generated samples.

I never got to comparing time across those two approaches. But the timings were intentionally generated such that it's meaningful to do such a comparison.

Which brings up another point: the skipping method does well in some tests with large input and output. But if the output must be shuffled (as my latest version requires), this will destroy the performance

Yeah. I intentionally didn't attempt to measure algorithms that first generated results and then shuffled or randomized them. It looked to me that the timings were close enough between the various algorithms that that would always be slower than using an algorithm that directly produces sorted/shuffled results. But that was an untested assumption.

@sicking (Contributor) commented Jul 17, 2018

Result type: my IndexVec and @sicking's SampleResult are quite similar aside from the name. Supporting both into_iter() and into_vec() seems prudent; there are no actual requirements on the performance or mechanics of either.

There's a significant difference though in that SampleResult only has those two functions, whereas IndexVec has more surface area.

But I've already voiced my opinion on that aspect so I won't rehash again.

@dhardy added 14 commits on July 30, 2018. Commit messages include:

  • The sample_indices_inplace algorithm is inappropriate for large numbers
  • Update with new benchmark data from `u32` impls of Floyd's and cached algorithms (the inplace alg already used benchmarks from the `u32` impl)
  • Update Floyd's with a balanced model adequate for both shuffled and unshuffled versions
  • Motivation: don't have to worry about overflow whichever index type is used
  • Appears to slightly improve some benchmarks, no effect on others
  • This is to allow use of u32 or usize internally
  • This accounts for the "cache" method being replaced by rejection sampling and now using usize again
@dhardy (Member, Author) commented Jul 30, 2018

I don't quite understand what you're saying here.

Only that considering the result primarily an iterator and secondarily a vector is inappropriate, since the current API is not really compatible with generating values on demand. That's not to say we couldn't do this, but it would be fundamentally quite different.

I don't consider this perfect, but it's still a lot better than the current approach, and perfect is the enemy of good as the saying goes. I'm going to merge once CI passes; there is still time for tweaking before 0.6 is released if need be.

@dhardy merged commit 9b5db4c into rust-random:master on Jul 30, 2018.
@dhardy deleted the sample_floyd branch on February 15, 2019.