
Uniform distribution: bias and usize portability #809

Merged (6 commits) on Jun 3, 2019

Conversation

@dhardy (Member) commented May 29, 2019

Bias

I noticed that the zone produced via sign extension is not a multiple of the range. This means that the implementations for 8- and 16-bit types (@pitdicker) were biased. I added a fix for this, which does not appear to have any performance impact.

It's worth mentioning, however, that the bias is tiny: the biggest deviation from a multiple of the range I could find was 32579 (for range = 65355); since this samples from a u32, the probability of generating a biased sample is thus only 32579 / 2^32 ≈ 7.6e-6. With 100 million samples this bias is still lost in the noise.
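
For concreteness, here is a minimal sketch of the widening-multiply rejection step with an unbiased zone; the helper name sample_u16_in_range and the closure used for RNG draws are illustrative, not the actual rand internals.

// Illustrative only: sample a value in low..=high (u16) from u32 draws.
// `zone + 1` is the largest multiple of `range` that fits in 2^32, so every
// one of the `range` outcomes is accepted for exactly the same number of draws.
fn sample_u16_in_range(low: u16, high: u16, mut draw_u32: impl FnMut() -> u32) -> u16 {
    let range = (high - low) as u32 + 1;
    let ints_to_reject = (u32::MAX - range + 1) % range; // == 2^32 mod range
    let zone = u32::MAX - ints_to_reject;
    loop {
        let v = draw_u32();
        let m = (v as u64) * (range as u64);             // widening multiply
        let (hi, lo) = ((m >> 32) as u32, m as u32);
        if lo <= zone {
            return low + hi as u16;                      // hi is uniform in 0..range
        }
    }
}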

Note: the included uniformity test is unfinished because it turned out to be pretty useless; it is included as an example but should probably just be deleted.

Portability for isize/usize samples

As discussed in #805. Unfortunately this does have a performance hit:

# before:
test distr_uniform_isize                   ... bench:       1,628 ns/iter (+/- 198) = 4914 MB/s
test distr_uniform_usize16                 ... bench:       1,617 ns/iter (+/- 121) = 4947 MB/s
test distr_uniform_usize32                 ... bench:       1,612 ns/iter (+/- 146) = 4962 MB/s
test distr_uniform_usize64                 ... bench:       2,473 ns/iter (+/- 79) = 3234 MB/s
# after:
test distr_uniform_isize                   ... bench:       5,818 ns/iter (+/- 121) = 1375 MB/s
test distr_uniform_usize16                 ... bench:       1,721 ns/iter (+/- 21) = 4648 MB/s
test distr_uniform_usize32                 ... bench:       1,803 ns/iter (+/- 20) = 4437 MB/s
test distr_uniform_usize64                 ... bench:       2,426 ns/iter (+/- 15) = 3297 MB/s

It also results in quite a bit of redundant, ugly code. As such I'm not happy about adding it (though it would be nice to have).


Thoughts? @burdges @vks @pitdicker

@burdges (Contributor) commented May 29, 2019

It's just the usual hit from rejection sampling, yes?

@dhardy (Member, Author) commented May 30, 2019

All of these use rejection sampling, which makes the cost variable (especially when the required range is within a factor of 2 of the type's range). I'm not sure why the isize benchmark here suffers so much; the only additional cost should be a bit more branching.
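
As a rough quantification (illustrative, not from the PR itself): the acceptance zone is the largest multiple of the range that fits in the sample type, so when sampling from a u32 the rejection probability is (2^32 mod range) / 2^32. That is zero for a power-of-two range but close to one half for a range just over half the type's range.

// Illustrative: expected fraction of u32 draws rejected when reducing to `range` values.
fn rejection_probability(range: u32) -> f64 {
    let rejected = (1u64 << 32) % (range as u64); // draws that fall above the zone
    rejected as f64 / (1u64 << 32) as f64
}

// rejection_probability(1 << 31) == 0.0        (power of two: nothing rejected)
// rejection_probability((1 << 31) + 1) ~= 0.5  (worst case: about half of all draws rejected)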

@vks (Collaborator) commented May 31, 2019

I'm not sure why the isize benchmark here suffers so much; the only additional cost should be a bit more branching.

Maybe it is branch misprediction? You could try running the benchmarks separately with perf stat.

I think we should have the bias corrections for sure. I'm not so sure about the portability improvements; there is a lot of code duplication (which could be reduced with macros). I would prefer not to promise value stability across platforms for platform-dependent types, and to state this in the documentation. The tests could be fixed accordingly.

I think everyone expects sampling usize to be platform-dependent, but for rand::seq it is less intuitive. We could still consider implementing your fix there instead, or just mention the caveat in the documentation.

@dhardy (Member, Author) commented Jun 1, 2019

Good point @vks that this would be better done within seq code. I count eight uses of gen_range and one of Uniform<usize> within the alias-method weighted index implementation. As such, implementing the suggestion is not trivial but not too hard.

Compatibility: the changes to weighted::AliasMethod are breaking if (a) you use exhaustive match on weighted::WeightedError or (b) you use AliasMethod with more than u32::MAX elements (talk about using gigabytes of memory and non-scalable algorithms!).
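
Purely for illustration, the breakage from an exhaustive match looks like the following; only the NoItem and InvalidWeight variant names are taken from the existing WeightedError, and the wildcard arm is what keeps downstream code compiling when a new variant is added for the more-than-u32::MAX case.

use rand::distributions::weighted::WeightedError;

// Illustrative: an exhaustive match (no `_` arm) breaks when a new error
// variant is added; the wildcard arm keeps this compiling across such additions.
fn describe(err: WeightedError) -> &'static str {
    match err {
        WeightedError::NoItem => "the list of weights was empty",
        WeightedError::InvalidWeight => "a weight was invalid",
        _ => "another error (e.g. all weights zero, or too many elements)",
    }
}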

Performance: there are some minor wins and losses; nothing too significant I think:

# before:
running 19 tests
test misc_sample_indices_100_of_1G           ... bench:       2,562 ns/iter (+/- 122)
test misc_sample_indices_100_of_1M           ... bench:       2,496 ns/iter (+/- 182)
test misc_sample_indices_100_of_1k           ... bench:         549 ns/iter (+/- 23)
test misc_sample_indices_10_of_1k            ... bench:          89 ns/iter (+/- 4)
test misc_sample_indices_1_of_1k             ... bench:          26 ns/iter (+/- 1)
test misc_sample_indices_200_of_1G           ... bench:       5,048 ns/iter (+/- 271)
test misc_sample_indices_400_of_1G           ... bench:       9,010 ns/iter (+/- 228)
test misc_sample_indices_600_of_1G           ... bench:      13,227 ns/iter (+/- 966)
test seq_iter_choose_from_1000               ... bench:       3,341 ns/iter (+/- 46) = 2394 MB/s
test seq_iter_choose_multiple_10_of_100      ... bench:         931 ns/iter (+/- 142)
test seq_iter_choose_multiple_fill_10_of_100 ... bench:         873 ns/iter (+/- 42)
test seq_iter_unhinted_choose_from_1000      ... bench:       4,850 ns/iter (+/- 234)
test seq_iter_window_hinted_choose_from_1000 ... bench:       1,618 ns/iter (+/- 50)
test seq_shuffle_100                         ... bench:         837 ns/iter (+/- 31)
test seq_slice_choose_1_of_1000              ... bench:       3,336 ns/iter (+/- 211) = 2398 MB/s
test seq_slice_choose_multiple_10_of_100     ... bench:         156 ns/iter (+/- 16)
test seq_slice_choose_multiple_1_of_1000     ... bench:          33 ns/iter (+/- 2)
test seq_slice_choose_multiple_90_of_100     ... bench:         961 ns/iter (+/- 102)
test seq_slice_choose_multiple_950_of_1000   ... bench:       9,164 ns/iter (+/- 377)

running 8 tests
test distr_weighted_alias_method_f64       ... bench:      10,569 ns/iter (+/- 479) = 756 MB/s
test distr_weighted_alias_method_i8        ... bench:       9,906 ns/iter (+/- 603) = 807 MB/s
test distr_weighted_alias_method_large_set ... bench:      10,802 ns/iter (+/- 510) = 740 MB/s
test distr_weighted_alias_method_u32       ... bench:       9,898 ns/iter (+/- 915) = 808 MB/s
test distr_weighted_f64                    ... bench:       8,544 ns/iter (+/- 365) = 936 MB/s
test distr_weighted_i8                     ... bench:      11,145 ns/iter (+/- 931) = 717 MB/s
test distr_weighted_large_set              ... bench:      64,942 ns/iter (+/- 2,191) = 123 MB/s
test distr_weighted_u32                    ... bench:      10,906 ns/iter (+/- 414) = 733 MB/s

# after:
running 19 tests
test misc_sample_indices_100_of_1G           ... bench:       2,851 ns/iter (+/- 263)
test misc_sample_indices_100_of_1M           ... bench:       2,787 ns/iter (+/- 33)
test misc_sample_indices_100_of_1k           ... bench:         543 ns/iter (+/- 26)
test misc_sample_indices_10_of_1k            ... bench:          91 ns/iter (+/- 3)
test misc_sample_indices_1_of_1k             ... bench:          26 ns/iter (+/- 0)
test misc_sample_indices_200_of_1G           ... bench:       4,076 ns/iter (+/- 264)
test misc_sample_indices_400_of_1G           ... bench:       8,293 ns/iter (+/- 225)
test misc_sample_indices_600_of_1G           ... bench:      12,066 ns/iter (+/- 363)
test seq_iter_choose_from_1000               ... bench:       3,768 ns/iter (+/- 73) = 2123 MB/s
test seq_iter_choose_multiple_10_of_100      ... bench:         890 ns/iter (+/- 22)
test seq_iter_choose_multiple_fill_10_of_100 ... bench:         882 ns/iter (+/- 32)
test seq_iter_unhinted_choose_from_1000      ... bench:       4,842 ns/iter (+/- 183)
test seq_iter_window_hinted_choose_from_1000 ... bench:       1,816 ns/iter (+/- 41)
test seq_shuffle_100                         ... bench:         941 ns/iter (+/- 123)
test seq_slice_choose_1_of_1000              ... bench:       3,884 ns/iter (+/- 509) = 2059 MB/s
test seq_slice_choose_multiple_10_of_100     ... bench:         178 ns/iter (+/- 70)
test seq_slice_choose_multiple_1_of_1000     ... bench:          33 ns/iter (+/- 4)
test seq_slice_choose_multiple_90_of_100     ... bench:         973 ns/iter (+/- 69)
test seq_slice_choose_multiple_950_of_1000   ... bench:       9,152 ns/iter (+/- 377)

running 8 tests
test distr_weighted_alias_method_f64       ... bench:      10,665 ns/iter (+/- 175) = 750 MB/s
test distr_weighted_alias_method_i8        ... bench:       9,513 ns/iter (+/- 213) = 840 MB/s
test distr_weighted_alias_method_large_set ... bench:      10,378 ns/iter (+/- 537) = 770 MB/s
test distr_weighted_alias_method_u32       ... bench:       9,581 ns/iter (+/- 234) = 834 MB/s
test distr_weighted_f64                    ... bench:       8,508 ns/iter (+/- 215) = 940 MB/s
test distr_weighted_i8                     ... bench:      10,747 ns/iter (+/- 183) = 744 MB/s
test distr_weighted_large_set              ... bench:      65,314 ns/iter (+/- 3,106) = 122 MB/s
test distr_weighted_u32                    ... bench:      10,897 ns/iter (+/- 235) = 734 MB/s

@@ -451,6 +451,18 @@ impl<'a, S: Index<usize, Output = T> + ?Sized + 'a, T: 'a> ExactSizeIterator
}


// Sample a number uniformly between 0 and `ubound`. Uses 32-bit sampling where
// possible, primarily in order to produce the same output on 32-bit and 64-bit
// platforms.
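
The body of the helper is truncated in this fragment; purely as an illustration (assuming the two-argument gen_range of the rand 0.7-era API), such a helper might look like:

use rand::Rng;

// Sketch only: sample an index in 0..ubound through u32 whenever the bound fits,
// so that 32-bit and 64-bit platforms consume the same RNG output.
// `ubound` is assumed to be non-zero (gen_range panics on an empty range).
fn gen_index<R: Rng + ?Sized>(rng: &mut R, ubound: usize) -> usize {
    if ubound <= u32::MAX as usize {
        rng.gen_range(0, ubound as u32) as usize
    } else {
        rng.gen_range(0, ubound as u64) as usize
    }
}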

@vks (Collaborator):
Maybe add #[inline] to encourage LLVM?

@dhardy (Member, Author):
Makes sense, but it has a negligible effect on the benchmarks (seq_iter_choose_from_1000 and seq_iter_window_hinted_choose_from_1000 are still about 12% slower than before this PR). We can live with this small hit.

@vks (Collaborator) commented Jun 3, 2019

  • Uniform distributions for SIMD types are currently broken.
  • Do we want to document value stability across 32- and 64-bit platforms for rand::seq, or is it preferable to leave it unspecified?

Other than the broken tests, this looks good!

@dhardy (Member, Author) commented Jun 3, 2019

I dropped the uniformity test, which was responsible for most of the failures and appears useless.

@vks (Collaborator) commented Jun 3, 2019

The remaining failures will be fixed by #813, so I think this can be merged.

Commit messages from the pushed commits:

  • The usize64 bench is noticeably slower than the others, perhaps due to use of rejection sampling.
  • Sign extension of zone was incorrect. This method has near-identical performance in benchmarks.
  • Primarily for value stability; also a slight performance boost.
@dhardy (Member, Author) commented Jun 3, 2019

Rebased on master; hopefully it passes this time
