
Speed up Uniform Duration sampling. #583

Merged

merged 6 commits into rust-random:master on Aug 23, 2018

Conversation

@Pazzaz (Contributor) commented Aug 5, 2018

These changes speed up the process of sampling Durations from a uniform distribution, primarily by avoiding two expensive operations (sketched after the list):

  • Creating a Duration (now only done once per sample call)
  • Addition of Durations (now never done)
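
A minimal sketch of the general idea behind both points, assuming the rand 0.5-era Uniform API and a span that fits in u64 nanoseconds (the PR's real implementation also handles larger spans):

use std::time::Duration;

use rand::Rng;
use rand::distributions::{Distribution, Uniform};

// Draw one integer and build a single Duration from it, instead of
// sampling an offset Duration and adding it to the lower bound.
fn sample_duration<R: Rng>(rng: &mut R, low: Duration, high: Duration) -> Duration {
    // Assumes both bounds fit in u64 nanoseconds (spans below ~584 years).
    let low_n = low.as_secs() * 1_000_000_000 + u64::from(low.subsec_nanos());
    let high_n = high.as_secs() * 1_000_000_000 + u64::from(high.subsec_nanos());
    let nanos = Uniform::new(low_n, high_n).sample(rng); // half-open range [low, high)
    // The only Duration constructed per sample call:
    Duration::new(nanos / 1_000_000_000, (nanos % 1_000_000_000) as u32)
}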

I also added some simple benchmarks to measure these changes.
Before:

test distr_uniform_duration_large   ... bench:      13,007 ns/iter (+/- 189) = 1230 MB/s
test distr_uniform_duration_largest ... bench:       9,263 ns/iter (+/- 136) = 1727 MB/s
test distr_uniform_duration_one     ... bench:      12,346 ns/iter (+/- 353) = 1295 MB/s
test distr_uniform_duration_variety ... bench:      12,371 ns/iter (+/- 130) = 1293 MB/s

After:

test distr_uniform_duration_large   ... bench:      12,523 ns/iter (+/- 41) = 1277 MB/s
test distr_uniform_duration_largest ... bench:       8,395 ns/iter (+/- 102) = 1905 MB/s
test distr_uniform_duration_one     ... bench:       9,907 ns/iter (+/- 422) = 1615 MB/s
test distr_uniform_duration_variety ... bench:      10,504 ns/iter (+/- 90) = 1523 MB/s

To make sure nothing regressed, I also ran a more comprehensive benchmark.
The code for those benchmarks (Rust to benchmark, Python to graph) can be found here.

Comprehensive Benchmark Results

In these heatmaps, the x-axis is the upper bound and the y-axis is the lower bound. 800 different Durations were used as bounds, ranging in increasing order from 0s to 18446744074.7095512s.
The z-value is the average number of nanoseconds it took to sample from the corresponding distribution.
Before:

[heatmap screenshot: before]

After:
[heatmap screenshot: after]
Percentage of improvement:
[heatmap screenshot: percentage improvement]
Zoomed in:
[heatmap screenshot: zoomed in]
So there is roughly a 10-20% improvement, and the values outside that range appear to be noise.

Edit: changed a call to Duration::from_nanos to Duration::new(nanos / 1_000_000_000, (nanos % 1_000_000_000) as u32) for compatibility with older Rust versions. It shouldn't make much of a difference, but the benchmarks could be rerun.
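
For reference, the two constructions produce the same value; from_nanos simply wasn't available on the older compilers the crate still supported:

use std::time::Duration;

fn main() {
    let nanos: u64 = 2_500_000_000;
    // The manual split used in the PR...
    let manual = Duration::new(nanos / 1_000_000_000, (nanos % 1_000_000_000) as u32);
    // ...matches from_nanos, which requires a newer compiler.
    assert_eq!(manual, Duration::from_nanos(nanos));
    assert_eq!(manual, Duration::new(2, 500_000_000));
}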

@dhardy (Member) commented Aug 6, 2018

Looks good, though there might be some ranges which are tricky to sample (e.g. 4.999s to 5.001s). Since such ranges are (probably) unlikely, this may not be a big issue.

This does bring up two points though:

  • the algorithm's complexity (unbounded, because of rejection) should be mentioned
  • is it better to optimise for the worst case or the average case? Arguably the average case in general, except where worst-case performance is terrible

Adding 1e9 - low_n everywhere reduces the size of the rejection zone at the cost of a little extra arithmetic. Doing this and using the Small mode optimisation guarantees that the probability of rejecting a sample is less than half, which is probably good enough since then the expectation of really terrible worst-case performance is negligible.
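
For reference, the basic rejection structure being discussed looks roughly like this (a naive sketch assuming the rand 0.5-era API, not the PR's code; the suggested offset shifts the nanoseconds so that only the final second can reject a sample):

use std::time::Duration;

use rand::Rng;
use rand::distributions::{Distribution, Uniform};

// Sample seconds and nanoseconds independently over [low, high] and retry
// whenever the pair falls outside the range.
fn sample_by_rejection<R: Rng>(rng: &mut R, low: Duration, high: Duration) -> Duration {
    let secs = Uniform::new_inclusive(low.as_secs(), high.as_secs());
    let nanos = Uniform::new(0u32, 1_000_000_000);
    loop {
        let candidate = Duration::new(secs.sample(rng), nanos.sample(rng));
        if candidate >= low && candidate <= high {
            return candidate;
        }
    }
}

For a range like 4.999s to 5.001s, this naive loop accepts only about 0.1% of candidates (roughly 2,000,000 valid pairs out of 2,000,000,000), which is the worst case mentioned above; the offset plus the Small mode keep the rejection probability below one half.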

@dhardy (Member) commented Aug 10, 2018

By the way: your extended benchmarks show a significant performance regression for one set of bounds (I don't know what these are). I'm not keen on accepting this as-is due to the potential for very poor performance on certain bounds (but if we did, then there should definitely be a benchmark highlighting the problem).

@Pazzaz (Contributor, Author) commented Aug 14, 2018

I hadn't really considered the worst case. I think it has much better worst-case performance now: I reduced the rejection zone by using an offset of -low_n in the Large case and made Small applicable in more cases. I've spent quite some time trying to optimise this further, but I think it's good enough now.

Current benchmark difference:
old:

running 5 tests
test distr_uniform_duration_edge    ... bench:      13,522 ns/iter (+/- 59) = 1183 MB/s
test distr_uniform_duration_large   ... bench:      12,853 ns/iter (+/- 100) = 1244 MB/s
test distr_uniform_duration_largest ... bench:       8,945 ns/iter (+/- 17) = 1788 MB/s
test distr_uniform_duration_one     ... bench:      12,555 ns/iter (+/- 142) = 1274 MB/s
test distr_uniform_duration_variety ... bench:      12,555 ns/iter (+/- 185) = 1274 MB/s

new:

running 5 tests
test distr_uniform_duration_edge    ... bench:       5,641 ns/iter (+/- 69) = 2836 MB/s
test distr_uniform_duration_large   ... bench:      12,517 ns/iter (+/- 33) = 1278 MB/s
test distr_uniform_duration_largest ... bench:       8,352 ns/iter (+/- 32) = 1915 MB/s
test distr_uniform_duration_one     ... bench:       9,942 ns/iter (+/- 36) = 1609 MB/s
test distr_uniform_duration_variety ... bench:      12,085 ns/iter (+/- 185) = 1323 MB/s

One thing that is a little odd right now is that I've put the offset used in the Large case as a field on the struct instead of on the enum variant itself, simply because I found it improved the benchmarks.
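
Pieced together from the review snippets further down, the sampler's rough shape is something like the following (an illustrative reconstruction, with the Medium variant's contents guessed, not the merged code):

use rand::distributions::Uniform;

// Illustrative reconstruction only. The offset lives on the struct rather
// than on the Large variant, as discussed above.
struct UniformDuration {
    mode: UniformDurationMode,
    offset: u32,
}

enum UniformDurationMode {
    // Fixed seconds plus a sampled nanosecond count (which may carry over).
    Small { secs: u64, nanos: Uniform<u32> },
    // The whole span fits in u64 nanoseconds (contents guessed).
    Medium { nanos: Uniform<u64> },
    // Anything larger: sample seconds and nanoseconds separately and reject
    // samples beyond the upper bound.
    Large { max_secs: u64, max_nanos: u32, secs: Uniform<u64>, nano_range: Uniform<u32> },
}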

your extended benchmarks show a significant performance regression for one set of bounds

I think that's just noise caused by my computer prioritizing something else.

@dhardy (Member) left a review

Nice work. Unfortunately it's also wrong; I'm surprised it didn't panic in your tests! Should be easy to fix though.

high_n = high_n + 1_000_000_000;
} else {
high_s = high_s;
high_n = high_n;

dhardy (Member): redundant

Uniform::new(Duration::new(10000, 423423), Duration::new(200000, 6969954))
);
distr_duration!(distr_uniform_duration_edge,
Uniform::new_inclusive(Duration::new(u64::max_value() / 10, 999_999_999), Duration::new((u64::max_value() / 10) + 1, 0))

dhardy (Member):
This is clearer if you put the "max / 10" bit above (let binding); you can use any value > 4 (to avoid the Medium sampling method). Edge case with high_ns = 1 may be worse in some implementations and is probably more important to check (keeping low_ns the same).

Pazzaz (Contributor, Author):
The Medium sampling method actually occurs for any value > (2^64)/(10^9), but I simplified it by using a const binding and the same number of seconds in distr_uniform_duration_edge and distr_uniform_duration_large.

mode: UniformDurationMode,
offset: u32,

dhardy (Member) commented Aug 14, 2018:
The offset is only used by Large mode so can be moved there.

Edit: sorry, I see your comment about this. Fine. It should make no difference to the size of the struct either way.

match self.mode {
UniformDurationMode::Small { secs, nanos } => {
let n = nanos.sample(rng);
Duration::new(secs, n)

dhardy (Member):
Unfortunately this is now incorrect since it's possible that n > 10^9 (in this case new should panic). Hopefully using an if to check here won't have too big an impact.

Pazzaz (Contributor, Author):
Duration::new() actually handles that already. In the documentation it says:

If the number of nanoseconds is greater than 1 billion (the number of nanoseconds in a second), then it will carry over into the seconds provided.
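
A quick demonstration of that carry behaviour:

use std::time::Duration;

fn main() {
    // Duration::new normalises an over-full nanosecond field itself:
    // 1s + 1_500_000_000ns becomes 2s + 500_000_000ns.
    assert_eq!(Duration::new(1, 1_500_000_000), Duration::new(2, 500_000_000));
}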

Collaborator:
Maybe add a debug_assert?

let s = secs.sample(rng);
let n = nano_range.sample(rng);
if !(s == max_secs && n > max_nanos) {
let sum = n + self.offset;

dhardy (Member):
After adding the offset back, s * 1_000_000_000 + sum is correct, but again sum may be more than 10^9. Again, using an if to check whether to decrement sum and increment s is probably the best option.
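
A minimal sketch of the suggested if-check, reusing the names from the snippet above (illustrative only, not the code that was merged):

use std::time::Duration;

// Fold an overflowing nanosecond sum back into the seconds before
// constructing the Duration. Assumes `n` and `offset` are each below one
// second, so a single subtraction is enough.
fn combine(s: u64, n: u32, offset: u32) -> Duration {
    let mut s = s;
    let mut sum = n + offset;
    if sum >= 1_000_000_000 {
        sum -= 1_000_000_000;
        s += 1;
    }
    debug_assert!(sum < 1_000_000_000); // the debug_assert suggested earlier
    Duration::new(s, sum)
}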

@AppVeyorBot: Build rand 1.0.59 failed (commit 970720fb41 by @Pazzaz)

@dhardy merged commit ea467e7 into rust-random:master on Aug 23, 2018