
Speed up Uniform Duration sampling. #583

Merged

merged 6 commits into rust-random:master on Aug 23, 2018

Conversation

@Pazzaz (Contributor) commented Aug 5, 2018

These changes speed up the process of sampling Durations from a uniform distribution, primarily by avoiding two expensive operations (sketched after the list):

  • Creating a Duration (now only done once per sample call)
  • Addition of Durations (now never done)
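
A minimal sketch of the general idea behind both points, assuming the rand 0.5-era Uniform API and a span that fits in u64 nanoseconds (the PR's real implementation also handles larger spans):

use std::time::Duration;

use rand::Rng;
use rand::distributions::{Distribution, Uniform};

// Draw one integer and build a single Duration from it, instead of
// sampling an offset Duration and adding it to the lower bound.
fn sample_duration<R: Rng>(rng: &mut R, low: Duration, high: Duration) -> Duration {
    // Assumes both bounds fit in u64 nanoseconds (spans below ~584 years).
    let low_n = low.as_secs() * 1_000_000_000 + u64::from(low.subsec_nanos());
    let high_n = high.as_secs() * 1_000_000_000 + u64::from(high.subsec_nanos());
    let nanos = Uniform::new(low_n, high_n).sample(rng); // half-open range [low, high)
    // The only Duration constructed per sample call:
    Duration::new(nanos / 1_000_000_000, (nanos % 1_000_000_000) as u32)
}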

I also added some simple benchmarks to measure these changes.
Before:

test distr_uniform_duration_large   ... bench:      13,007 ns/iter (+/- 189) = 1230 MB/s
test distr_uniform_duration_largest ... bench:       9,263 ns/iter (+/- 136) = 1727 MB/s
test distr_uniform_duration_one     ... bench:      12,346 ns/iter (+/- 353) = 1295 MB/s
test distr_uniform_duration_variety ... bench:      12,371 ns/iter (+/- 130) = 1293 MB/s

After:

test distr_uniform_duration_large   ... bench:      12,523 ns/iter (+/- 41) = 1277 MB/s
test distr_uniform_duration_largest ... bench:       8,395 ns/iter (+/- 102) = 1905 MB/s
test distr_uniform_duration_one     ... bench:       9,907 ns/iter (+/- 422) = 1615 MB/s
test distr_uniform_duration_variety ... bench:      10,504 ns/iter (+/- 90) = 1523 MB/s

To make sure nothing regressed, I also ran a more comprehensive benchmark.
The code for those benchmarks (Rust to benchmark, Python to graph) can be found here.

Comprehensive Benchmark Results

In these heatmaps, the x-axis is the upper bound and the y-axis is the lower bound. 800 different Durations were used as bounds, ranging in increasing order from 0s to 18446744074.7095512s.
The z-value is the average number of nanoseconds it took to sample from the corresponding distribution.
Before:

[heatmap screenshot: before]

After:
[heatmap screenshot: after]
Percentage of improvement:
[heatmap screenshot: percentage improvement]
Zoomed in:
[heatmap screenshot: zoomed in]
So there is roughly a 10-20% improvement, and the values outside that range appear to be noise.

Edit: changed a call to Duration::from_nanos to Duration::new(nanos / 1_000_000_000, (nanos % 1_000_000_000) as u32) for compatibility with older Rust versions. It shouldn't make much of a difference, but the benchmarks could be rerun.
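
For reference, the two constructions produce the same value; from_nanos simply wasn't available on the older compilers the crate still supported:

use std::time::Duration;

fn main() {
    let nanos: u64 = 2_500_000_000;
    // The manual split used in the PR...
    let manual = Duration::new(nanos / 1_000_000_000, (nanos % 1_000_000_000) as u32);
    // ...matches from_nanos, which requires a newer compiler.
    assert_eq!(manual, Duration::from_nanos(nanos));
    assert_eq!(manual, Duration::new(2, 500_000_000));
}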

@dhardy (Member) commented Aug 6, 2018

Looks good, though there might be some ranges which are tricky to sample (e.g. 4.999s to 5.001s). Since such ranges are (probably) unlikely, this may not be a big issue.

This does bring up two points though:

  • the algorithm's complexity (unbounded, because of rejection) should be mentioned
  • is it better to optimise for the worst case or the average case? Arguably the average case in general, except where worst-case performance is terrible

Adding 1e9 - low_n everywhere reduces the size of the rejection zone at the cost of a little extra arithmetic. Doing this and using the Small mode optimisation guarantees that the probability of rejecting a sample is less than half, which is probably good enough since then the expectation of really terrible worst-case performance is negligible.
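
For reference, the basic rejection structure being discussed looks roughly like this (a naive sketch assuming the rand 0.5-era API, not the PR's code; the suggested offset shifts the nanoseconds so that only the final second can reject a sample):

use std::time::Duration;

use rand::Rng;
use rand::distributions::{Distribution, Uniform};

// Sample seconds and nanoseconds independently over [low, high] and retry
// whenever the pair falls outside the range.
fn sample_by_rejection<R: Rng>(rng: &mut R, low: Duration, high: Duration) -> Duration {
    let secs = Uniform::new_inclusive(low.as_secs(), high.as_secs());
    let nanos = Uniform::new(0u32, 1_000_000_000);
    loop {
        let candidate = Duration::new(secs.sample(rng), nanos.sample(rng));
        if candidate >= low && candidate <= high {
            return candidate;
        }
    }
}

For a range like 4.999s to 5.001s, this naive loop accepts only about 0.1% of candidates (roughly 2,000,000 valid pairs out of 2,000,000,000), which is the worst case mentioned above; the offset plus the Small mode keep the rejection probability below one half.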

@dhardy (Member) commented Aug 10, 2018

By the way: your extended benchmarks show a significant performance regression for one set of bounds (I don't know what these are). I'm not keen on accepting this as-is due to the potential for very poor performance on certain bounds (but if we did, then there should definitely be a benchmark highlighting the problem).

@Pazzaz (Contributor, Author) commented Aug 14, 2018

I hadn't really considered the worst case. I think it has much better worst-case performance now: I reduced the rejection zone by using an offset of -low_n in the Large case and made Small applicable in more cases. I've spent quite some time trying to optimise this further, but I think it's good enough now.

Current benchmark difference:
old:

running 5 tests
test distr_uniform_duration_edge    ... bench:      13,522 ns/iter (+/- 59) = 1183 MB/s
test distr_uniform_duration_large   ... bench:      12,853 ns/iter (+/- 100) = 1244 MB/s
test distr_uniform_duration_largest ... bench:       8,945 ns/iter (+/- 17) = 1788 MB/s
test distr_uniform_duration_one     ... bench:      12,555 ns/iter (+/- 142) = 1274 MB/s
test distr_uniform_duration_variety ... bench:      12,555 ns/iter (+/- 185) = 1274 MB/s

new:

running 5 tests
test distr_uniform_duration_edge    ... bench:       5,641 ns/iter (+/- 69) = 2836 MB/s
test distr_uniform_duration_large   ... bench:      12,517 ns/iter (+/- 33) = 1278 MB/s
test distr_uniform_duration_largest ... bench:       8,352 ns/iter (+/- 32) = 1915 MB/s
test distr_uniform_duration_one     ... bench:       9,942 ns/iter (+/- 36) = 1609 MB/s
test distr_uniform_duration_variety ... bench:      12,085 ns/iter (+/- 185) = 1323 MB/s

One thing that is a little odd right now is that I've put the offset used in the Large case as a field on the struct instead of on the enum variant itself, simply because I found it improved the benchmarks.
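
Pieced together from the review snippets further down, the sampler's rough shape is something like the following (an illustrative reconstruction, with the Medium variant's contents guessed, not the merged code):

use rand::distributions::Uniform;

// Illustrative reconstruction only. The offset lives on the struct rather
// than on the Large variant, as discussed above.
struct UniformDuration {
    mode: UniformDurationMode,
    offset: u32,
}

enum UniformDurationMode {
    // Fixed seconds plus a sampled nanosecond count (which may carry over).
    Small { secs: u64, nanos: Uniform<u32> },
    // The whole span fits in u64 nanoseconds (contents guessed).
    Medium { nanos: Uniform<u64> },
    // Anything larger: sample seconds and nanoseconds separately and reject
    // samples beyond the upper bound.
    Large { max_secs: u64, max_nanos: u32, secs: Uniform<u64>, nano_range: Uniform<u32> },
}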

your extended benchmarks show a significant performance regression for one set of bounds

I think that's just noise caused by my computer prioritizing something else.

@dhardy (Member) left a review

Nice work. Unfortunately it's also wrong; I'm surprised it didn't panic in your tests! Should be easy to fix though.

high_n = high_n + 1_000_000_000;
} else {
high_s = high_s;
high_n = high_n;

dhardy (Member): redundant

Uniform::new(Duration::new(10000, 423423), Duration::new(200000, 6969954))
);
distr_duration!(distr_uniform_duration_edge,
Uniform::new_inclusive(Duration::new(u64::max_value() / 10, 999_999_999), Duration::new((u64::max_value() / 10) + 1, 0))

dhardy (Member):
This is clearer if you put the "max / 10" bit above (let binding); you can use any value > 4 (to avoid the Medium sampling method). Edge case with high_ns = 1 may be worse in some implementations and is probably more important to check (keeping low_ns the same).

Pazzaz (Contributor, Author):
The Medium sampling method actually occurs for any value > (2^64)/(10^9), but I simplified it by using a const binding and the same number of seconds in distr_uniform_duration_edge and distr_uniform_duration_large.

mode: UniformDurationMode,
offset: u32,

dhardy (Member) commented Aug 14, 2018:
The offset is only used by Large mode so can be moved there.

Edit: sorry, I see your comment about this. Fine. It should make no difference to the size of the struct either way.

match self.mode {
UniformDurationMode::Small { secs, nanos } => {
let n = nanos.sample(rng);
Duration::new(secs, n)

dhardy (Member):
Unfortunately this is now incorrect since it's possible that n > 10^9 (in this case new should panic). Hopefully using an if to check here won't have too big an impact.

Pazzaz (Contributor, Author):
Duration::new() actually handles that already. In the documentation it says:

If the number of nanoseconds is greater than 1 billion (the number of nanoseconds in a second), then it will carry over into the seconds provided.
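
A quick demonstration of that carry behaviour:

use std::time::Duration;

fn main() {
    // Duration::new normalises an over-full nanosecond field itself:
    // 1s + 1_500_000_000ns becomes 2s + 500_000_000ns.
    assert_eq!(Duration::new(1, 1_500_000_000), Duration::new(2, 500_000_000));
}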

Collaborator:
Maybe add a debug_assert?

let s = secs.sample(rng);
let n = nano_range.sample(rng);
if !(s == max_secs && n > max_nanos) {
let sum = n + self.offset;

dhardy (Member):
After adding the offset back, s * 1_000_000_000 + sum is correct, but again sum may be more than 10^9. Again, using an if to check whether to decrement sum and increment s is probably the best option.
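
A minimal sketch of the suggested if-check, reusing the names from the snippet above (illustrative only, not the code that was merged):

use std::time::Duration;

// Fold an overflowing nanosecond sum back into the seconds before
// constructing the Duration. Assumes `n` and `offset` are each below one
// second, so a single subtraction is enough.
fn combine(s: u64, n: u32, offset: u32) -> Duration {
    let mut s = s;
    let mut sum = n + offset;
    if sum >= 1_000_000_000 {
        sum -= 1_000_000_000;
        s += 1;
    }
    debug_assert!(sum < 1_000_000_000); // the debug_assert suggested earlier
    Duration::new(s, sum)
}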

@AppVeyorBot: Build rand 1.0.59 failed (commit 970720fb41 by @Pazzaz)

@dhardy merged commit ea467e7 into rust-random:master on Aug 23, 2018