Add/Rework benchmarks to track initialization cost #272

Merged
merged 1 commit into master from bench on Jul 13, 2022

Conversation

josephlr
Member


This PR adds more benchmarks so we can get an accurate idea about two
things:

  - What is the cost of having to zero the buffer before calling
    `getrandom`? (a sketch of such a benchmark pair follows this list)
  - What is the performance on aligned, 32-byte buffers?
    - This is by far the most common use, as it's used to seed
      userspace CSPRNGs.
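A minimal sketch of what such a pair of benchmarks can look like, assuming the nightly `test::Bencher` API; the names, the 32-byte size, and the explicit `fill(0)` are illustrative, not necessarily the exact code in this PR:

```rust
#![feature(test)]
extern crate test;

use test::{black_box, Bencher};

const SEED_LEN: usize = 32; // common CSPRNG seed size

#[bench]
fn bench_seed(b: &mut Bencher) {
    let mut buf = [0u8; SEED_LEN];
    b.bytes = SEED_LEN as u64; // lets libtest report MB/s
    b.iter(|| {
        getrandom::getrandom(&mut buf).unwrap();
        black_box(&buf);
    });
}

#[bench]
fn bench_seed_init(b: &mut Bencher) {
    let mut buf = [0u8; SEED_LEN];
    b.bytes = SEED_LEN as u64;
    b.iter(|| {
        buf.fill(0); // the zeroing cost we want to measure
        getrandom::getrandom(&mut buf).unwrap();
        black_box(&buf);
    });
}
```

The only difference between the two is the per-iteration `memset`, so the delta between the `*_init` and plain variants isolates the initialization cost.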

I ran the benchmarks on my system:
  - CPU: AMD Ryzen 7 5700G
  - OS: Linux 5.15.52-1-lts
  - Rust Version: 1.62.0-nightly (ea92b0838 2022-05-07)

I got the following results:
```
test bench_large      ... bench:   3,759,323 ns/iter (+/- 177,100) = 557 MB/s
test bench_large_init ... bench:   3,821,229 ns/iter (+/- 39,132) = 548 MB/s
test bench_page       ... bench:       7,281 ns/iter (+/- 59) = 562 MB/s
test bench_page_init  ... bench:       7,290 ns/iter (+/- 69) = 561 MB/s
test bench_seed       ... bench:         206 ns/iter (+/- 3) = 155 MB/s
test bench_seed_init  ... bench:         206 ns/iter (+/- 1) = 155 MB/s
```

These results were very consistent across multiple runs, and roughly
behave as we would expect:
  - The throughput is highest with a buffer large enough to amortize the
    syscall overhead, but small enough to stay in the L1D cache.
  - There is a _very_ small cost to zeroing the buffer beforehand.
  - This cost is imperceptible in the common 32-byte use case, where the
    syscall overhead dominates.
  - The cost is slightly higher (~1%) with multi-megabyte buffers, as the
    data gets evicted from the L1 cache between the `memset` and the
    call to `getrandom`.
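
As a sanity check, libtest's MB/s column is just bytes per iteration divided by nanoseconds per iteration (which gives GB/s), scaled by 1000. Assuming `bench_seed` fills 32 bytes and `bench_page` one 4 KiB page (both consistent with the reported rates):

```rust
// Reproduce libtest's reported throughput: bytes/iter over ns/iter is
// GB/s; multiplying by 1000 gives the MB/s shown in the tables.
fn mb_per_sec(bytes: u64, ns_per_iter: u64) -> u64 {
    bytes * 1000 / ns_per_iter
}

fn main() {
    assert_eq!(mb_per_sec(32, 206), 155);     // bench_seed row above
    assert_eq!(mb_per_sec(4096, 7_281), 562); // bench_page row above
}
```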

I would love to see results for other platforms. Could we get someone to
run this on an M1 Mac?

Signed-off-by: Joe Richey <joerichey@google.com>
@josephlr josephlr requested a review from newpavlov July 10, 2022 07:31
@josephlr josephlr marked this pull request as ready for review July 10, 2022 07:31
@josephlr
Member Author

josephlr commented Jul 10, 2022

I also locally patched the crate to use the rdrand implementation and got:

```
test bench_large      ... bench:   4,152,659 ns/iter (+/- 29,449) = 505 MB/s
test bench_large_init ... bench:   4,232,638 ns/iter (+/- 48,649) = 495 MB/s
test bench_page       ... bench:       8,120 ns/iter (+/- 72) = 504 MB/s
test bench_page_init  ... bench:       8,156 ns/iter (+/- 65) = 502 MB/s
test bench_seed       ... bench:          63 ns/iter (+/- 0) = 507 MB/s
test bench_seed_init  ... bench:          66 ns/iter (+/- 0) = 484 MB/s
```

Again, these results were quite stable over multiple runs, showing a small improvement from not having to initialize the buffer.
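
For context on these numbers: an RDRAND-based backend essentially fills the buffer one 64-bit word at a time with the `rdrand` instruction, with no syscall involved. A rough sketch of the idea (not the crate's actual implementation, and omitting the retry loop real code needs):

```rust
// Illustrative only: fill a buffer 8 bytes at a time using the RDRAND
// instruction. Real backends retry transient RDRAND failures a bounded
// number of times instead of failing immediately.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "rdrand")]
unsafe fn fill_via_rdrand(dest: &mut [u8]) -> Result<(), ()> {
    use core::arch::x86_64::_rdrand64_step;
    for chunk in dest.chunks_mut(8) {
        let mut word: u64 = 0;
        if _rdrand64_step(&mut word) != 1 {
            return Err(()); // RDRAND reported failure
        }
        chunk.copy_from_slice(&word.to_le_bytes()[..chunk.len()]);
    }
    Ok(())
}
```

Because the cost is per 64-bit word rather than per call, there is no syscall overhead to amortize, which is why the rdrand throughput is nearly flat across buffer sizes.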

For this and the above x86_64 Linux benchmark, I used:

```
RUSTFLAGS="-C opt-level=3 -C codegen-units=1 -C embed-bitcode=yes -C lto=fat -C target-cpu=native"
```

@josephlr
Member Author

josephlr commented Jul 13, 2022

@newpavlov anything blocking merging in these benchmarks? If we merge them in, it will be easier for people to run them on different platforms. This will, in turn, make it easier to figure out if #226 and #271 are worth it.

@josephlr
Member Author

On another system:

  - CPU: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
  - OS: Linux 5.17.11-1
  - Rust Version: rustc 1.64.0-nightly (1c7b36d4d 2022-07-12)

Linux implementation (default):

```
test bench_large      ... bench:   4,703,785 ns/iter (+/- 428,152) = 445 MB/s
test bench_large_init ... bench:   4,728,995 ns/iter (+/- 717,649) = 443 MB/s
test bench_page       ... bench:       9,816 ns/iter (+/- 88) = 417 MB/s
test bench_page_init  ... bench:       9,852 ns/iter (+/- 142) = 415 MB/s
test bench_seed       ... bench:         689 ns/iter (+/- 7) = 46 MB/s
test bench_seed_init  ... bench:         689 ns/iter (+/- 9) = 46 MB/s
```

RDRAND implementation (patched):

```
test bench_large      ... bench:  32,015,164 ns/iter (+/- 148,769) = 65 MB/s
test bench_large_init ... bench:  32,053,932 ns/iter (+/- 118,142) = 65 MB/s
test bench_page       ... bench:      62,012 ns/iter (+/- 647) = 66 MB/s
test bench_page_init  ... bench:      62,674 ns/iter (+/- 290) = 65 MB/s
test bench_seed       ... bench:         490 ns/iter (+/- 6) = 65 MB/s
test bench_seed_init  ... bench:         492 ns/iter (+/- 7) = 65 MB/s
```

Again, the difference is detectable, but very, very small.

@josephlr
Member Author

On an aarch64-unknown-linux-musl system:

  - GCE ARM64 VM t2a-standard-4 (4 vCPU, Ampere Altra)
  - OS: Debian Linux 5.18.0-0.bpo.1-cloud-arm64
  - Rust Version: rustc 1.64.0-nightly (1c7b36d4d 2022-07-12)

Linux implementation:

```
test bench_large      ... bench:   4,826,323 ns/iter (+/- 63,950) = 434 MB/s
test bench_large_init ... bench:   4,871,679 ns/iter (+/- 46,888) = 430 MB/s
test bench_page       ... bench:       9,718 ns/iter (+/- 128) = 421 MB/s
test bench_page_init  ... bench:       9,816 ns/iter (+/- 197) = 417 MB/s
test bench_seed       ... bench:         329 ns/iter (+/- 3) = 97 MB/s
test bench_seed_init  ... bench:         331 ns/iter (+/- 5) = 96 MB/s
```

@josephlr josephlr merged commit 7089766 into master Jul 13, 2022
@josephlr josephlr deleted the bench branch July 13, 2022 13:04