Add/Rework benchmarks to track initialization cost #272

Merged
merged 1 commit into master from bench on Jul 13, 2022

Conversation

josephlr
Member


This PR adds more benchmarks so we can get an accurate idea about two
things:

  - What is the cost of having to zero the buffer before calling
    `getrandom`? (a sketch of such a benchmark pair follows this list)
  - What is the performance on aligned, 32-byte buffers?
    - This is by far the most common use, as it's used to seed
      userspace CSPRNGs.
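A minimal sketch of what such a pair of benchmarks can look like, assuming the nightly `test::Bencher` API; the names, the 32-byte size, and the explicit `fill(0)` are illustrative, not necessarily the exact code in this PR:

```rust
#![feature(test)]
extern crate test;

use test::{black_box, Bencher};

const SEED_LEN: usize = 32; // common CSPRNG seed size

#[bench]
fn bench_seed(b: &mut Bencher) {
    let mut buf = [0u8; SEED_LEN];
    b.bytes = SEED_LEN as u64; // lets libtest report MB/s
    b.iter(|| {
        getrandom::getrandom(&mut buf).unwrap();
        black_box(&buf);
    });
}

#[bench]
fn bench_seed_init(b: &mut Bencher) {
    let mut buf = [0u8; SEED_LEN];
    b.bytes = SEED_LEN as u64;
    b.iter(|| {
        buf.fill(0); // the zeroing cost we want to measure
        getrandom::getrandom(&mut buf).unwrap();
        black_box(&buf);
    });
}
```

The only difference between the two is the per-iteration `memset`, so the delta between the `*_init` and plain variants isolates the initialization cost.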

I ran the benchmarks on my system:
  - CPU: AMD Ryzen 7 5700G
  - OS: Linux 5.15.52-1-lts
  - Rust Version: 1.62.0-nightly (ea92b0838 2022-05-07)

I got the following results:
```
test bench_large      ... bench:   3,759,323 ns/iter (+/- 177,100) = 557 MB/s
test bench_large_init ... bench:   3,821,229 ns/iter (+/- 39,132) = 548 MB/s
test bench_page       ... bench:       7,281 ns/iter (+/- 59) = 562 MB/s
test bench_page_init  ... bench:       7,290 ns/iter (+/- 69) = 561 MB/s
test bench_seed       ... bench:         206 ns/iter (+/- 3) = 155 MB/s
test bench_seed_init  ... bench:         206 ns/iter (+/- 1) = 155 MB/s
```

These results were very consistent across multiple runs, and roughly
behave as we would expect:
  - The throughput is highest with a buffer large enough to amortize the
    syscall overhead, but small enough to stay in the L1D cache.
  - There is a _very_ small cost to zeroing the buffer beforehand.
  - This cost is imperceptible in the common 32-byte use case, where the
    syscall overhead dominates.
  - The cost is slightly higher (~1%) with multi-megabyte buffers, as the
    data gets evicted from the L1 cache between the `memset` and the
    call to `getrandom`.
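
As a sanity check, libtest's MB/s column is just bytes per iteration divided by nanoseconds per iteration (which gives GB/s), scaled by 1000. Assuming `bench_seed` fills 32 bytes and `bench_page` one 4 KiB page (both consistent with the reported rates):

```rust
// Reproduce libtest's reported throughput: bytes/iter over ns/iter is
// GB/s; multiplying by 1000 gives the MB/s shown in the tables.
fn mb_per_sec(bytes: u64, ns_per_iter: u64) -> u64 {
    bytes * 1000 / ns_per_iter
}

fn main() {
    assert_eq!(mb_per_sec(32, 206), 155);     // bench_seed row above
    assert_eq!(mb_per_sec(4096, 7_281), 562); // bench_page row above
}
```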

I would love to see results for other platforms. Could we get someone to
run this on an M1 Mac?

Signed-off-by: Joe Richey <joerichey@google.com>
@josephlr josephlr requested a review from newpavlov July 10, 2022 07:31
@josephlr josephlr marked this pull request as ready for review July 10, 2022 07:31
@josephlr
Member Author

josephlr commented Jul 10, 2022

I also locally patched the crate to use the rdrand implementation and got:

```
test bench_large      ... bench:   4,152,659 ns/iter (+/- 29,449) = 505 MB/s
test bench_large_init ... bench:   4,232,638 ns/iter (+/- 48,649) = 495 MB/s
test bench_page       ... bench:       8,120 ns/iter (+/- 72) = 504 MB/s
test bench_page_init  ... bench:       8,156 ns/iter (+/- 65) = 502 MB/s
test bench_seed       ... bench:          63 ns/iter (+/- 0) = 507 MB/s
test bench_seed_init  ... bench:          66 ns/iter (+/- 0) = 484 MB/s
```

Again, these results were quite stable over multiple runs, showing a small improvement from not having to initialize the buffer.
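
For context on these numbers: an RDRAND-based backend essentially fills the buffer one 64-bit word at a time with the `rdrand` instruction, with no syscall involved. A rough sketch of the idea (not the crate's actual implementation, and omitting the retry loop real code needs):

```rust
// Illustrative only: fill a buffer 8 bytes at a time using the RDRAND
// instruction. Real backends retry transient RDRAND failures a bounded
// number of times instead of failing immediately.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "rdrand")]
unsafe fn fill_via_rdrand(dest: &mut [u8]) -> Result<(), ()> {
    use core::arch::x86_64::_rdrand64_step;
    for chunk in dest.chunks_mut(8) {
        let mut word: u64 = 0;
        if _rdrand64_step(&mut word) != 1 {
            return Err(()); // RDRAND reported failure
        }
        chunk.copy_from_slice(&word.to_le_bytes()[..chunk.len()]);
    }
    Ok(())
}
```

Because the cost is per 64-bit word rather than per call, there is no syscall overhead to amortize, which is why the rdrand throughput is nearly flat across buffer sizes.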

For this and the above x86_64 Linux benchmark, I used:

```
RUSTFLAGS="-C opt-level=3 -C codegen-units=1 -C embed-bitcode=yes -C lto=fat -C target-cpu=native"
```

@josephlr
Member Author

josephlr commented Jul 13, 2022

@newpavlov anything blocking merging in these benchmarks? If we merge them in, it will be easier for people to run them on different platforms. This will, in turn, make it easier to figure out if #226 and #271 are worth it.

@josephlr
Member Author

On another system:

  - CPU: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
  - OS: Linux 5.17.11-1
  - Rust Version: rustc 1.64.0-nightly (1c7b36d4d 2022-07-12)

Linux implementation (default):

```
test bench_large      ... bench:   4,703,785 ns/iter (+/- 428,152) = 445 MB/s
test bench_large_init ... bench:   4,728,995 ns/iter (+/- 717,649) = 443 MB/s
test bench_page       ... bench:       9,816 ns/iter (+/- 88) = 417 MB/s
test bench_page_init  ... bench:       9,852 ns/iter (+/- 142) = 415 MB/s
test bench_seed       ... bench:         689 ns/iter (+/- 7) = 46 MB/s
test bench_seed_init  ... bench:         689 ns/iter (+/- 9) = 46 MB/s
```

RDRAND implementation (patched):

```
test bench_large      ... bench:  32,015,164 ns/iter (+/- 148,769) = 65 MB/s
test bench_large_init ... bench:  32,053,932 ns/iter (+/- 118,142) = 65 MB/s
test bench_page       ... bench:      62,012 ns/iter (+/- 647) = 66 MB/s
test bench_page_init  ... bench:      62,674 ns/iter (+/- 290) = 65 MB/s
test bench_seed       ... bench:         490 ns/iter (+/- 6) = 65 MB/s
test bench_seed_init  ... bench:         492 ns/iter (+/- 7) = 65 MB/s
```

Again, the difference is detectable, but very, very small.

@josephlr
Member Author

On an aarch64-unknown-linux-musl system:

  - GCE ARM64 VM t2a-standard-4 (4 vCPU, Ampere Altra)
  - OS: Debian Linux 5.18.0-0.bpo.1-cloud-arm64
  - Rust Version: rustc 1.64.0-nightly (1c7b36d4d 2022-07-12)

Linux implementation:

```
test bench_large      ... bench:   4,826,323 ns/iter (+/- 63,950) = 434 MB/s
test bench_large_init ... bench:   4,871,679 ns/iter (+/- 46,888) = 430 MB/s
test bench_page       ... bench:       9,718 ns/iter (+/- 128) = 421 MB/s
test bench_page_init  ... bench:       9,816 ns/iter (+/- 197) = 417 MB/s
test bench_seed       ... bench:         329 ns/iter (+/- 3) = 97 MB/s
test bench_seed_init  ... bench:         331 ns/iter (+/- 5) = 96 MB/s
```

@josephlr josephlr merged commit 7089766 into master Jul 13, 2022
@josephlr josephlr deleted the bench branch July 13, 2022 13:04