Implement much faster sha256 and sha512. #41

0xdeafbeef · 2021-08-29T18:11:29Z

I took the sha256 and sha512 variants from the Linux sources.
On an AMD Ryzen 9 5900HS, comparing

cargo bench

with

RUSTFLAGS=-Ctarget-feature=+avx2,+aes cargo bench

gives these results:

sha256                  time:   [31.047 ns 31.065 ns 31.083 ns]                    
                        change: [-79.294% -79.275% -79.257%] (p = 0.00 < 0.05)
                        Performance has improved.

sha512                  time:   [135.58 ns 135.79 ns 136.01 ns]                   
                        change: [-34.078% -33.749% -33.500%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

Closes #5
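
For readers who want the implied baseline: Criterion reports `change` relative to the previous run, so the earlier timing can be recovered as new / (1 + change). Below is a minimal sketch of that arithmetic; the function names are made up for illustration and are not part of the PR.

```rust
/// Recover the previous timing from a new timing and a relative change
/// given in percent (negative means the new run is faster).
pub fn baseline_ns(new_ns: f64, change_pct: f64) -> f64 {
    new_ns / (1.0 + change_pct / 100.0)
}

/// Speedup factor implied by the same relative change.
pub fn speedup(change_pct: f64) -> f64 {
    1.0 / (1.0 + change_pct / 100.0)
}
```

With the midpoint estimates above, `baseline_ns(31.065, -79.275)` gives roughly 150 ns for sha256 (about a 4.8x speedup), and `baseline_ns(135.79, -33.749)` gives roughly 205 ns for sha512 (about 1.5x).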

sha2/build.rs

tarcieri · 2021-08-29T18:33:45Z

Looks very interesting, thanks! Left some notes.

0xdeafbeef · 2021-08-29T19:37:31Z

BTW, is Cargo.lock required?

tarcieri · 2021-08-29T19:53:09Z

> BTW, is Cargo.lock required?

What do you mean by that?

newpavlov

Thank you! Looks interesting indeed.

sha2/Cargo.toml

sha2/build.rs

sha2/src/lib.rs

0xdeafbeef · 2021-08-29T20:21:21Z

> > BTW, is Cargo.lock required?
>
> What do you mean by that?

Why is Cargo.lock kept in a library? To pin the cc version?

tarcieri · 2021-08-29T20:34:07Z

@0xdeafbeef it makes the build deterministic, which makes it easier to spot problems arising from particular dependency changes.

It's something we do across the board, although perhaps there are repos like this one which it makes less sense for.

tarcieri · 2021-08-30T15:50:08Z

@0xdeafbeef did you say you compared the core::arch intrinsics version for SHA-NI to the ASM?

If they're the same speed (which is what I'd expect), then it probably doesn't make sense to include ASM SHA-NI support as we already have that case covered in pure Rust.

0xdeafbeef · 2021-08-31T12:57:41Z

> @0xdeafbeef did you say you compared the core::arch intrinsics version for SHA-NI to the ASM?
>
> If they're the same speed (which is what I'd expect), then it probably doesn't make sense to include ASM SHA-NI support as we already have that case covered in pure Rust.

Speed is the same. I think we should include it because if somebody uses the asm feature, they'll get a much slower implementation than without it.

tarcieri · 2021-08-31T13:31:41Z

Since we already have the intrinsic code in the sha2 crate, we can detect the sha extension there and use it if available, only then falling back onto the asm if it isn't available, i.e. SHA-NI intrinsics should be a higher precedence than asm, which AFAIK is how it already works.

Otherwise, there is duplication of the feature across the sha2 and sha2-asm crates.
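
The precedence described here can be sketched as a small selection function. This is only an illustration: the `Backend` enum and `select_backend` are hypothetical names, not the crate's actual code, and the boolean inputs stand in for whatever runtime detection and Cargo feature checks the crate really performs.

```rust
/// Hypothetical set of SHA-256 backends, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    /// SHA-NI through `core::arch` intrinsics (pure Rust).
    ShaNiIntrinsics,
    /// Assembly implementation from the asm crate.
    Asm,
    /// Portable software fallback.
    Soft,
}

/// Pick a backend: SHA-NI intrinsics win when the CPU supports them,
/// assembly is used only if SHA-NI is missing, and the software
/// implementation is the last resort.
pub fn select_backend(has_sha_ni: bool, asm_enabled: bool) -> Backend {
    if has_sha_ni {
        Backend::ShaNiIntrinsics
    } else if asm_enabled {
        Backend::Asm
    } else {
        Backend::Soft
    }
}
```

With this ordering, enabling the asm feature on a SHA-NI capable CPU never selects a slower path, which is the concern raised above.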

tarcieri · 2021-09-04T17:56:54Z

Hmm, build failure seems unrelated I think?

sha2/build.rs

tarcieri

Looks good. One minor suggestion.

tarcieri · 2021-09-05T15:04:16Z

@0xdeafbeef can you rebase? I think #42 should've taken care of the build failures.

newpavlov

Overall, I think it looks good for merging. I have only two nits and it would be nice to rebase it first.

sha2/Cargo.toml

sha2/src/lib.rs

Fix

newpavlov · 2021-09-08T18:38:34Z

sha2/src/lib.rs

 extern "C" {
    fn sha256_compress(state: &mut [u32; 8], block: &[u8; 64]);
+    fn sha256_transform_rorx(state: &mut [u32; 8], block: *const [u8; 64], num_blocks: u64);


You forgot to change num_blocks to usize here. Also, we should probably change sha256_compress to take an explicit pointer and length as well (same for sha512_compress). IIRC the memory layout of slices is not guaranteed.

0xdeafbeef

It seems like it's guaranteed

newpavlov

Your link talks about the layout of the slice itself (i.e. about how the elements of a slice are stored in memory). In this context it's more about ABI guarantees, i.e. I don't think it's currently guaranteed that val: &[u8; 16] is equivalent to val_ptr: *const [u8; 16], len: usize when used in extern "C" fns. Can you please modify the signature just to be extra safe?
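
A minimal sketch of the requested signature style, where only raw pointers and a `usize` count cross the `extern "C"` boundary and a safe wrapper does the conversion. `demo_transform` is a hypothetical stand-in written in Rust so the example links without an assembly object; it is not SHA-256 and not the PR's actual routine.

```rust
/// Stand-in for an assembly routine, defined in Rust so the example links.
/// It is NOT SHA-256: it only adds the first big-endian word of every
/// 64-byte block into `state[0]`.
///
/// Safety: `state` must point to 8 writable `u32`s and `blocks` to
/// `num_blocks * 64` readable bytes.
unsafe extern "C" fn demo_transform(state: *mut u32, blocks: *const u8, num_blocks: usize) {
    unsafe {
        for i in 0..num_blocks {
            let block = blocks.add(i * 64);
            let word = u32::from_be_bytes([*block, *block.add(1), *block.add(2), *block.add(3)]);
            *state = (*state).wrapping_add(word);
        }
    }
}

/// Safe wrapper: the reference types guarantee valid pointers and length.
pub fn compress(state: &mut [u32; 8], blocks: &[[u8; 64]]) {
    // SAFETY: both pointers come from live references, and `blocks.len()`
    // is exactly the number of contiguous 64-byte blocks behind `blocks`.
    unsafe { demo_transform(state.as_mut_ptr(), blocks.as_ptr().cast::<u8>(), blocks.len()) }
}
```

Callers only ever see `compress(&mut state, &blocks)`, so the pointer and length handling stays in one place.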

sha2/src/lib.rs

newpavlov · 2021-09-08T18:44:02Z

sha2/src/lib.rs

@@ -13,23 +13,37 @@
 #[cfg(not(any(target_arch = "x86_64", target_arch = "x86", target_arch = "aarch64")))]
 compile_error!("crate can only be used on x86, x86-64 and aarch64 architectures");

+cpufeatures::new!(cpuid_avx2, "avx2");


Gate this line on #[cfg(any(target_arch = "x86_64", target_arch = "x86"))]. Otherwise it causes a compilation failure on AArch64 targets.

You forgot to modify the compress256 function (see the CI failure). Currently it tries to use the cpuid_avx2 module on all targets. I think the easiest solution would be to introduce two functions with the same name, one gated on x86(-64) and the other on AArch64.
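
The "two functions with the same name" suggestion could look roughly like the sketch below. `compress256_backend` is a hypothetical function that only reports which path would be taken, and it uses `std::is_x86_feature_detected!` rather than the `cpufeatures` module from the diff so that the example has no dependencies.

```rust
/// Variant for x86 and x86-64: AVX2 can be probed at runtime here.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn compress256_backend() -> &'static str {
    if std::is_x86_feature_detected!("avx2") {
        "avx2"
    } else {
        "x86 fallback"
    }
}

/// Variant for AArch64: no x86 feature detection is referenced at all,
/// so nothing x86-specific has to exist on this target.
#[cfg(target_arch = "aarch64")]
pub fn compress256_backend() -> &'static str {
    "aarch64"
}

/// Catch-all so this sketch compiles everywhere; the crate in the diff
/// emits `compile_error!` for other architectures instead.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64", target_arch = "aarch64")))]
pub fn compress256_backend() -> &'static str {
    "unsupported"
}
```

Exactly one of the definitions is compiled for any given target, so call sites need no `cfg` attributes of their own.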

newpavlov · 2021-09-08T22:35:08Z

BTW could you also compare performance of the AVX2 based assembly with the intrinsics-based implementation from RustCrypto/hashes#312?

0xdeafbeef · 2021-09-09T11:06:18Z

asm

test bench1_10    ... bench:          20 ns/iter (+/- 2) = 500 MB/s
test bench2_100   ... bench:         164 ns/iter (+/- 10) = 609 MB/s
test bench3_1000  ... bench:       1,451 ns/iter (+/- 135) = 689 MB/s
test bench4_10000 ... bench:      14,165 ns/iter (+/- 1,319) = 705 MB/s

intrinsic

running 4 tests
test bench1_10    ... bench:          20 ns/iter (+/- 5) = 500 MB/s
test bench2_100   ... bench:         162 ns/iter (+/- 10) = 617 MB/s
test bench3_1000  ... bench:       1,408 ns/iter (+/- 159) = 710 MB/s
test bench4_10000 ... bench:      13,448 ns/iter (+/- 838) = 743 MB/s

Force soft.

running 4 tests
test bench1_10    ... bench:          23 ns/iter (+/- 4) = 434 MB/s
test bench2_100   ... bench:         196 ns/iter (+/- 23) = 510 MB/s
test bench3_1000  ... bench:       1,926 ns/iter (+/- 144) = 519 MB/s
test bench4_10000 ... bench:      18,350 ns/iter (+/- 1,070) = 544 MB/s

I think the asm version is not needed anymore.
Good job, @Rexagon!

0xdeafbeef · 2021-09-09T11:14:52Z

After pinning to the same core
asm

running 4 tests
test bench1_10    ... bench:          19 ns/iter (+/- 0) = 526 MB/s
test bench2_100   ... bench:         152 ns/iter (+/- 3) = 657 MB/s
test bench3_1000  ... bench:       1,339 ns/iter (+/- 28) = 746 MB/s
test bench4_10000 ... bench:      13,041 ns/iter (+/- 343) = 766 MB/s

intrinsic

running 4 tests
test bench1_10    ... bench:          19 ns/iter (+/- 0) = 526 MB/s
test bench2_100   ... bench:         148 ns/iter (+/- 3) = 675 MB/s
test bench3_1000  ... bench:       1,276 ns/iter (+/- 30) = 783 MB/s
test bench4_10000 ... bench:      12,420 ns/iter (+/- 275) = 805 MB/s

@newpavlov should I close the PR?
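
For reference, the MB/s column in these outputs is consistent with bytes per iteration times 1000 divided by nanoseconds per iteration, using integer division (so 1 MB is 10^6 bytes). The sketch below reproduces that arithmetic; the helper names are made up for illustration.

```rust
/// Throughput in MB/s: bytes per iteration, times 1000, divided by
/// nanoseconds per iteration, truncated to an integer.
pub fn mb_per_s(bytes: u64, ns_per_iter: u64) -> u64 {
    bytes * 1000 / ns_per_iter
}

/// How many times faster the run taking `faster_ns` is than the one
/// taking `slower_ns`, for the same input size.
pub fn speed_ratio(faster_ns: u64, slower_ns: u64) -> f64 {
    slower_ns as f64 / faster_ns as f64
}
```

For the pinned 10,000-byte runs, `mb_per_s(10_000, 12_420)` is 805 and `mb_per_s(10_000, 13_041)` is 766, and `speed_ratio(12_420, 13_041)` is 1.05, i.e. the intrinsics version is about 5% faster than the assembly in that run.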

newpavlov · 2021-09-09T11:38:49Z

Hm, I am not 100% sure. Some may prefer the assembly implementation from a reliability point of view, since with an intrinsics-based implementation we are at the mercy of the compiler, and in some cases the achieved performance can be brittle. On the other hand, people usually expect an assembly implementation to be faster than a "software" one.

@tarcieri
What do you think?

tarcieri · 2021-09-09T16:13:00Z

Yeah, it's definitely a tradeoff. I think the biggest risk is actually miscompilation (see e.g. rust-lang/rust#79865).

That said, I'd be weakly in favor of an all-intrinsics approach if performance is comparable to assembly. I think that better fits the philosophy of "Rust Crypto", and unless there are big performance wins with ASM it's probably best avoided, at least within the crates we maintain.

A pure Rust approach solves a lot of problems, especially relating to portability. Relevant: RustCrypto/hashes#315

newpavlov · 2021-09-10T12:09:21Z

I also lean towards the stance "assembly impls only for sufficient performance improvements", so I guess we can close this PR.

@0xdeafbeef
Thank you for your contribution (at the very least, I think it was a trigger for the AVX2 impl), and sorry this PR ended like this!

0xdeafbeef added 4 commits August 29, 2021 19:49

Add sha256-avx2 and sha256-ni

57ef977

Implement fast SHA-512 with AVX2 instructions

354bc0c

Update build.rs

d80c79a

Cleanup

c7dd9e5

tarcieri reviewed Aug 29, 2021

sha2/build.rs (outdated, resolved)

tarcieri reviewed Aug 29, 2021

sha2/build.rs (outdated, resolved)

tarcieri mentioned this pull request Aug 29, 2021

Migrate to assembly from OpenSSL #5

Closed

Add runtime feature detection. Update build

c705ae2

tarcieri requested a review from newpavlov August 29, 2021 19:30

newpavlov reviewed Aug 29, 2021

sha2/Cargo.toml (outdated, resolved)

sha2/build.rs (outdated, resolved)

sha2/src/lib.rs (resolved)

0xdeafbeef added 2 commits August 30, 2021 11:14

Refactor

e17dee4

Fix not linux os build

b685df3

Update

9ad48b8

0xdeafbeef requested a review from tarcieri September 4, 2021 14:18

tarcieri reviewed Sep 4, 2021

sha2/build.rs (outdated, resolved)

tarcieri approved these changes Sep 4, 2021

Refactor build script

efec464

newpavlov reviewed Sep 5, 2021

sha2/Cargo.toml (outdated, resolved)

sha2/src/lib.rs (outdated, resolved)

newpavlov mentioned this pull request Sep 6, 2021

sha2: intrinsics based AVX2 backend RustCrypto/hashes#311

Closed

0xdeafbeef requested a review from newpavlov September 6, 2021 11:10

Small fix

99627cf

Fix

0xdeafbeef force-pushed the master branch from b4a307d to 99627cf Compare September 6, 2021 11:27

newpavlov reviewed Sep 8, 2021

sha2/src/lib.rs (outdated, resolved)

newpavlov reviewed Sep 8, 2021

Fix issues

43813ab

0xdeafbeef requested a review from newpavlov September 8, 2021 20:09

tarcieri mentioned this pull request Sep 10, 2021

MSVC support #17

Open

newpavlov closed this Sep 10, 2021

KyleRicardo mentioned this pull request Jun 22, 2022

Consider remove 'asm' feature or offer an option to disable it shadowsocks/shadowsocks-crypto#16

Closed
