all: use SHA256 with SIMD instructions for higher performance and throughput #700

odeke-em · 2022-06-07T02:47:07Z

In this repository, we heavily use the Go standard library's crypto/sha256. However, there is a Single Instruction Multiple Data (SIMD) package from our friends at Minio, https://github.com/minio/sha256-simd, which promises 8x speedups when using AVX instructions. We should explore this.

Let's explore if performance radically improves and then plumb it in.

Kindly cc-ing my colleague @elias-orijtech

For Admin Use

Not duplicate issue
Appropriate labels applied
Appropriate contributors tagged
Contributor assigned/self-assigned

tac0turtle · 2022-06-08T10:28:19Z

Is it okay to assign this to you and your team @odeke-em

odeke-em · 2022-06-08T15:27:05Z

> Is it okay to assign this to you and your team @odeke-em

Yes, please @marbar3778! We are working on it. I just need to find a machine with AVX512 so that we can produce benchmarks.

ValarDragon · 2022-06-29T17:23:23Z

In support of using that library! Though I think it's probably advisable to turn off AVX-512 via a build flag, given the SDK workload (https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/).

itsdevbear · 2022-10-03T19:54:07Z

+1

kirbyquerby · 2023-02-15T23:57:50Z

Also of interest for this issue are a number of occurrences of crypto.Sha256():

screenshot

These come from what appears to be a helper function that wraps crypto/sha256:

https://github.com/cometbft/cometbft/blob/e9b91405b643b46b011865c4b7e1c1af0aa5c521/crypto/hash.go#L7-L11

We'd probably want to either replace these usages or update cometbft to use the SIMD library as well.

tac0turtle · 2023-02-15T23:59:13Z

Thanks for the insight. I would advocate for replacing the wrapped function, as we are trying to rely less on Comet.

yihuang · 2023-02-16T07:05:42Z

The last time I checked, I didn't see much improvement on the dev machines I have at hand (an x86_64 Mac laptop and an arm64 Linux machine). On the Mac, the stdlib is actually much faster. I just reran the benchmark with go1.20; the results are as follows:

arm64 linux

~/sha256-simd $ go test -run=^$ -bench=. -benchmem ./ -count=1
goos: linux
goarch: arm64
pkg: github.com/minio/sha256-simd
BenchmarkHash/Generic/8Bytes-8         	 2184978	       549.6 ns/op	  14.56 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/Generic/64Bytes-8        	 1000000	      1064 ns/op	  60.17 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/Generic/1K-8             	  139132	      8623 ns/op	 118.76 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/Generic/8K-8             	   18447	     65101 ns/op	 125.83 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/Generic/1M-8             	     144	   8288227 ns/op	 126.51 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/Generic/5M-8             	      28	  41402281 ns/op	 126.63 MB/s	       3 B/op	       0 allocs/op
BenchmarkHash/Generic/10M-8            	      14	  82817517 ns/op	 126.61 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/ArmSha2/8Bytes-8         	11930301	       100.6 ns/op	  79.55 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/ArmSha2/64Bytes-8        	 7533750	       160.1 ns/op	 399.67 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/ArmSha2/1K-8             	 1547152	       775.6 ns/op	1320.21 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/ArmSha2/8K-8             	  224019	      5354 ns/op	1530.03 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/ArmSha2/1M-8             	    1789	    670705 ns/op	1563.39 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/ArmSha2/5M-8             	     356	   3352908 ns/op	1563.68 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/ArmSha2/10M-8            	     178	   6706550 ns/op	1563.51 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/GoStdlib/8Bytes-8        	11268408	       106.6 ns/op	  75.04 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/GoStdlib/64Bytes-8       	 8466012	       141.9 ns/op	 450.98 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/GoStdlib/1K-8            	 1586331	       756.2 ns/op	1354.14 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/GoStdlib/8K-8            	  224902	      5335 ns/op	1535.60 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/GoStdlib/1M-8            	    1789	    670623 ns/op	1563.58 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/GoStdlib/5M-8            	     356	   3352907 ns/op	1563.68 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/GoStdlib/10M-8           	     178	   6703876 ns/op	1564.13 MB/s	       0 B/op	       0 allocs/op
PASS
ok  	github.com/minio/sha256-simd	31.607s

amd64 mac

~/sha256-simd $ go test -run=^$ -bench=. -benchmem ./ -count=1
goos: darwin
goarch: amd64
pkg: github.com/minio/sha256-simd
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkHash/Generic/8Bytes-12          	 2982602	       410.3 ns/op	  19.50 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/Generic/64Bytes-12         	 1540022	       782.3 ns/op	  81.81 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/Generic/1K-12              	  193633	      6219 ns/op	 164.67 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/Generic/8K-12              	   20944	     49602 ns/op	 165.15 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/Generic/1M-12              	     202	   6051028 ns/op	 173.29 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/Generic/5M-12              	      37	  32201704 ns/op	 162.81 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/Generic/10M-12             	      16	  63400945 ns/op	 165.39 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/GoStdlib/8Bytes-12         	 6060865	       188.0 ns/op	  42.56 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/GoStdlib/64Bytes-12        	 3442257	       342.0 ns/op	 187.13 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/GoStdlib/1K-12             	  493141	      2419 ns/op	 423.34 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/GoStdlib/8K-12             	   66552	     18119 ns/op	 452.12 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/GoStdlib/1M-12             	     512	   2310553 ns/op	 453.82 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/GoStdlib/5M-12             	      99	  11535992 ns/op	 454.48 MB/s	       0 B/op	       0 allocs/op
BenchmarkHash/GoStdlib/10M-12            	      44	  23383451 ns/op	 448.43 MB/s	       0 B/op	       0 allocs/op
PASS
ok  	github.com/minio/sha256-simd	20.488s
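
For anyone who wants a quick stdlib-only data point on their own machine without cloning sha256-simd, a minimal sketch follows. It uses `testing.Benchmark` from a plain program; the helper name `benchSum256` and the chosen sizes are assumptions for illustration, and the numbers will not be directly comparable to the package's own benchmark suite above.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"testing"
)

// benchSum256 (hypothetical helper) benchmarks crypto/sha256.Sum256 on a
// zero-filled buffer of the given size and returns the raw result.
func benchSum256(size int) testing.BenchmarkResult {
	data := make([]byte, size)
	return testing.Benchmark(func(b *testing.B) {
		b.SetBytes(int64(size)) // lets the result carry bytes per iteration
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			sha256.Sum256(data)
		}
	})
}

func main() {
	for _, size := range []int{8, 64, 1024, 8 * 1024} {
		r := benchSum256(size)
		// Throughput in MB/s: total bytes hashed divided by elapsed time.
		mbPerSec := float64(r.Bytes) * float64(r.N) / 1e6 / r.T.Seconds()
		fmt.Printf("%6d bytes: %8d ns/op %10.2f MB/s\n", size, r.NsPerOp(), mbPerSec)
	}
}
```

Whether the stdlib takes a hardware-accelerated path depends on the Go version and the CPU, so results should be read together with the `goos`, `goarch` and `cpu` lines of a regular `go test -bench` run.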

kirbyquerby · 2023-02-16T08:08:14Z

@yihuang I did some digging and it looks like the Go standard library has support for ARM SHA extensions and AVX2, which could explain why GoStdlib and ArmSha2 have such similar performance (Generic falls so far behind because it's an implementation that doesn't use hardware acceleration).

sha256-simd advertises improved performance for processors with Intel SHA Extensions or AVX512, which the standard library doesn't have optimizations for.

I didn't see any improvements for cosmos-sdk benchmarks with the simd library on my workstation, which has Intel SHA Extensions (5950x), but I plan to also benchmark on a machine with AVX512.

yihuang · 2023-02-16T08:14:19Z

Actually, the iavl library uses sha256 heavily, so this should have a bigger impact there.

kirbyquerby · 2023-03-01T21:57:57Z

I ran benchmarks for cosmos-sdk and iavl on machines with AVX512 and Intel SHA Extensions with and without using the SIMD library, and got these results: https://gist.github.com/kirbyquerby/6635113b003abdaeaa93618d4e6970a2

There didn't seem to be significant improvements (in many benchmarks, there's even a slowdown) for using the SIMD library in either cosmos-sdk or iavl.

tac0turtle · 2023-03-09T08:46:47Z

It would be interesting to test https://github.com/prysmaticlabs/gohashtree in iavl and see if there is any change.

yihuang · 2023-03-09T21:25:33Z

> would be interesting to test this in iavl https://github.com/prysmaticlabs/gohashtree. see if there is any change

I can reproduce the Intel benchmark result on my Mac laptop: it's faster by 6x if you do at least 16 hashing operations in a batch.
But their API assumes the user always hashes 64 bytes into a 32-byte digest, so it can hard-code the padding block and do multiple hashes in parallel. For the iavl tree:

We don't have a fixed block to hard-code.
To exploit opportunities for parallel hashing, we need to change how we traverse the tree, for example by hashing all the leaf nodes first in a batch, then all the height=1 nodes, etc.

robert-zaremba · 2023-03-10T11:03:05Z

Shall we close this issue and open a new one in IAVL if we want to dig further into gohashtree usage there?

tac0turtle · 2023-03-10T14:51:20Z

I'll transfer this issue there.

> but their api assume user always hash 64bytes into 32bytes digest

We can either modify our code or have a variation of their code.

odeke-em self-assigned this Jun 8, 2022

tac0turtle transferred this issue from cosmos/cosmos-sdk Mar 10, 2023

tac0turtle unassigned odeke-em Nov 16, 2023

tac0turtle added the T:performance label Nov 16, 2023
