aes-gcm: performance is worse than OpenSSL #243

Open
kigawas opened this issue Dec 4, 2020 · 24 comments

Comments

@kigawas

kigawas commented Dec 4, 2020

In my tests with cargo bench, AES-256-GCM performance is much worse than OpenSSL's:

     Running target/release/deps/simple-75040055ea8811ad
Gnuplot not found, using plotters backend
encrypt 100M            time:   [174.63 ms 175.52 ms 176.60 ms]
                        change: [+128.80% +133.74% +138.22%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

decrypt 100M            time:   [137.44 ms 138.20 ms 138.90 ms]
                        change: [+291.03% +294.15% +297.13%] (p = 0.00 < 0.05)
                        Performance has regressed.

It was built with export RUSTFLAGS="-Ctarget-cpu=sandybridge -Ctarget-feature=+aes,+sse2,+sse4.1,+ssse3" as documented.

For OpenSSL:

     Running target/release/deps/simple-8072f89159d02aed
Gnuplot not found, using plotters backend
encrypt 100M            time:   [73.289 ms 73.619 ms 74.055 ms]
                        change: [-2.1748% -0.9188% +0.3013%] (p = 0.18 > 0.05)
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

decrypt 100M            time:   [35.428 ms 35.591 ms 35.757 ms]
                        change: [-0.3106% +0.1991% +0.7224%] (p = 0.47 > 0.05)
                        No change in performance detected.

Environment:

iMac (Retina 5K, 27-inch, 2019), 3.7 GHz 6-Core Intel Core i5

@tarcieri
Member

tarcieri commented Dec 4, 2020

Presently you need to enable RUSTFLAGS as described here for optimum performance:

https://docs.rs/aes-gcm/0.8.0/aes_gcm/#performance-notes

We are working on and have partially implemented autodetection support for these CPU features which will eliminate the need to manually configure RUSTFLAGS and will be available in the next release.
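As a rough illustration of what runtime autodetection involves (a sketch only, not the crate's actual implementation), the standard library's `is_x86_feature_detected!` macro can report whether the features named in the RUSTFLAGS above are present on the running CPU. The function name `detect_cpu_features` is made up for this example.

```rust
/// Reports, at runtime, whether the CPU features discussed in this thread
/// are available. `detect_cpu_features` is a hypothetical helper name.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn detect_cpu_features() -> Vec<(&'static str, bool)> {
    vec![
        ("aes", is_x86_feature_detected!("aes")),
        ("pclmulqdq", is_x86_feature_detected!("pclmulqdq")),
        ("sse2", is_x86_feature_detected!("sse2")),
        ("ssse3", is_x86_feature_detected!("ssse3")),
        ("sse4.1", is_x86_feature_detected!("sse4.1")),
    ]
}

/// Fallback for non-x86 targets: none of the x86 features can be present.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn detect_cpu_features() -> Vec<(&'static str, bool)> {
    vec![
        ("aes", false),
        ("pclmulqdq", false),
        ("sse2", false),
        ("ssse3", false),
        ("sse4.1", false),
    ]
}
```

A library could use a check like this once at startup to choose between a hardware-accelerated code path and a portable fallback, instead of relying on the user to set RUSTFLAGS.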

@kigawas
Author

kigawas commented Dec 4, 2020

Well, it was built with RUSTFLAGS.

Surprisingly, throughput is only about 50% of OpenSSL's for encryption and about 30% for decryption.
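For reference, the relative throughput can be worked out directly from the median timings in the two benchmark logs above, since both process the same amount of data. This is a small sketch; `relative_throughput_percent` is a hypothetical helper name.

```rust
/// Throughput of the measured implementation relative to a reference,
/// in percent, given the time each needs for the same amount of data.
/// `relative_throughput_percent` is a hypothetical helper name.
fn relative_throughput_percent(reference_ms: f64, measured_ms: f64) -> f64 {
    reference_ms / measured_ms * 100.0
}
```

With the medians reported above, `relative_throughput_percent(73.619, 175.52)` gives roughly 42% for encryption and `relative_throughput_percent(35.591, 138.20)` gives roughly 26% for decryption, in the same ballpark as the rounded figures quoted here.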

@tarcieri changed the title from "Performance is much worse than OpenSSL" to "Performance is worse than OpenSSL" on Dec 4, 2020
@tarcieri
Member

tarcieri commented Dec 4, 2020

I'm not sure that much of a difference deserves the qualifier "much".

We've presently been working on features like CPU feature autodetection (which are important) and haven't heavily invested in micro-optimization.

OpenSSL uses heavily optimized hand-written assembly implementations (in the case of AES-GCM, written by cryptography engineers at Intel), so reaching performance parity with those (especially in pure Rust) will be difficult.

@tarcieri changed the title from "Performance is worse than OpenSSL" to "aes-gcm: performance is worse than OpenSSL" on Dec 4, 2020
@tarcieri
Member

tarcieri commented Dec 4, 2020

If anyone would like to work on improving AES-GCM performance, #74 might be a good start

@tarcieri
Member

tarcieri commented Dec 8, 2020

Also note: for optimum performance, pass -Ctarget-cpu=native.

This will significantly improve performance on Skylake, where LLVM will use the VPCLMULQDQ instruction for GHASH.

@newpavlov
Member

In my experience, target-cpu=native often results in degraded performance (one possible explanation is CPU down-clocking due to AVX2 instructions being used here and there), so I would be careful with it.

@kigawas
Author

kigawas commented Dec 9, 2020

Also note: for optimum performance, pass -Ctarget-cpu=native.

This will significantly improve performance on Skylake, where LLVM will use the VPCLMULQDQ instruction for GHASH.

I didn't see any statistically significant difference on iMac 2019, thanks anyway :)

encrypt 100M            time:   [177.03 ms 178.43 ms 181.16 ms]
                        change: [-0.7220% +0.7965% +2.4234%] (p = 0.36 > 0.05)
                        No change in performance detected.

Benchmarking encrypt 200M: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 20.0s. You may wish to increase target time to 20.0s or enable flat sampling.
encrypt 200M            time:   [349.32 ms 353.57 ms 356.14 ms]
                        change: [-2.5313% -1.6411% -0.7211%] (p = 0.00 < 0.05)
                        Change within noise threshold.

decrypt 100M            time:   [142.23 ms 143.17 ms 144.19 ms]
                        change: [+0.4705% +1.1593% +1.8840%] (p = 0.01 < 0.05)
                        Change within noise threshold.

decrypt 200M            time:   [286.54 ms 288.30 ms 289.82 ms]
                        change: [-1.3429% -0.2062% +0.7492%] (p = 0.73 > 0.05)
                        No change in performance detected.

@LuoZijun

@kigawas

On the x86_64 platform, aes-gcm is faster than OpenSSL.

Maybe you could share your benchmark code?

My bench code: LuoZijun/crypto-bench

Bench Result:

X86-64:

| Cipher | OpenSSL | Ring | Sodium | RustCrypto(org) | Crypto2 |
| --- | --- | --- | --- | --- | --- |
| AES-128 | 470 MB/s | N/A | N/A | 615 MB/s | 2666 MB/s ⚡️ |
| AES-128-CCM | N/A | N/A | N/A | 81 MB/s | 231 MB/s ⚡️ |
| AES-128-GCM | 19 MB/s | 158 MB/s | N/A | 122 MB/s | 250 MB/s ⚡️ |
| AES-128-GCM-SIV | N/A | N/A | N/A | 55 MB/s | 110 MB/s ⚡️ |
| AES-128-OCB-TAG128 | 15 MB/s | N/A | N/A | N/A | 216 MB/s ⚡️ |
| AES-128-SIV-CMAC-256 | N/A | N/A | N/A | 35 MB/s | 296 MB/s ⚡️ |
| AES-256 | N/A | N/A | N/A | 444 MB/s | 1777 MB/s ⚡️ |
| AES-256-GCM | N/A | 131 MB/s | 61 MB/s | 107 MB/s | 170 MB/s ⚡️ |
| ChaCha20 | N/A | N/A | N/A | 695 MB/s ⚡️ | 463 MB/s |
| ChaCha20-Poly1305 | 73 MB/s | 210 MB/s ⚡️ | 145 MB/s | 126 MB/s | 143 MB/s |

AArch64:

| Cipher | OpenSSL | Ring | Sodium | RustCrypto(org) | Crypto2 |
| --- | --- | --- | --- | --- | --- |
| AES-128 | 484 MB/s | N/A | N/A | 36 MB/s | 1600 MB/s ⚡️ |
| AES-128-CCM | N/A | N/A | N/A | 6 MB/s | 285 MB/s ⚡️ |
| AES-128-GCM | 22 MB/s | 210 MB/s | N/A | 14 MB/s | 213 MB/s ⚡️ |
| AES-128-GCM-SIV | N/A | N/A | N/A | 4 MB/s | 29 MB/s ⚡️ |
| AES-128-OCB-TAG128 | 18 MB/s | N/A | N/A | N/A | 219 MB/s ⚡️ |
| AES-128-SIV-CMAC-256 | N/A | N/A | N/A | 3 MB/s | 262 MB/s ⚡️ |
| AES-256 | N/A | N/A | N/A | 27 MB/s | 1066 MB/s ⚡️ |
| AES-256-GCM | N/A | 183 MB/s ⚡️ | N/A | 11 MB/s | 177 MB/s |
| ChaCha20 | N/A | N/A | N/A | 309 MB/s | 390 MB/s ⚡️ |
| ChaCha20-Poly1305 | 73 MB/s | 163 MB/s ⚡️ | 128 MB/s | 114 MB/s | 132 MB/s |

@database64128

In #243 (comment), 16B data is used for AES-GCM tests. I bumped the data size to 8 KiB, updated all crates to the latest version, and reran some of the tests.

On i5-7400 (avx2):

test Crypto2::aes_256_gcm           ... bench:       9,983 ns/iter (+/- 91) = 820 MB/s
test Crypto2::chacha20_poly1305     ... bench:      20,256 ns/iter (+/- 69) = 404 MB/s
test Mbedtls::aes_256_gcm           ... bench:      30,379 ns/iter (+/- 387) = 269 MB/s
test Mbedtls::chacha20_poly1305     ... bench:      27,447 ns/iter (+/- 1,127) = 298 MB/s
test OpenSSL::evp_aes_256_gcm       ... bench:       2,844 ns/iter (+/- 33) = 2880 MB/s
test OpenSSL::evp_chacha20_poly1305 ... bench:       4,703 ns/iter (+/- 114) = 1741 MB/s
test Ring::aes_256_gcm              ... bench:       2,529 ns/iter (+/- 117) = 3239 MB/s
test Ring::chacha20_poly1305        ... bench:       4,540 ns/iter (+/- 56) = 1804 MB/s
test RustCrypto::aes_256_gcm        ... bench:       6,667 ns/iter (+/- 90) = 1228 MB/s
test RustCrypto::chacha20_poly1305  ... bench:       6,759 ns/iter (+/- 99) = 1212 MB/s
test Sodium::aes_256_gcm            ... bench:       4,941 ns/iter (+/- 236) = 1657 MB/s
test Sodium::chacha20_poly1305      ... bench:       6,298 ns/iter (+/- 76) = 1300 MB/s

On Intel(R) Xeon(R) Platinum 8272CL (avx512 w/o vaes, vpclmulqdq):

test Crypto2::aes_256_gcm           ... bench:       9,783 ns/iter (+/- 46) = 837 MB/s
test Crypto2::chacha20_poly1305     ... bench:      19,347 ns/iter (+/- 44) = 423 MB/s
test Mbedtls::aes_256_gcm           ... bench:      39,355 ns/iter (+/- 77) = 208 MB/s
test Mbedtls::chacha20_poly1305     ... bench:      27,354 ns/iter (+/- 303) = 299 MB/s
test OpenSSL::evp_aes_256_gcm       ... bench:       2,810 ns/iter (+/- 38) = 2915 MB/s
test OpenSSL::evp_chacha20_poly1305 ... bench:       3,883 ns/iter (+/- 610) = 2109 MB/s
test Ring::aes_256_gcm              ... bench:       2,414 ns/iter (+/- 18) = 3393 MB/s
test Ring::chacha20_poly1305        ... bench:       4,461 ns/iter (+/- 12) = 1836 MB/s
test RustCrypto::aes_256_gcm        ... bench:       6,355 ns/iter (+/- 37) = 1289 MB/s
test RustCrypto::chacha20_poly1305  ... bench:       6,276 ns/iter (+/- 434) = 1305 MB/s
test Sodium::aes_256_gcm            ... bench:       4,824 ns/iter (+/- 12) = 1698 MB/s
test Sodium::chacha20_poly1305      ... bench:       6,575 ns/iter (+/- 654) = 1245 MB/s
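The MB/s column in these listings is consistent with dividing the 8 KiB (8192-byte) input size by the reported ns/iter, using integer arithmetic. A small sketch of that conversion, with the hypothetical helper name `throughput_mb_per_s`:

```rust
/// Converts a benchmark result into MB/s (1 MB = 10^6 bytes).
/// bytes / ns equals GB/s, so multiplying by 1000 yields MB/s.
/// `throughput_mb_per_s` is a hypothetical helper name.
fn throughput_mb_per_s(bytes_per_iter: u64, ns_per_iter: u64) -> u64 {
    bytes_per_iter * 1000 / ns_per_iter
}
```

For example, `throughput_mb_per_s(8192, 6667)` reproduces the 1228 MB/s shown for RustCrypto::aes_256_gcm on the i5-7400, and `throughput_mb_per_s(8192, 2844)` reproduces the 2880 MB/s shown for OpenSSL::evp_aes_256_gcm.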

@Schmid7k

Hello everyone!
For the past few weeks I've been benchmarking native Rust implementations of cryptographic algorithms, especially AES in different block modes. After benchmarking this GCM implementation, I realized that its performance is still very concerning.

The benchmarks focus on encryption and use criterion with the criterion-cycles-per-byte plugin to measure performance in cycles per byte (cpb). GCM lags far behind CTR mode, which produces completely normal results, even though the two should be similar in terms of cpb. On top of that, GCM is even slower than CBC encryption when using a 128-bit key!

Ideally GCM should take ~0.64 cpb on modern hardware, so similar to CTR mode. (source)

All benchmarks were run on an Intel i7-8700K with Turbo Boost disabled and a core clock of 3.7 GHz.

The command used to compile and run the benchmarks (using cargo criterion) is:
RUSTFLAGS="-Ctarget-cpu=native -Ctarget-feature=+aes,+sse2,+sse4.1,+ssse3" cargo criterion --target=x86_64-unknown-linux-gnu

This is the [bench] profile in my Cargo.toml:

[profile.bench]
opt-level = 3
codegen-units = 1
debug = false
debug-assertions = false
lto = "fat"
rpath = false
# panic = "abort"  # ignored for the bench profile
incremental = false

All code can be found here https://github.com/Schmid7k/RustCrypto-AES-Benchmarks

And here are the benchmarks.

AES-GCM benchmark:

aes-gcm/encrypt-128/1024                                                                             
                        time:   [3166.3372 cycles 3167.3106 cycles 3168.1860 cycles]
                        thrpt:  [3.0939 cpb 3.0931 cpb 3.0921 cpb]
aes-gcm/encrypt-256/1024                                                                             
                        time:   [3384.3591 cycles 3386.2510 cycles 3388.1828 cycles]
                        thrpt:  [3.3088 cpb 3.3069 cpb 3.3050 cpb]
aes-gcm/encrypt-128/2048                                                                             
                        time:   [5995.0415 cycles 5998.8007 cycles 6004.5825 cycles]
                        thrpt:  [2.9319 cpb 2.9291 cpb 2.9273 cpb]
aes-gcm/encrypt-256/2048                                                                             
                        time:   [6534.3417 cycles 6540.1580 cycles 6551.4891 cycles]
                        thrpt:  [3.1990 cpb 3.1934 cpb 3.1906 cpb]
aes-gcm/encrypt-128/4096                                                                             
                        time:   [11742.9702 cycles 11745.4626 cycles 11748.5664 cycles]
                        thrpt:  [2.8683 cpb 2.8675 cpb 2.8669 cpb]
aes-gcm/encrypt-256/4096                                                                             
                        time:   [12741.3502 cycles 12746.6942 cycles 12752.3922 cycles]
                        thrpt:  [3.1134 cpb 3.1120 cpb 3.1107 cpb]
aes-gcm/encrypt-128/8192                                                                             
                        time:   [23364.6832 cycles 23397.8105 cycles 23437.0330 cycles]
                        thrpt:  [2.8610 cpb 2.8562 cpb 2.8521 cpb]
aes-gcm/encrypt-256/8192                                                                             
                        time:   [25317.4570 cycles 25334.1060 cycles 25350.5001 cycles]
                        thrpt:  [3.0945 cpb 3.0925 cpb 3.0905 cpb]
aes-gcm/encrypt-128/16384                                                                             
                        time:   [46020.5519 cycles 46041.6698 cycles 46065.9051 cycles]
                        thrpt:  [2.8116 cpb 2.8102 cpb 2.8089 cpb]
aes-gcm/encrypt-256/16384                                                                             
                        time:   [49952.7849 cycles 49993.8998 cycles 50032.0588 cycles]
                        thrpt:  [3.0537 cpb 3.0514 cpb 3.0489 cpb]

AES-CTR benchmark:

aes-ctr/encrypt-128-128LE/1024                                                                            
                        time:   [749.9276 cycles 751.9313 cycles 756.2880 cycles]
                        thrpt:  [0.7386 cpb 0.7343 cpb 0.7324 cpb]
aes-ctr/encrypt-192-128LE/1024                                                                            
                        time:   [856.7136 cycles 856.8474 cycles 856.9919 cycles]
                        thrpt:  [0.8369 cpb 0.8368 cpb 0.8366 cpb]
aes-ctr/encrypt-256-128LE/1024                                                                            
                        time:   [982.0763 cycles 982.3649 cycles 982.7133 cycles]
                        thrpt:  [0.9597 cpb 0.9593 cpb 0.9591 cpb]
aes-ctr/encrypt-128-128LE/2048                                                                            
                        time:   [1496.1683 cycles 1496.3173 cycles 1496.4620 cycles]
                        thrpt:  [0.7307 cpb 0.7306 cpb 0.7306 cpb]
aes-ctr/encrypt-192-128LE/2048                                                                            
                        time:   [1707.5183 cycles 1707.6654 cycles 1707.8320 cycles]
                        thrpt:  [0.8339 cpb 0.8338 cpb 0.8337 cpb]
aes-ctr/encrypt-256-128LE/2048                                                                             
                        time:   [1963.7012 cycles 1966.8363 cycles 1970.6229 cycles]
                        thrpt:  [0.9622 cpb 0.9604 cpb 0.9588 cpb]
aes-ctr/encrypt-128-128LE/4096                                                                             
                        time:   [2989.6105 cycles 2989.9659 cycles 2990.3117 cycles]
                        thrpt:  [0.7301 cpb 0.7300 cpb 0.7299 cpb]
aes-ctr/encrypt-192-128LE/4096                                                                             
                        time:   [3422.4848 cycles 3422.9207 cycles 3423.5527 cycles]
                        thrpt:  [0.8358 cpb 0.8357 cpb 0.8356 cpb]
aes-ctr/encrypt-256-128LE/4096                                                                             
                        time:   [3933.6860 cycles 3935.2462 cycles 3936.6781 cycles]
                        thrpt:  [0.9611 cpb 0.9608 cpb 0.9604 cpb]
aes-ctr/encrypt-128-128LE/8192                                                                             
                        time:   [5947.6723 cycles 5948.6477 cycles 5950.0132 cycles]
                        thrpt:  [0.7263 cpb 0.7262 cpb 0.7260 cpb]
aes-ctr/encrypt-192-128LE/8192                                                                             
                        time:   [6791.4240 cycles 6800.9357 cycles 6815.0054 cycles]
                        thrpt:  [0.8319 cpb 0.8302 cpb 0.8290 cpb]
aes-ctr/encrypt-256-128LE/8192                                                                             
                        time:   [7823.5601 cycles 7827.0004 cycles 7831.2664 cycles]
                        thrpt:  [0.9560 cpb 0.9554 cpb 0.9550 cpb]
aes-ctr/encrypt-128-128LE/16384                                                                             
                        time:   [11855.2398 cycles 11856.0069 cycles 11856.8002 cycles]
                        thrpt:  [0.7237 cpb 0.7236 cpb 0.7236 cpb]
aes-ctr/encrypt-192-128LE/16384                                                                             
                        time:   [13601.7051 cycles 13604.9247 cycles 13610.6503 cycles]
                        thrpt:  [0.8307 cpb 0.8304 cpb 0.8302 cpb]
aes-ctr/encrypt-256-128LE/16384                                                                             
                        time:   [15873.0293 cycles 15929.4985 cycles 15995.4443 cycles]
                        thrpt:  [0.9763 cpb 0.9723 cpb 0.9688 cpb]

AES-CBC benchmark:

aes-cbc/encrypt-128/1024                                                                             
                        time:   [2777.8954 cycles 2777.9441 cycles 2778.0058 cycles]
                        thrpt:  [2.7129 cpb 2.7128 cpb 2.7128 cpb]
aes-cbc/encrypt-192/1024                                                                             
                        time:   [3293.0049 cycles 3293.1646 cycles 3293.3705 cycles]
                        thrpt:  [3.2162 cpb 3.2160 cpb 3.2158 cpb]
aes-cbc/encrypt-256/1024                                                                             
                        time:   [3807.5189 cycles 3808.0681 cycles 3808.6609 cycles]
                        thrpt:  [3.7194 cpb 3.7188 cpb 3.7183 cpb]
aes-cbc/encrypt-128/2048                                                                             
                        time:   [5559.8242 cycles 5561.2808 cycles 5563.0242 cycles]
                        thrpt:  [2.7163 cpb 2.7155 cpb 2.7148 cpb]
aes-cbc/encrypt-192/2048                                                                             
                        time:   [6587.3842 cycles 6587.9764 cycles 6588.6945 cycles]
                        thrpt:  [3.2171 cpb 3.2168 cpb 3.2165 cpb]
aes-cbc/encrypt-256/2048                                                                             
                        time:   [7614.8603 cycles 7616.7361 cycles 7619.1713 cycles]
                        thrpt:  [3.7203 cpb 3.7191 cpb 3.7182 cpb]
aes-cbc/encrypt-128/4096                                                                             
                        time:   [11111.5751 cycles 11112.0278 cycles 11112.5354 cycles]
                        thrpt:  [2.7130 cpb 2.7129 cpb 2.7128 cpb]
aes-cbc/encrypt-192/4096                                                                             
                        time:   [13173.6236 cycles 13174.8960 cycles 13176.3323 cycles]
                        thrpt:  [3.2169 cpb 3.2165 cpb 3.2162 cpb]
aes-cbc/encrypt-256/4096                                                                             
                        time:   [15222.8789 cycles 15231.3374 cycles 15245.7952 cycles]
                        thrpt:  [3.7221 cpb 3.7186 cpb 3.7165 cpb]
aes-cbc/encrypt-128/8192                                                                             
                        time:   [22222.2953 cycles 22225.1423 cycles 22228.3549 cycles]
                        thrpt:  [2.7134 cpb 2.7130 cpb 2.7127 cpb]
aes-cbc/encrypt-192/8192                                                                             
                        time:   [26345.8438 cycles 26348.2611 cycles 26350.9392 cycles]
                        thrpt:  [3.2167 cpb 3.2163 cpb 3.2160 cpb]
aes-cbc/encrypt-256/8192                                                                             
                        time:   [30446.3663 cycles 30447.2046 cycles 30448.0601 cycles]
                        thrpt:  [3.7168 cpb 3.7167 cpb 3.7166 cpb]
aes-cbc/encrypt-128/16384                                                                             
                        time:   [44444.2852 cycles 44447.8077 cycles 44453.9623 cycles]
                        thrpt:  [2.7133 cpb 2.7129 cpb 2.7127 cpb]
aes-cbc/encrypt-192/16384                                                                             
                        time:   [52691.9826 cycles 52696.3064 cycles 52700.8004 cycles]
                        thrpt:  [3.2166 cpb 3.2163 cpb 3.2161 cpb]
aes-cbc/encrypt-256/16384                                                                             
                        time:   [60903.6543 cycles 60908.2833 cycles 60912.7298 cycles]
                        thrpt:  [3.7178 cpb 3.7175 cpb 3.7173 cpb]
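To relate the cpb figures above to the cycle counts and to wall-clock throughput, here is a small conversion sketch. It assumes the fixed 3.7 GHz clock stated earlier in this comment and 1 MB = 10^6 bytes; the helper names are hypothetical.

```rust
/// Cycles per byte for one benchmark sample.
/// `cycles_per_byte` is a hypothetical helper name.
fn cycles_per_byte(cycles: f64, bytes: usize) -> f64 {
    cycles / bytes as f64
}

/// Approximate throughput in MB/s (1 MB = 10^6 bytes) at a fixed clock rate.
/// Assumes the core runs at a constant `clock_hz` with no frequency scaling.
fn cpb_to_mb_per_s(cpb: f64, clock_hz: f64) -> f64 {
    clock_hz / cpb / 1.0e6
}
```

For aes-gcm/encrypt-128/1024, `cycles_per_byte(3167.3106, 1024)` gives about 3.093 cpb, which at 3.7 GHz corresponds to roughly 1.2 GB/s. The matching CTR result of 0.7343 cpb corresponds to roughly 5.0 GB/s, so GCM is about 4.2 times slower than CTR in these measurements.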

@tarcieri
Member

@Schmid7k we already have criterion benchmarks that make use of criterion-cycles-per-byte here:

https://github.com/RustCrypto/AEADs/tree/master/benches

@newpavlov
Member

@Schmid7k
Note that you usually do not need -Ctarget-feature when -Ctarget-cpu=native is specified. The compiler will use all features available on your CPU.

Also, curiously enough, -Ctarget-cpu=native often results in worse codegen. For example, using only -Ctarget-feature results in 15-20% better throughput on my AMD Ryzen 7 2700X based PC compared to -Ctarget-cpu=native (0.49 vs 0.57 cpb).

AES-GCM should improve significantly once RustCrypto/traits#965 lands.

@Schmid7k

@tarcieri Oh yeah I think I missed that.

@newpavlov Actually in my case it improves cpb by 0.1 - 0.2. I already tried all combinations of turning options on and off and what I have right now gives me the best performance overall.

Ahh I see, then I will look out for that!

@tarcieri
Member

After RustCrypto/traits#965 lands I can try implementing #74 again. If the code optimizes correctly it should double the performance.

Also now that inline ASM is stable, we can add an asm feature and optionally use optimized inline ASM.

@Schmid7k

Schmid7k commented Apr 1, 2022

I found out another interesting thing. Using nightly-2022-01-01-x86_64-unknown-linux-gnu as compiler actually improves the performance of AES-GCM on my machine compared to using the latest nightly compiler.
It's a difference of 0.3 cpb in both 128-bit and 256-bit cases.
Latest stable also produces better results, so there must have been something between the release of the latest stable on 24.02.2022 and the latest nightly that led to even further performance degradation.

@Schmid7k

@Schmid7k Note that you usually do not need -Ctarget-feature when -Ctarget-cpu=native is specified. The compiler will use all features available on your CPU.

Also, curiously enough, -Ctarget-cpu=native often results in worse codegen. For example, using only -Ctarget-feature results in 15-20% better throughput on my AMD Ryzen 7 2700X based PC compared to -Ctarget-cpu=native (0.49 vs 0.57 cpb).

AES-GCM should improve significantly once RustCrypto/traits#965 lands.

@newpavlov I just noticed that you mentioned a cpb measurement of 0.49 vs 0.57 in your comment. Is this for aes-gcm or some other mode?

@newpavlov
Member

@Schmid7k
Those results are for CTR.

@Schmid7k

Schmid7k commented Apr 24, 2022

I see, alright then. By the way, I don't know if this is interesting to you, but I found that the performance difference between specifying -Ctarget-feature and -Ctarget-cpu depends HEAVILY on the AES mode.
For example, when specifying -Ctarget-cpu while benchmarking aes-cbc I get extremely bad results, around 9-11 cpb. In contrast, -Ctarget-feature gives normal results.
aes-gcm, on the other hand, gives much better results with -Ctarget-cpu than with -Ctarget-feature.
What could be the reason for this? I tested on both an AMD Ryzen 7 4700U and an Intel i7-8700K, and both behaved the same way.

@newpavlov
Member

IIUC target-cpu=native mainly allows the compiler to do two things: unconditionally enable all target features available on the CPU, and use CPU-specific values for the latency, throughput, and port usage of instructions. The biggest issue with the former is that it enables AVX2 instructions, which can cause the CPU to reduce its operating frequency. The core code does not rely on such instructions, so they are used only sparsely, meaning you get the reduced frequency without fully utilizing AVX2 capabilities. In theory this should not influence cpb, but it's not so trivial. Read this blog post for more information.

It is also possible that, for some reason, target-cpu=native causes bad codegen in the CBC case. You will need to inspect the generated assembly to see if that is indeed the case.

This is why I generally prefer to not rely on target-cpu=native.
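One way to see which target features a given set of flags actually enabled at compile time is the `cfg!(target_feature = "...")` macro. The sketch below is only an illustration; the function name `statically_enabled_features` is hypothetical, and the feature list is just the subset mentioned in this thread.

```rust
/// Lists whether selected target features were enabled at COMPILE time,
/// i.e. via -Ctarget-feature or -Ctarget-cpu. This is independent of what
/// the CPU running the binary supports.
/// `statically_enabled_features` is a hypothetical helper name.
fn statically_enabled_features() -> Vec<(&'static str, bool)> {
    vec![
        ("aes", cfg!(target_feature = "aes")),
        ("pclmulqdq", cfg!(target_feature = "pclmulqdq")),
        ("sse4.1", cfg!(target_feature = "sse4.1")),
        ("avx2", cfg!(target_feature = "avx2")),
    ]
}
```

Building the same program once with only -Ctarget-feature and once with -Ctarget-cpu=native, then comparing the two lists, shows whether AVX2 was switched on implicitly. Running `rustc --print cfg -C target-cpu=native` prints the same information without compiling a program.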

@Schmid7k

I understand, thanks for the insights!

@azet

azet commented Jul 5, 2022

hey Rustycrypto,

I think OpenSSL performance is an unfair comparison; as @tarcieri noted earlier in this thread, OpenSSL has a dedicated person writing hand-crafted assembly for different instruction sets, with Perl scripts to take away the pain of updating for CPU-specific feature novelties, variations and new models. OpenSSL is now a fairly well-funded project by FOSS standards. That person actually fixes more bugs in OpenSSL than he ever introduced as well. So is it a good idea to do the same with an unsafe { asm! { ... }} in a programming language whose paradigms forbid general use of such hacks? I don't think so. You can still use a foreign function interface to access low-level OpenSSL cipher primitives if you need the optimized code speed in some application where it really matters (e.g. https://tests.reproducible-builds.org/debian/rb-pkg/unstable/amd64/rust-openssl.html)

I had more to say but GitHub swallowed my original comment draft so that's it for now.

PS: I don't see OCB anywhere :P

Happy hacking,
azet

@smessmer

Are there any plans on improving performance? It's not only slow when compared to openssl. In my benchmarks, the aes-gcm implementation is about 2x as slow as the sodiumoxide implementation. Unfortunately, sodiumoxide isn't maintained anymore.

@tarcieri
Member

tarcieri commented May 24, 2023

I think we're bottlenecked on the trait design of universal-hash, which prevents data from flowing through SIMD registers and instead forces it to be loaded from and stored to RAM.

Without that we can't take advantage of pipelining between AES-NI and (P)CLMUL(QDQ), which would give us an expected 2X speedup, as it were. I had an issue for that here, which we should probably reopen:

RustCrypto/traits#444

See also: #74

As I mentioned before in this issue, we could also include inline ASM implementations for certain platforms, gated under an asm feature.
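The structural difference between a two-pass design and a fused one can be sketched with toy stand-ins. This is only an illustration of the data flow: `toy_keystream` and `toy_mac_update` are made-up placeholders, NOT AES or GHASH, and the scalar code does not demonstrate the actual speedup, which would come from keeping each block in SIMD registers and interleaving AES-NI with CLMUL instructions.

```rust
/// Toy keystream generator standing in for AES-CTR. NOT a real cipher.
fn toy_keystream(key: u64, counter: u64) -> [u8; 8] {
    let mut x = key ^ counter.wrapping_mul(0x9E37_79B9_7F4A_7C15);
    x ^= x >> 29;
    x = x.wrapping_mul(0xBF58_476D_1CE4_E5B9);
    x ^= x >> 32;
    x.to_le_bytes()
}

/// Toy accumulator standing in for GHASH. NOT a real MAC.
fn toy_mac_update(acc: u64, block: &[u8]) -> u64 {
    let mut buf = [0u8; 8];
    buf[..block.len()].copy_from_slice(block); // zero-pad a short final block
    (acc ^ u64::from_le_bytes(buf)).wrapping_mul(0x0000_0100_0000_01B3)
}

/// Two passes: encrypt the whole buffer, then read it all again to
/// authenticate. Every block travels through memory twice.
fn encrypt_two_pass(key: u64, data: &mut [u8]) -> u64 {
    for (i, chunk) in data.chunks_mut(8).enumerate() {
        let ks = toy_keystream(key, i as u64);
        for (b, k) in chunk.iter_mut().zip(ks.iter()) {
            *b ^= *k;
        }
    }
    data.chunks(8).fold(0u64, |acc, chunk| toy_mac_update(acc, chunk))
}

/// One fused pass: each block is authenticated right after it is encrypted,
/// while it is still "hot" (in a real implementation, still in a register).
fn encrypt_fused(key: u64, data: &mut [u8]) -> u64 {
    let mut acc = 0u64;
    for (i, chunk) in data.chunks_mut(8).enumerate() {
        let ks = toy_keystream(key, i as u64);
        for (b, k) in chunk.iter_mut().zip(ks.iter()) {
            *b ^= *k;
        }
        acc = toy_mac_update(acc, chunk);
    }
    acc
}
```

Both functions produce the same ciphertext and the same tag. The fused form is the shape that lets a real implementation overlap the cipher and the universal hash instead of round-tripping each block through memory between them.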

@tarcieri
Member

Another option would be to add architecture-specific low-level APIs to crates like aes/polyval and chacha20/poly1305 which operate in terms of platform-native SIMD buffers, sidestepping the current trait-based APIs.

If we can get things performing well that way, I think it could help inform the overall trait design for RustCrypto/traits#444.

9 participants