
AVX512 + simd binary and/or kernels slower than autovectorized version #1829

Closed
jhorstmann opened this issue Jun 9, 2022 · 2 comments · Fixed by #1830
Labels
arrow Changes to the arrow crate bug

Comments

jhorstmann (Contributor) commented Jun 9, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Related to extending the tests for different feature flags (#1822), I wanted to take another look at the avx512 feature and its performance. Benchmarks were run on an i9-11900KB @ 3 GHz (turbo disabled) with

RUSTFLAGS="-Ctarget-cpu=native -Ctarget-feature=-prefer-256-bit"

(The second flag may need some explanation: it disables the prefer-256-bit tuning, which otherwise makes LLVM stick to 256-bit vectors; with it disabled, LLVM uses the full 512-bit vectors.)

For some reason the second benchmark is always significantly slower when both are run together; running them separately gives the same (higher) performance, and the assembly looks identical except for the and/or instruction. I'm guessing this is branch-predictor or allocator related.

Default features

$ cargo +nightly bench --bench buffer_bit_ops
buffer_bit_ops/buffer_bit_ops and                                                                            
                        time:   [134.57 ns 134.90 ns 135.28 ns]
                        thrpt:  [105.74 GiB/s 106.04 GiB/s 106.30 GiB/s]
buffer_bit_ops/buffer_bit_ops or                                                                            
                        time:   [275.55 ns 276.22 ns 277.03 ns]
                        thrpt:  [51.637 GiB/s 51.789 GiB/s 51.914 GiB/s]

Simd feature

$ cargo +nightly bench --features simd --bench buffer_bit_ops
buffer_bit_ops/buffer_bit_ops and                                                                            
                        time:   [168.90 ns 169.10 ns 169.32 ns]
                        thrpt:  [84.486 GiB/s 84.596 GiB/s 84.697 GiB/s]
buffer_bit_ops/buffer_bit_ops or                                                                            
                        time:   [303.13 ns 303.27 ns 303.45 ns]
                        thrpt:  [47.142 GiB/s 47.169 GiB/s 47.192 GiB/s]

Avx512 feature

$ cargo +nightly bench --features avx512 --bench buffer_bit_ops -- 
buffer_bit_ops/buffer_bit_ops and                                                                            
                        time:   [165.46 ns 165.95 ns 166.83 ns]
                        thrpt:  [85.745 GiB/s 86.203 GiB/s 86.458 GiB/s]
buffer_bit_ops/buffer_bit_ops or                                                                            
                        time:   [310.63 ns 311.32 ns 312.04 ns]
                        thrpt:  [45.844 GiB/s 45.950 GiB/s 46.052 GiB/s]

The generated assembly for simd and avx512 looks identical; the loop processes 512 bits (64 bytes) per iteration. The auto-vectorized version instead gets unrolled 4 times, which reduces the loop overhead, so each iteration processes 4 × 512 bits.
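For reference, a minimal sketch of the kind of chunked bitwise kernel that LLVM auto-vectorizes and unrolls well (the function name `bitwise_and` is illustrative, not the actual arrow buffer code):

```rust
/// Bitwise AND of two equal-length byte buffers.
/// A straightforward elementwise loop like this is exactly what LLVM turns
/// into wide vector instructions (and unrolls) under
/// RUSTFLAGS="-Ctarget-cpu=native -Ctarget-feature=-prefer-256-bit".
fn bitwise_and(left: &[u8], right: &[u8]) -> Vec<u8> {
    assert_eq!(left.len(), right.len());
    left.iter().zip(right.iter()).map(|(l, r)| l & r).collect()
}

fn main() {
    let a = vec![0b1100_1010u8; 64];
    let b = vec![0b1010_1100u8; 64];
    let out = bitwise_and(&a, &b);
    // 0b1100_1010 & 0b1010_1100 == 0b1000_1000 for every byte
    assert!(out.iter().all(|&x| x == 0b1000_1000));
    println!("ok");
}
```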

Describe the solution you'd like

Given these benchmark results, it seems we can remove the avx512 feature and simplify the buffer code.

Compiler auto-vectorization has probably improved since we added the avx512 feature, or the creation of buffers using from_trusted_len_iter led to some improvements.

Describe alternatives you've considered

An avx512 feature for other kernels would still be very useful. AVX-512, for example, has instructions that essentially implement the filter kernel for primitives in a single instruction, and it is unlikely that these will be supported in a portable way soon (rust-lang/portable-simd#240).
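As a point of reference, here is a scalar sketch of what such a filter kernel computes (the name `filter_i32` is illustrative, not arrow's API). AVX-512's compress instructions (vpcompressd/vpcompressq) perform this mask-driven compaction for a whole vector of lanes in one instruction:

```rust
/// Scalar reference for the filter kernel on primitives: keep only the
/// values whose corresponding mask bit is set, preserving order.
/// A dedicated AVX-512 kernel could do this 16 i32 lanes at a time
/// with a single compress instruction.
fn filter_i32(values: &[i32], mask: &[bool]) -> Vec<i32> {
    values
        .iter()
        .zip(mask.iter())
        .filter_map(|(&v, &keep)| keep.then_some(v))
        .collect()
}

fn main() {
    let values = [10, 20, 30, 40];
    let mask = [true, false, true, true];
    assert_eq!(filter_i32(&values, &mask), vec![10, 30, 40]);
    println!("ok");
}
```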

@jhorstmann jhorstmann added the enhancement Any new improvement worthy of a entry in the changelog label Jun 9, 2022
jhorstmann (Contributor, Author) commented:
Amazingly, the auto-vectorized version of and/or with non-zero offsets also runs at about 50 GiB/s.

tatsuya6502 commented:
> For some reason the second benchmark is always significantly slower when both are run together; running them separately gives the same (higher) performance, and the assembly looks identical except for the and/or instruction. I'm guessing this is branch-predictor or allocator related.

You might want to sample CPU frequencies with lscpu -e or something similar while running these benchmarks. Since AVX-512 SIMD instructions consume much more power than regular 64-bit instructions (the registers are eight times wider), they can produce more heat, and CPU cores can reduce their base frequency.

This article by Cloudflare explains accidental AVX-512 throttling:
https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/

Note that the article was written in 2017, and today's processors may handle this better; I am not sure whether it still applies.

@alamb alamb added arrow Changes to the arrow crate bug and removed enhancement Any new improvement worthy of a entry in the changelog labels Jun 23, 2022