
Remove simd and avx512 bitwise kernels in favor of autovectorization #1830

Merged
merged 2 commits into apache:master on Jun 12, 2022

Conversation

@jhorstmann (Contributor)

Which issue does this PR close?

Closes #1829.

Rationale for this change

The autovectorized implementation is actually faster, allowing us to simplify the buffer code. Benchmark results are in the linked issue.
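
For context, a minimal sketch (not the exact code in this diff; names are illustrative) of the style of kernel the buffer code keeps after this change: a plain scalar loop that LLVM autovectorizes on its own.

/// Bitwise AND of two equal-length byte slices. The straightforward
/// element-wise loop is enough for LLVM to autovectorize; no explicit
/// SIMD intrinsics or feature gates are needed.
fn bitwise_and(left: &[u8], right: &[u8], out: &mut [u8]) {
    assert_eq!(left.len(), right.len());
    assert_eq!(left.len(), out.len());
    for ((o, l), r) in out.iter_mut().zip(left).zip(right) {
        *o = *l & *r;
    }
}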

What changes are included in this PR?

Are there any user-facing changes?

Removal of the avx512 feature could be a breaking change if someone had it enabled.

@github-actions bot added the arrow (Changes to the arrow crate) label on Jun 9, 2022
@codecov-commenter commented Jun 9, 2022

Codecov Report

Merging #1830 (40eb503) into master (db41b33) will decrease coverage by 0.04%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1830      +/-   ##
==========================================
- Coverage   83.53%   83.49%   -0.05%     
==========================================
  Files         200      201       +1     
  Lines       56798    56902     +104     
==========================================
+ Hits        47449    47511      +62     
- Misses       9349     9391      +42     
Impacted Files Coverage Δ
arrow/src/buffer/ops.rs 96.77% <100.00%> (ø)
arrow/src/compute/kernels/temporal.rs 95.77% <0.00%> (-1.36%) ⬇️
parquet/src/encodings/encoding.rs 93.46% <0.00%> (-0.20%) ⬇️
arrow/src/array/mod.rs 100.00% <0.00%> (ø)
parquet/src/arrow/mod.rs 44.44% <0.00%> (ø)
...arquet/src/arrow/array_reader/dictionary_buffer.rs
parquet/src/arrow/bit_util.rs
parquet/src/arrow/array_reader/offset_buffer.rs
parquet/src/arrow/levels.rs
parquet/src/arrow/converter.rs
... and 16 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@tustvold (Contributor)

I plan to run these benchmarks later today on a server-class machine and confirm there is no performance delta

@tustvold (Contributor)

On an Intel Cascade Lake Xeon(R) CPU @ 3.10GHz, specifically a GCP c2-standard-16:

Nightly with defaults

buffer_bit_ops and      time:   [419.91 ns 420.01 ns 420.12 ns]
buffer_bit_ops or       time:   [550.34 ns 550.43 ns 550.54 ns]

Nightly with simd

buffer_bit_ops and      time:   [252.84 ns 253.71 ns 254.75 ns]
buffer_bit_ops or       time:   [276.70 ns 276.77 ns 276.85 ns]

Nightly with avx512

buffer_bit_ops and      time:   [338.87 ns 339.09 ns 339.37 ns]
buffer_bit_ops or       time:   [365.72 ns 365.78 ns 365.86 ns]

Nightly with defaults and RUSTFLAGS="-Ctarget-cpu=native"

buffer_bit_ops and      time:   [177.27 ns 177.32 ns 177.38 ns]                               
buffer_bit_ops or       time:   [290.42 ns 290.47 ns 290.52 ns]     

Nightly with simd and RUSTFLAGS="-Ctarget-cpu=native"

buffer_bit_ops and      time:   [199.39 ns 199.42 ns 199.45 ns]                               
buffer_bit_ops or       time:   [227.88 ns 227.93 ns 227.98 ns]

Nightly with avx512 and RUSTFLAGS="-Ctarget-cpu=native"

buffer_bit_ops and      time:   [199.58 ns 199.64 ns 199.73 ns]                               
buffer_bit_ops or       time:   [229.27 ns 229.30 ns 229.34 ns]  

Nightly with defaults and RUSTFLAGS="-Ctarget-cpu=native -Ctarget-feature=-prefer-256-bit"

buffer_bit_ops and      time:   [166.14 ns 166.19 ns 166.26 ns]
buffer_bit_ops or       time:   [208.24 ns 208.30 ns 208.36 ns]

Nightly with simd and RUSTFLAGS="-Ctarget-cpu=native -Ctarget-feature=-prefer-256-bit"

buffer_bit_ops and      time:   [197.55 ns 197.58 ns 197.60 ns]
buffer_bit_ops or       time:   [223.72 ns 223.79 ns 223.86 ns]

Nightly with avx512 and RUSTFLAGS="-Ctarget-cpu=native -Ctarget-feature=-prefer-256-bit"

buffer_bit_ops and      time:   [200.34 ns 200.38 ns 200.41 ns]
buffer_bit_ops or       time:   [328.80 ns 328.84 ns 328.89 ns]

Stable with defaults and RUSTFLAGS="-Ctarget-cpu=native"

buffer_bit_ops and      time:   [178.72 ns 178.77 ns 178.82 ns]                               
buffer_bit_ops or       time:   [294.65 ns 294.69 ns 294.74 ns]      

Stable with defaults and RUSTFLAGS="-Ctarget-cpu=native -Ctarget-feature=-prefer-256-bit"

buffer_bit_ops and      time:   [176.34 ns 176.82 ns 177.45 ns]                               
buffer_bit_ops or       time:   [200.99 ns 201.08 ns 201.17 ns]    

Conclusion

  • The simd feature is always faster than the avx512 feature
  • With target-cpu=native, the LLVM-autovectorized buffer_bit_ops is faster than the simd version for and, though slower for or
  • With target-cpu=native and target-feature=-prefer-256-bit, the LLVM-generated code is faster than either of the hand-rolled loops
  • Performance between stable and nightly is very similar
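
For anyone reproducing the numbers above, each configuration corresponds to an invocation along these lines (the bench target name matches the benchmark shown; the exact cargo arguments are an assumption, not quoted from this thread):

RUSTFLAGS="-Ctarget-cpu=native -Ctarget-feature=-prefer-256-bit" cargo +nightly bench --bench buffer_bit_ops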

@tustvold (Contributor) left a comment:

Perhaps we should call out the recommended RUSTFLAGS somewhere if we don't already.

@viirya (Member) left a comment:

The benchmarks look great. I ran some internal benchmarks with simd disabled and the RUSTFLAGS set, and I don't see a performance downgrade.

@viirya (Member) commented Jun 10, 2022

One question: RUSTFLAGS="-Ctarget-cpu=xxx" optimizes for a specific CPU model, while the existing simd feature does not require specifying a CPU. Compared with the existing simd feature, are the RUSTFLAGS less general?

@jhorstmann (Contributor, Author)

Thanks for reproducing the results! It's interesting that you also see the effect of the second benchmark function being slower. I don't think it's downclocking related: on my machine the frequency looked constant, and Cascade Lake shouldn't downclock for these instructions either. Pinning the process to one core also did not make a difference. But I assume this is an effect that is only visible in a microbenchmark and will not show up in larger programs.

@jhorstmann (Contributor, Author)

One question: RUSTFLAGS="-Ctarget-cpu=xxx" optimizes for a specific CPU model, while the existing simd feature does not require specifying a CPU. Compared with the existing simd feature, are the RUSTFLAGS less general?

The simd feature also benefits from specifying the target-cpu. Without target-cpu, the compiler has to produce code for a common baseline. For x86_64, that baseline includes the SSE2 instructions, which operate on 128-bit vectors, but it misses many useful features added later. My guess is that the compiler decides it is not worthwhile to use SSE2 instructions for this loop, but its heuristic seems wrong here.

There is probably some additional compiler flag that would make the standard kernel as fast as the simd kernel for the default target as well. We probably can't expect everyone to set several flags, though.
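
As a side note, one way to see which vector features a given target-cpu setting enables is to print the compiler configuration (standard rustc flags; the grep is only there to filter the output):

rustc -Ctarget-cpu=native --print cfg | grep target_feature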

@jhorstmann (Contributor, Author)

Would someone be able to also benchmark this on ARM, for example on an Apple M1 or AWS graviton, to avoid any regressions there?

I'm assuming that someone who cares about performance enough to enable the simd feature would also target a specific CPU baseline. But I would suggest getting some more feedback before merging.

@nevi-me (Contributor) commented Jun 12, 2022


Would someone be able to also benchmark this on ARM, for example on an Apple M1 or AWS graviton, to avoid any regressions there?

I'm assuming that someone who cares about performance enough to enable the simd feature would also target a specific CPU baseline. But I would suggest getting some more feedback before merging.

I'll run a benchmark and report back in an hour or two.

@nevi-me (Contributor) commented Jun 12, 2022

M1 Pro

  • nosimd = nightly master with RUSTFLAGS="-Ctarget-cpu=native"
  • simd = nightly master with the above flags and --features simd
  • pr1830 = nightly on this PR with the above flags
  • stable = stable on this PR with target-cpu=native (didn't check master as there should be no difference)
  • stablenoflags = stable on this PR with no flags
group                                nosimd                                 pr1830                                 simd                                   stable                                 stablenoflags
-----                                ------                                 ------                                 ----                                   ------                                 -------------
buffer_binary_ops/and                1.01    153.2±0.97ns    93.4 GB/sec    1.03    157.6±8.65ns    90.8 GB/sec    1.25    189.8±1.68ns    75.4 GB/sec    1.03   156.6±21.14ns    91.4 GB/sec    1.00    152.4±1.84ns    93.9 GB/sec
buffer_binary_ops/and_with_offset    1.02    536.2±4.21ns    26.7 GB/sec    1.02    533.9±2.03ns    26.8 GB/sec    1.03    541.7±4.97ns    26.4 GB/sec    1.00    524.5±2.23ns    27.3 GB/sec    1.01    527.3±1.98ns    27.1 GB/sec
buffer_binary_ops/or                 1.04    156.3±7.65ns    91.5 GB/sec    1.02    154.2±0.76ns    92.8 GB/sec    1.31   197.4±47.84ns    72.5 GB/sec    1.00    150.7±2.41ns    94.9 GB/sec    1.01    151.5±1.14ns    94.4 GB/sec
buffer_binary_ops/or_with_offset     1.02    539.7±4.22ns    26.5 GB/sec    1.02    539.1±2.84ns    26.5 GB/sec    1.04    549.1±4.72ns    26.1 GB/sec    1.00    530.2±3.16ns    27.0 GB/sec    1.01    532.8±6.84ns    26.8 GB/sec
buffer_unary_ops/not                 1.17   197.0±43.99ns    48.4 GB/sec    1.09    183.2±3.57ns    52.1 GB/sec    1.00    168.7±3.64ns    56.5 GB/sec    1.08    182.1±4.18ns    52.4 GB/sec    1.08    181.7±3.15ns    52.5 GB/sec
buffer_unary_ops/not_with_offset     1.00    363.8±6.18ns    26.2 GB/sec    1.02    369.1±2.30ns    25.8 GB/sec    1.02    368.6±1.09ns    25.9 GB/sec    1.02   368.8±23.20ns    25.9 GB/sec    1.00    362.6±1.81ns    26.3 GB/sec

A bit hard to interpret the results, but what I can see is that stable with a CPU target flag isn't much better than without. That could make sense, as there shouldn't be (m)any differences between the ARMv8 CPUs that are supported, unlike x64 where there are all sorts of SIMD extensions.

Of the 6 results, 5 are fastest on stable, but not by a large margin compared to the other options, unlike what we saw with x64 in @tustvold's results.

Seeing as there's no regression, I'll merge this. Thanks @jhorstmann!

@nevi-me nevi-me merged commit fb697ce into apache:master Jun 12, 2022
The following lines are from the removed simd kernel; @jhorstmann (Contributor, Author) commented on them:

let mut result = MutableBuffer::new(len).with_bitset(len, false);
let lanes = u8x64::lanes();

let mut left_chunks = left.as_slice()[left_offset..].chunks_exact(lanes);

This was actually also buggy and should have been sliced using [left_offset..left_offset + len]. The effect was that if the buffer was larger than len, the remainders would not line up.
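
For illustration, a minimal sketch of the corrected slicing (the right-hand side is an assumed symmetric counterpart; names follow the snippet above):

// Bound each slice to exactly `len` bytes starting at its offset, so the
// `chunks_exact` remainders of `left` and `right` cover the same logical range.
let mut left_chunks =
    left.as_slice()[left_offset..left_offset + len].chunks_exact(lanes);
let mut right_chunks =
    right.as_slice()[right_offset..right_offset + len].chunks_exact(lanes);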

Linked issue #1829: AVX512 + simd binary and/or kernels slower than autovectorized version