
Remove simd and avx512 bitwise kernels in favor of autovectorization #1830

Merged
merged 2 commits into apache:master on Jun 12, 2022

Conversation

@jhorstmann (Contributor)

Which issue does this PR close?

Closes #1829.

Rationale for this change

The autovectorized implementation is actually faster, allowing us to simplify the buffer code. Benchmark results are in the linked issue.
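
For context, a minimal sketch (not the exact code in this diff; names are illustrative) of the style of kernel the buffer code keeps after this change: a plain scalar loop that LLVM autovectorizes on its own.

/// Bitwise AND of two equal-length byte slices. The straightforward
/// element-wise loop is enough for LLVM to autovectorize; no explicit
/// SIMD intrinsics or feature gates are needed.
fn bitwise_and(left: &[u8], right: &[u8], out: &mut [u8]) {
    assert_eq!(left.len(), right.len());
    assert_eq!(left.len(), out.len());
    for ((o, l), r) in out.iter_mut().zip(left).zip(right) {
        *o = *l & *r;
    }
}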

What changes are included in this PR?

Are there any user-facing changes?

Removal of the avx512 feature could be a breaking change if someone had it enabled.

@github-actions bot added the arrow (Changes to the arrow crate) label on Jun 9, 2022
@codecov-commenter commented Jun 9, 2022

Codecov Report

Merging #1830 (40eb503) into master (db41b33) will decrease coverage by 0.04%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1830      +/-   ##
==========================================
- Coverage   83.53%   83.49%   -0.05%     
==========================================
  Files         200      201       +1     
  Lines       56798    56902     +104     
==========================================
+ Hits        47449    47511      +62     
- Misses       9349     9391      +42     
Impacted Files Coverage Δ
arrow/src/buffer/ops.rs 96.77% <100.00%> (ø)
arrow/src/compute/kernels/temporal.rs 95.77% <0.00%> (-1.36%) ⬇️
parquet/src/encodings/encoding.rs 93.46% <0.00%> (-0.20%) ⬇️
arrow/src/array/mod.rs 100.00% <0.00%> (ø)
parquet/src/arrow/mod.rs 44.44% <0.00%> (ø)
...arquet/src/arrow/array_reader/dictionary_buffer.rs
parquet/src/arrow/bit_util.rs
parquet/src/arrow/array_reader/offset_buffer.rs
parquet/src/arrow/levels.rs
parquet/src/arrow/converter.rs
... and 16 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@tustvold (Contributor)

I plan to run these benchmarks later today on a server-class machine and confirm there is no performance delta

@tustvold (Contributor)

On an Intel Cascade Lake Xeon(R) CPU @ 3.10GHz, specifically a GCP c2-standard-16:

Nightly with defaults

buffer_bit_ops and      time:   [419.91 ns 420.01 ns 420.12 ns]
buffer_bit_ops or       time:   [550.34 ns 550.43 ns 550.54 ns]

Nightly with simd

buffer_bit_ops and      time:   [252.84 ns 253.71 ns 254.75 ns]
buffer_bit_ops or       time:   [276.70 ns 276.77 ns 276.85 ns]

Nightly with avx512

buffer_bit_ops and      time:   [338.87 ns 339.09 ns 339.37 ns]
buffer_bit_ops or       time:   [365.72 ns 365.78 ns 365.86 ns]

Nightly with defaults and RUSTFLAGS="-Ctarget-cpu=native"

buffer_bit_ops and      time:   [177.27 ns 177.32 ns 177.38 ns]                               
buffer_bit_ops or       time:   [290.42 ns 290.47 ns 290.52 ns]     

Nightly with simd and RUSTFLAGS="-Ctarget-cpu=native"

buffer_bit_ops and      time:   [199.39 ns 199.42 ns 199.45 ns]                               
buffer_bit_ops or       time:   [227.88 ns 227.93 ns 227.98 ns]

Nightly with avx512 and RUSTFLAGS="-Ctarget-cpu=native"

buffer_bit_ops and      time:   [199.58 ns 199.64 ns 199.73 ns]                               
buffer_bit_ops or       time:   [229.27 ns 229.30 ns 229.34 ns]  

Nightly with defaults and RUSTFLAGS="-Ctarget-cpu=native -Ctarget-feature=-prefer-256-bit"

buffer_bit_ops and      time:   [166.14 ns 166.19 ns 166.26 ns]
buffer_bit_ops or       time:   [208.24 ns 208.30 ns 208.36 ns]

Nightly with simd and RUSTFLAGS="-Ctarget-cpu=native -Ctarget-feature=-prefer-256-bit"

buffer_bit_ops and      time:   [197.55 ns 197.58 ns 197.60 ns]
buffer_bit_ops or       time:   [223.72 ns 223.79 ns 223.86 ns]

Nightly with avx512 and RUSTFLAGS="-Ctarget-cpu=native -Ctarget-feature=-prefer-256-bit"

buffer_bit_ops and      time:   [200.34 ns 200.38 ns 200.41 ns]
buffer_bit_ops or       time:   [328.80 ns 328.84 ns 328.89 ns]

Stable with defaults and RUSTFLAGS="-Ctarget-cpu=native"

buffer_bit_ops and      time:   [178.72 ns 178.77 ns 178.82 ns]                               
buffer_bit_ops or       time:   [294.65 ns 294.69 ns 294.74 ns]      

Stable with defaults and RUSTFLAGS="-Ctarget-cpu=native -Ctarget-feature=-prefer-256-bit"

buffer_bit_ops and      time:   [176.34 ns 176.82 ns 177.45 ns]                               
buffer_bit_ops or       time:   [200.99 ns 201.08 ns 201.17 ns]    

Conclusion

  • The simd feature is always faster than the avx512 feature
  • With target-cpu=native, the LLVM-autovectorized buffer_bit_ops is faster than the simd version for and, though slower for or
  • With target-cpu=native and target-feature=-prefer-256-bit, the LLVM-generated code is faster than either of the hand-rolled loops
  • Performance between stable and nightly is very similar
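
For anyone reproducing the numbers above, each configuration corresponds to an invocation along these lines (the bench target name matches the benchmark shown; the exact cargo arguments are an assumption, not quoted from this thread):

RUSTFLAGS="-Ctarget-cpu=native -Ctarget-feature=-prefer-256-bit" cargo +nightly bench --bench buffer_bit_ops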

@tustvold (Contributor) left a comment:

Perhaps we should call out the recommended RUSTFLAGS somewhere if we don't already.

@viirya (Member) left a comment:

The benchmarks look great. I ran some internal benchmarks with simd disabled and the RUSTFLAGS set, and I don't see a performance downgrade.

@viirya (Member) commented Jun 10, 2022

One question: RUSTFLAGS="-Ctarget-cpu=xxx" optimizes for a specific CPU model, while the existing simd feature does not require specifying a CPU. Compared with the existing simd feature, are the RUSTFLAGS less general?

@jhorstmann (Contributor, Author)

Thanks for reproducing the results! It's interesting that you also see the effect of the second benchmark function being slower. I don't think it's downclocking related: on my machine the frequency looked constant, and Cascade Lake shouldn't downclock for these instructions either. Pinning the process to one core also did not make a difference. But I assume this is an effect that is only visible in a microbenchmark and will not show up in larger programs.

@jhorstmann (Contributor, Author)

One question: RUSTFLAGS="-Ctarget-cpu=xxx" optimizes for a specific CPU model, while the existing simd feature does not require specifying a CPU. Compared with the existing simd feature, are the RUSTFLAGS less general?

The simd feature also benefits from specifying the target-cpu. Without target-cpu, the compiler has to produce code for a common baseline. For x86_64, that baseline includes the SSE2 instructions, which operate on 128-bit vectors, but it misses many useful features added later. My guess is that the compiler decides it is not worthwhile to use SSE2 instructions for this loop, but its heuristic seems wrong here.

There is probably some additional compiler flag that would make the standard kernel as fast as the simd kernel for the default target as well. We probably can't expect everyone to set several flags, though.
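
As a side note, one way to see which vector features a given target-cpu setting enables is to print the compiler configuration (standard rustc flags; the grep is only there to filter the output):

rustc -Ctarget-cpu=native --print cfg | grep target_feature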

@jhorstmann (Contributor, Author)

Would someone be able to also benchmark this on ARM, for example on an Apple M1 or AWS graviton, to avoid any regressions there?

I'm assuming that someone who cares about performance enough to enable the simd feature would also target a specific CPU baseline. But I would suggest getting some more feedback before merging.

@nevi-me (Contributor) commented Jun 12, 2022


Would someone be able to also benchmark this on ARM, for example on an Apple M1 or AWS graviton, to avoid any regressions there?

I'm assuming that someone who cares about performance enough to enable the simd feature would also target a specific CPU baseline. But I would suggest getting some more feedback before merging.

I'll run a benchmark and report back in an hour or two.

@nevi-me (Contributor) commented Jun 12, 2022

M1 Pro

  • nosimd = nightly master with RUSTFLAGS="-Ctarget-cpu=native"
  • simd = nightly master with the above flags and --features simd
  • pr1830 = nightly on this PR with the above flags
  • stable = stable on this PR with target-cpu=native (didn't check master as there should be no difference)
  • stablenoflags = stable on this PR with no flags
group                                nosimd                                 pr1830                                 simd                                   stable                                 stablenoflags
-----                                ------                                 ------                                 ----                                   ------                                 -------------
buffer_binary_ops/and                1.01    153.2±0.97ns    93.4 GB/sec    1.03    157.6±8.65ns    90.8 GB/sec    1.25    189.8±1.68ns    75.4 GB/sec    1.03   156.6±21.14ns    91.4 GB/sec    1.00    152.4±1.84ns    93.9 GB/sec
buffer_binary_ops/and_with_offset    1.02    536.2±4.21ns    26.7 GB/sec    1.02    533.9±2.03ns    26.8 GB/sec    1.03    541.7±4.97ns    26.4 GB/sec    1.00    524.5±2.23ns    27.3 GB/sec    1.01    527.3±1.98ns    27.1 GB/sec
buffer_binary_ops/or                 1.04    156.3±7.65ns    91.5 GB/sec    1.02    154.2±0.76ns    92.8 GB/sec    1.31   197.4±47.84ns    72.5 GB/sec    1.00    150.7±2.41ns    94.9 GB/sec    1.01    151.5±1.14ns    94.4 GB/sec
buffer_binary_ops/or_with_offset     1.02    539.7±4.22ns    26.5 GB/sec    1.02    539.1±2.84ns    26.5 GB/sec    1.04    549.1±4.72ns    26.1 GB/sec    1.00    530.2±3.16ns    27.0 GB/sec    1.01    532.8±6.84ns    26.8 GB/sec
buffer_unary_ops/not                 1.17   197.0±43.99ns    48.4 GB/sec    1.09    183.2±3.57ns    52.1 GB/sec    1.00    168.7±3.64ns    56.5 GB/sec    1.08    182.1±4.18ns    52.4 GB/sec    1.08    181.7±3.15ns    52.5 GB/sec
buffer_unary_ops/not_with_offset     1.00    363.8±6.18ns    26.2 GB/sec    1.02    369.1±2.30ns    25.8 GB/sec    1.02    368.6±1.09ns    25.9 GB/sec    1.02   368.8±23.20ns    25.9 GB/sec    1.00    362.6±1.81ns    26.3 GB/sec

A bit hard to interpret the results, but what I can see is that stable with a CPU target flag isn't much better than without. That could make sense, as there shouldn't be (m)any differences between the ARMv8 CPUs that are supported, unlike x64 where there are all sorts of SIMD extensions.

Of the 6 results, 5 are fastest on stable, but not by a large margin compared to the other options, unlike what we saw with x64 in @tustvold's results.

Seeing as there's no regression, I'll merge this. Thanks @jhorstmann!

@nevi-me nevi-me merged commit fb697ce into apache:master Jun 12, 2022
The following lines are from the removed simd kernel; @jhorstmann (Contributor, Author) commented on them:

let mut result = MutableBuffer::new(len).with_bitset(len, false);
let lanes = u8x64::lanes();

let mut left_chunks = left.as_slice()[left_offset..].chunks_exact(lanes);

This was actually also buggy and should have been sliced using [left_offset..left_offset + len]. The effect was that if the buffer was larger than len, the remainders would not line up.
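
For illustration, a minimal sketch of the corrected slicing (the right-hand side is an assumed symmetric counterpart; names follow the snippet above):

// Bound each slice to exactly `len` bytes starting at its offset, so the
// `chunks_exact` remainders of `left` and `right` cover the same logical range.
let mut left_chunks =
    left.as_slice()[left_offset..left_offset + len].chunks_exact(lanes);
let mut right_chunks =
    right.as_slice()[right_offset..right_offset + len].chunks_exact(lanes);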

Linked issue #1829: AVX512 + simd binary and/or kernels slower than autovectorized version