AVX512 optimized filter kernels for primitive types #1949

Open
jhorstmann opened this issue Jun 26, 2022 · 0 comments
Labels
enhancement Any new improvement worthy of an entry in the changelog

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

In #1829 we removed the AVX512 optimizations for the AND/OR kernels since the autovectorized code was just as good. However, some AVX512 instructions could bring a big benefit and cannot be used by the compiler automatically. One of them is the compressstore instruction, which implements most of the filter kernel in a single instruction.
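To make the semantics concrete, here is a scalar model of what a 16-lane, 32-bit compress-store does. It is only a sketch for illustration: `compress_store_model` and `filter_i32_model` are hypothetical names, not functions from this crate, and the real kernel would use the hardware instruction instead of the inner loop. The sketch also assumes the input length is a multiple of 16.

```rust
/// Scalar model of a 16-lane compress-store (hypothetical helper):
/// copies the lanes of `chunk` whose mask bit is set to the front of `out`,
/// preserving their order, and returns how many values were written.
fn compress_store_model(chunk: &[i32], mask: u16, out: &mut [i32]) -> usize {
    let mut written = 0;
    for (lane, &value) in chunk.iter().enumerate().take(16) {
        if mask & (1 << lane) != 0 {
            out[written] = value;
            written += 1;
        }
    }
    written
}

/// Filters `values` with one `u16` mask word per 16 values (hypothetical helper).
/// Assumes `values.len()` is a multiple of 16 to keep the sketch short.
fn filter_i32_model(values: &[i32], mask_words: &[u16]) -> Vec<i32> {
    assert_eq!(values.len(), mask_words.len() * 16);
    // The output length is the total number of set mask bits.
    let selected: usize = mask_words.iter().map(|m| m.count_ones() as usize).sum();
    let mut out = vec![0i32; selected];
    let mut offset = 0;
    for (chunk, &mask) in values.chunks_exact(16).zip(mask_words) {
        // With AVX512 this call would be a single masked store instruction.
        offset += compress_store_model(chunk, mask, &mut out[offset..]);
    }
    out
}
```

The work per chunk does not depend on how many bits are set in the mask, which is consistent with the AVX512 timings below being nearly flat across selectivities.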

I recently experimented with these instructions and found that our current filters are very fast at extreme selectivities, thanks to all the optimizations that @tustvold did, but for selectivities between 5% and 99% the AVX512 version would be faster. For a random selectivity of 50% it is nearly 10x faster.

Describe the solution you'd like

There are a few open questions about how best to integrate these functions into the filter kernels. They don't fit that well into the existing strategies, since they would be specific to primitive arrays, and there might be different selectivity cutoffs for falling back to one of the existing strategies.
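One possible shape for such a cutoff is sketched below. Everything here is an assumption for illustration: the `FilterStrategy` enum and `choose_strategy` function are hypothetical and do not mirror the crate's existing strategy types, and the 5% and 99% thresholds are simply the range quoted in this issue, not tuned values.

```rust
/// Hypothetical strategy selector; not the crate's existing strategy enum.
#[derive(Debug, PartialEq)]
enum FilterStrategy {
    /// Stand-in for whichever existing strategy handles this filter today.
    Existing,
    /// The compress-store based kernel, only applicable to primitive arrays.
    Avx512Compress,
}

fn choose_strategy(
    selected: usize,
    total: usize,
    is_primitive: bool,
    has_avx512: bool,
) -> FilterStrategy {
    // Assumed cutoffs, taken from the 5%..99% range mentioned in this issue.
    const LOW: f64 = 0.05;
    const HIGH: f64 = 0.99;

    if !is_primitive || !has_avx512 || total == 0 {
        return FilterStrategy::Existing;
    }
    let selectivity = selected as f64 / total as f64;
    if selectivity >= LOW && selectivity <= HIGH {
        FilterStrategy::Avx512Compress
    } else {
        // Extreme selectivities stay on the existing optimized paths.
        FilterStrategy::Existing
    }
}
```

The real cutoffs would have to come from benchmarks on the target hardware, and they may differ per element width.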

We would also need to decide whether to statically dispatch to these kernels, based on target-cpu or target-feature, or use runtime feature detection.
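The two options could look roughly like this. It is a minimal sketch: the function names are hypothetical, and only the `cfg!` and `is_x86_feature_detected!` macros come from the standard library.

```rust
/// Static dispatch: true only if the crate was compiled with AVX512F enabled,
/// for example via `-C target-cpu=native` or `-C target-feature=+avx512f`.
fn avx512f_enabled_at_compile_time() -> bool {
    cfg!(target_feature = "avx512f")
}

/// Runtime dispatch: queries the CPU the program is actually running on.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn avx512f_detected_at_runtime() -> bool {
    std::is_x86_feature_detected!("avx512f")
}

/// On other architectures the AVX512 kernels can never be used.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn avx512f_detected_at_runtime() -> bool {
    false
}

/// Hypothetical entry point a filter kernel could consult before choosing
/// between the AVX512 path and the existing implementation.
fn use_avx512_filter() -> bool {
    avx512f_enabled_at_compile_time() || avx512f_detected_at_runtime()
}
```

Static dispatch keeps the check out of the binary entirely but requires users to build for a specific CPU. Runtime detection works with default builds, at the cost of a feature check per kernel call.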

The 8-bit and 16-bit versions of these instructions (part of AVX512_VBMI2) are also only available since the Ice Lake generation, which makes testing a bit more difficult.

Describe alternatives you've considered

There is a discussion in the portable-simd project about portable alternatives to these instructions, but that would require quite some work in LLVM, since there are no portable LLVM intrinsics yet, only the x86/AVX512 implementations.

Additional context

Benchmark results for filtering i32 on a Tiger Lake machine running at 3 GHz:

Gnuplot not found, using plotters backend
filter i32 (kept 50%)   time:   [55.624 us 55.657 us 55.699 us]                                  

filter i32 high selectivity (kept 95%)                                                                             
                        time:   [18.635 us 18.650 us 18.671 us]

filter i32 low selectivity (kept 5%)                                                                             
                        time:   [5.6434 us 5.6778 us 5.7203 us]

filter i32 avx512 (kept 50%)                                                                             
                        time:   [6.0487 us 6.0529 us 6.0579 us]

filter i32 avx512 high selectivity (kept 95%)                                                                             
                        time:   [6.2818 us 6.2847 us 6.2879 us]

filter i32 avx512 low selectivity (kept 5%)                                                                             
                        time:   [5.4591 us 5.4618 us 5.4651 us]
