Releases · intel/x86-simd-sort

12 Feb 19:53

5b5884c

v5.0 Latest

Latest

Release 5.0 adds new API's to support sorting arrays of custom defined objects and sorting key-value pairs of arrays. We also now have AVX2 acceleration for argsort and argselect methods. Here is a gist of the new features:

x86-simd-sort now supports an API to sort custom defined C++ objects via object_qsort. Please refer to README file on how to use it. Its performance can vary depending on the definition of the custom class. For the sake of illustration, sorting a simple struct based on one of its members can be up-to 4-5x faster than using std::sort on machines with AVX-512. Refer to the perf section for more details.
New API keyvalue_qsort to sort a pair of arrays representing key-value pairs (32-bit and 64-bit data types). These are an order of magnitude faster when compared to scalar ways of sorting them.
AVX2 support for argsort and argselect methods. These have been merged into NumPy and will be available with NumPy v2.0.

What's Changed

README.md: fix broken link by @rouault in #98
Improve emulation of AVX2 min/max 64-bit by @r-devulap in #99
fix numpy CI failures by @r-devulap in #100
Add key-value sort to runtime dispatch by @r-devulap in #105
Avoid masks when possible in AVX2 logic by @sterrettm2 in #104
Add more benchmarks by @r-devulap in #106
Support key-value sort for 32-bit dtypes by @r-devulap in #108
Add API to sort array of custom objects by @r-devulap in #103
Mark explicit template specializations inline by @r-devulap in #112
BUG: bug fix in avx512_qsort_fp16 by @r-devulap in #113
CI: build and test on 32-bit linux by @r-devulap in #114
Support for AVX2 argsort/argselect/key-value sort by @sterrettm2 in #110
Changes argsort/argselect to use generic networks by @sterrettm2 in #102
Improve key-value sort performance by @r-devulap in #120
Add IPP sort to benchmarks by @r-devulap in #121
Build fix on macOS 64-bit by @r-devulap in #122
Fix more build issues on macOS by @r-devulap in #123
Add avx2_vector defintion for size_t on macOS by @r-devulap in #124
[fix] update link in README.md by @icfaust in #125
Use uint32_t instead of size_t for object sort by @r-devulap in #126
Add simple test for objsort by @r-devulap in #128
Update README with object_qsort by @r-devulap in #130

New Contributors

@rouault made their first contribution in #98
@icfaust made their first contribution in #125

Full Changelog: v4.0...v5.0

Contributors

rouault, icfaust, and 2 other contributors

Assets 2

31 Oct 16:17

r-devulap

v4.0

7559f70

v4.0

v4.0 is a significant release with new features and improvements. AVX-512 sorting methods gain up to 2x perf improvements and we have added AVX2 sorting methods to support a wider range of x86 processors. In additional to using it as a header file library, x86-simd-sort can be installed as a library, and it provides API access to various sorting methods with automatic runtime dispatch to select the fastest version based on the processor. Here is a quick summary of all the changes:

Added AVX2 implementations of avx2_qsort, avx2_qselect and avx2_partial_qsort for 32-bit and 64-bit data types. When compared to std::sort, these are up to 12x faster for 32-bit data and up to 7x faster for 64-bit data.
x86-simd-sort can now be built and installed as a shared library. The library provides runtime dispatch and automatically picks the fastest version among AVX-512/AVX2/scalar depending on the processor it is run on. Starting with clearlinux v40270, you can install x86-simd-sort with swupd bundle-add x86-simd-sort.
Perf improvements to avx512_qsort: 2x speed up for 32-bit data, 1.5x speed up for 64-bit data and 1.25x speed up for 16-bit data.
Perf improvements to avx512_argsort and avx512_argselect intended to mitigate the effect of a vulnerability in gather instruction.

What's Changed

Changes quicksort and quickselect to use template based sorting networks by @sterrettm2 in #61
Correct documentation on NaN behavior by @zbjornson in #73
update CI that builds NumPy by @r-devulap in #79
Fix MSVC build error by @sterrettm2 in #76
Use scalar emulation of gather instruction for arg methods by @r-devulap in #65
Bug fix in avx512_qselect_fp16 by @r-devulap in #80
Build shared library with runtime ISA dispatch by @r-devulap in #74
CI: Do not install google benchmarks and use different compiler versions by @r-devulap in #82
Add option to compare benchmarks across branches by @r-devulap in #84
Use env var to disable/enable avx512 dispatch by @r-devulap in #85
Various performance improvements by @sterrettm2 in #83
Fix NumPy CI failures by @r-devulap in #86
CI: numpy/core is renamed to numpy/_core by @r-devulap in #87
Enable argsort and argselect methods on 32-bit platforms by @r-devulap in #88
Disable prefetch on clang-cl WIN32 by @r-devulap in #89
Explicitly initialize with zmm_vector<uint64_t> for arg methods on macOS by @r-devulap in #90
Reorganize code and add examples to build by @r-devulap in #91
Adds support for AVX2 for 32-bit types for quicksort and quickselect by @sterrettm2 in #60
Add hasnan = false to all the sort methods by @r-devulap in #92
Provide a way to install x86simdsort as a library by @r-devulap in #94
AVX2 64-bit support by @sterrettm2 in #93
Add soversion to the lib by @r-devulap in #95

New Contributors

@sterrettm2 made their first contribution in #61
@zbjornson made their first contribution in #73

Full Changelog: v3.0...4.0.rc

Contributors

zbjornson, r-devulap, and sterrettm2

Assets 2

06 Oct 20:49

r-devulap

v3.0

b9f9340

v3.0

Version 3.0 release contains a new supported method avx512_argselect to compute arg nth_element (also known as argpartition in NumPy). It returns an array of indices that would partition the data array. Highlights of this release include:

v3.0 x86-simd-sort is merged into NumPy main branch. It provides AVX-512 vectorized versions of np.partition and np.argpartition . It speeds up np.partition up by up to 25x for 16-bit, 17x for 32-bit dtypes and about 8x speed up for 64-bit dtypes. Speeds up for np.argpartition are up-to 6.5x.
A slightly modified version of x86-simd-sort has now been merged into OpenJDK . It speeds up sorting 32-bit and 64-bit data by up to 15x and 7x respectively.

What's Changed

Fix typo in README by @r-devulap in #50
Update workflow for changes in benchmark tooling by @mosullivan93 in #54
Further Makefile updates by @mosullivan93 in #52
Add avx512_argselect for 32-bit and 64-bit dtypes by @r-devulap in #56
Use __builtin_cpu_supports instead of cpuinfo by @r-devulap in #58
MAINT: Remove template specializations for quicksort methods by @r-devulap in #59
Add benchmarks for small arrays by @r-devulap in #62
Improvement to benchmarking scripts by @r-devulap in #66
Bug fix in benchmark script by @r-devulap in #67
Use global Macros for GCC specific keywords by @r-devulap in #68
Fix compiler warnings in src and tests by @r-devulap in #69
Fix more compiler warnings by @r-devulap in #70

Full Changelog: v2.0...v3.0

Contributors

mosullivan93 and r-devulap

Assets 2

23 Jun 19:59

r-devulap

v2.0

587af58

v2.0

Release v2.0 adds a host of new sorting algorithms and supported API's. We also welcome new members to the community who contributed heavily to this release and we are deeply grateful for their efforts. We’d like to acknowledge Mitchell J. O'Sullivan @mosullivan93 and @ruclz /Xi Tang @xtangxtang for adding partial sorting methods and key-value sorting methods.. Here are the main highlights of this release:

New features:

AVX-512 based argsort algorithms using O(1) space for 32-bit and 64-bit data types [1]. It returns the indices that would sort an array. These are up to 6x faster when compared to a scalar solution that uses std::sort. This new feature is leveraged by NumPy in np.argsort [2] in its latest release v1.25.
AVX-512 based quick select algorithm [3] for 16-bit, 32-bit and 64-bit data types. These are equivalent to std::nthelement but performs a lot faster. The performance depends on the ratio K/N (where K is the index of the element the array is partitioned around and N is the array size). For smaller values of K, it is up to 6x faster for 64-bit data, up to 15x faster for 32-bit data and up to 7x faster for 16-bit data. The performance gets better as K gets larger (see #13 (comment) for more details).
AVX-512 based partial sort algorithm [3] for 16-bit, 32-bit and 64-bit data types. These are equivalent to std::partial_sort. As with quick select, its performance depends on the ratio K/N (where K is the size of partial sorted array and N is the array size) and it tends to perform an order of magnitude faster for larger values. It is about 1.05x faster for tiny partial array sort and up to 20x faster for slightly larger partial arrays. See this #49 (comment) for detailed performance numbers.
AVX-512 based key-value sort [4]. This sorts a pair of key-value arrays and is currently supported only for 64-bit data types. A version of this has been contributed to Oceanbase, an open source distributed relational database. See [8].
AVX-512 sort for _Float16 data type [5] using AVX-512 FP16 ISA. In NumPy, these are nearly 3x faster than AVX-512 based sort that emulates float16 data type [7] .

Improvements:

Improvements to AVX-512 quicksort [6] further speeds up sorting 64 and 32-bit data types by up to 1.6x and 1.3x respectively.

New Contributors

@ruclz made their first contribution in #2
@bkmgit made their first contribution in #21
@mosullivan93 made their first contribution in #13

Pull request references:

[1] Add AVX-512 argsort for 32 and 64-bit data type by @r-devulap in #34
[2] numpy/numpy#23707
[3] Implement partial sorting algorithms by @mosullivan93 in #13
[4] add key-value file! by @ruclz in #2
[5] Add qsort for _Float16 using AVX-512 FP16 ISA by @r-devulap in #16
[6] Improve qsort by @r-devulap in #33
[7] numpy/numpy#23435
[8] oceanbase/oceanbase#1325

Full Changelog: v1.0...v2.0

Contributors

bkmgit, xtangxtang, and 3 other contributors

Assets 2

07 Mar 22:46

r-devulap

v1.0

7d7591c

v1.0

First release of x86-simd-sort. This is the version that was merged into NumPy main branch. See numpy/numpy#22315 for more details. Supported features:

AVX-512 quicksort for the following dtypes: float16, uint16_t, int16_t, float, uint32_t, int32_t, double, uint64_t, int64_t.

Assets 2

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

What's Changed

New Contributors

Contributors

What's Changed

New Contributors

Contributors

What's Changed

Contributors

New features:

Improvements:

New Contributors

Pull request references:

Contributors

Releases: intel/x86-simd-sort

v5.0

What's Changed

New Contributors

Contributors

v4.0

What's Changed

New Contributors

Contributors

v3.0

What's Changed

Contributors

v2.0

New features:

Improvements:

New Contributors

Pull request references:

Contributors

v1.0