ENH: sparse: single-pass CSR matrix binops #19765
Conversation
The CSR elementwise binary operators have two implementations: one for canonical format and another for general formats. This PR merges the format check into the canonical-format implementation: the canonical implementation now executes optimistically, falling back to the general one if an operand turns out not to be canonical. This eliminates a full memory scan of each operand, which speeds up this memory-bound operation.
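The single-pass idea can be sketched in Python. This is only a model: the actual implementation is C++ in `scipy/sparse/sparsetools`, all names below are hypothetical, and the sketch operates on one row's `(index, value)` pairs for clarity.

```python
# Hypothetical Python sketch of the optimistic single-pass CSR row binop.
# Canonical form: indices strictly increasing (sorted, no duplicates).

def canonicalize(row):
    """General-path helper: sum duplicates, then sort by index."""
    acc = {}
    for idx, val in row:
        acc[idx] = acc.get(idx, 0) + val
    return sorted(acc.items())

def binop_row_canonical(a, b, op):
    """Merge two rows assuming both are canonical. Returns None as soon
    as an out-of-order or duplicate index is seen, signalling fallback."""
    out = []
    i = j = 0
    prev_a = prev_b = -1
    inf = float('inf')
    while i < len(a) or j < len(b):
        ai = a[i][0] if i < len(a) else inf
        bj = b[j][0] if j < len(b) else inf
        # Canonicality check fused into the merge itself: indices must
        # be strictly increasing within each operand.
        if ai != inf and ai <= prev_a:
            return None
        if bj != inf and bj <= prev_b:
            return None
        if ai < bj:
            out.append((ai, op(a[i][1], 0))); prev_a = ai; i += 1
        elif bj < ai:
            out.append((bj, op(0, b[j][1]))); prev_b = bj; j += 1
        else:
            out.append((ai, op(a[i][1], b[j][1])))
            prev_a, prev_b = ai, bj; i += 1; j += 1
    return out

def binop_row(a, b, op):
    # Optimistic fast path; only if it detects a non-canonical operand
    # do we pay for canonicalization and a second merge.
    result = binop_row_canonical(a, b, op)
    if result is None:
        result = binop_row_canonical(canonicalize(a), canonicalize(b), op)
    return result
```

When both operands are canonical (the common case), the fast path is the only pass over memory; the old two-phase approach paid an extra check scan first.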
lgtm
Thanks @alugowski! First benchmark run (same as on gh-19766) seems to show that things are indeed significantly faster, except for the DIA format.
These benchmarks need a bit of care, because the result now depends strongly on whether the operands are in canonical form. If we insert something like this in the benchmark setup code to ensure both matrices are in canonical form:

```python
if hasattr(x, 'sum_duplicates'):
    x.sum_duplicates()
if hasattr(self.y, 'sum_duplicates'):
    self.y.sum_duplicates()
```

we get:
I'm not yet sure what to make of all this. Maybe you can run the same asv benchmark, @alugowski, and see how it looks for you?
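For anyone reproducing this: `sum_duplicates()` is what puts a SciPy CSR matrix into canonical form (sorted column indices, no duplicate entries). A minimal illustration:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Build a CSR matrix directly from raw arrays, with unsorted indices and
# a duplicate entry in row 0, so it is NOT in canonical form.
indptr = np.array([0, 3, 3])
indices = np.array([2, 0, 2])          # unsorted, with column 2 duplicated
data = np.array([1.0, 2.0, 3.0])
x = csr_matrix((data, indices, indptr), shape=(2, 3))

assert not x.has_canonical_format
x.sum_duplicates()                      # sorts indices, merges duplicates
assert x.has_canonical_format
assert x.nnz == 2                       # the two column-2 entries merged
assert np.array_equal(x.toarray(), [[2.0, 0.0, 4.0], [0.0, 0.0, 0.0]])
```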
Repeating the benchmark script on a macOS arm64 (M1) machine, I do see results similar to those in the PR description:

It looks like the trade-off here depends pretty strongly on CPU and memory architecture.
I just pushed a commit to clean up the implementation a bit: I removed some extra variables and ensured that the ones that remain are always initialized. I haven't run the benchmarks locally yet, but I definitely want to understand why we'd see regressions.
ASV benchmarks without the changes to ensure canonical format ahead of time:
The benchmark changes for DIA are almost certainly noise, because adding two DIA-format matrices together doesn't involve conversion to CSR or CSC, and doesn't use sparsetools. EDIT: I was mistaken here.
Here are my ASV results after removing the `sum_duplicates()` calls from the benchmark setup:
Looks like the CSR-AB-multiply benchmark is the only real regression, so I'll look into why that's happening next.
In the one remaining regression case, A has canonical format and B does not. My current hypothesis for the slowdown is that previously, we would:

1. scan A's indices to check whether it is canonical (a full scan),
2. scan B's indices to check whether it is canonical (bails out early, since B is obviously unsorted),
3. dispatch to the general implementation.

Now, we instead start the speculative canonical merge over both operands at once, and only fall back to the general implementation once a non-canonical index is encountered.
The old checks (1) and (2) are actually rather fast, because A.nnz < B.nnz and we bail out of the B check very quickly, since its indices are very obviously unsorted. They also have good cache locality, as we're iterating linearly through one set of indices at a time. The new codepath does more work before it can bail out, and it has interleaved accesses to both sets of indices and data arrays, which is a lot less cache-friendly. But why does this appear as a regression for elementwise multiply but not for addition? Maybe because the extra multiplications performed during the speculative canonical-mode call are more costly?
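To make the access-pattern argument concrete, here is a rough Python model of the two strategies (hypothetical names; index arrays only, whereas the real merge also touches the data arrays):

```python
# Hypothetical model contrasting the two codepaths' memory access patterns.

def checks_then_dispatch(A_idx, B_idx):
    """Old path: two separate, cache-friendly linear scans.
    all() short-circuits, so an obviously-unsorted B bails immediately."""
    a_sorted = all(A_idx[i] < A_idx[i + 1] for i in range(len(A_idx) - 1))
    b_sorted = all(B_idx[i] < B_idx[i + 1] for i in range(len(B_idx) - 1))
    return 'canonical' if (a_sorted and b_sorted) else 'general'

def speculative_merge(A_idx, B_idx):
    """New path: interleaved walk over both index arrays. More steps may
    run before a non-canonical operand is detected. Returns which
    implementation ends up handling the operands, and the step count."""
    i = j = steps = 0
    prev_a = prev_b = -1
    while i < len(A_idx) and j < len(B_idx):
        steps += 1
        if A_idx[i] <= prev_a or B_idx[j] <= prev_b:
            return 'general', steps          # bail out mid-merge
        if A_idx[i] < B_idx[j]:
            prev_a = A_idx[i]; i += 1
        elif B_idx[j] < A_idx[i]:
            prev_b = B_idx[j]; j += 1
        else:
            prev_a, prev_b = A_idx[i], B_idx[j]; i += 1; j += 1
    while j < len(B_idx):                    # drain remaining B entries
        steps += 1
        if B_idx[j] <= prev_b:
            return 'general', steps
        prev_b = B_idx[j]; j += 1
    while i < len(A_idx):                    # drain remaining A entries
        steps += 1
        if A_idx[i] <= prev_a:
            return 'general', steps
        prev_a = A_idx[i]; i += 1
    return 'canonical', steps
```

With a canonical A and an unsorted B, the old path's B scan fails on its first comparison, while the speculative merge may walk well into A (doing binop work along the way) before B's disorder is reached.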
One other observation: the benchmarking in the PR description only uses operands in canonical format. I would expect less of a speedup (or even a slowdown) if we measured with non-canonical matrices.
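One way to construct a non-canonical operand for such a measurement (a sketch; SciPy leaves indices unsorted when you modify the raw CSR arrays in place):

```python
import numpy as np
from scipy.sparse import random as sparse_random

rng = np.random.default_rng(0)
x = sparse_random(200, 200, density=0.05, format='csr', random_state=rng)
x.sum_duplicates()                 # canonical operand

# De-canonicalize a copy by reversing each row's column indices in place
# (reversing data too, so the matrix's values are unchanged).
y = x.copy()
for r in range(y.shape[0]):
    start, end = y.indptr[r], y.indptr[r + 1]
    y.indices[start:end] = y.indices[start:end][::-1]
    y.data[start:end] = y.data[start:end][::-1]
y.has_sorted_indices = False       # tell scipy the indices are now unsorted

z = x + y                          # exercises the non-canonical codepath
assert np.allclose(z.toarray(), 2 * x.toarray())
```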
Reference issue
No issue, saw this optimization opportunity while working in this file.
What does this implement/fix?
The CSR elementwise binary operators have two implementations: one for canonical form (no duplicate elements, indices in sorted order) and another for the general form.

The previous behavior first checked both operands and then selected the appropriate implementation. That check requires a full scan of both operands. The canonical-form binop itself is also a single scan of both operands, so the format check adds an extra scan to what is a memory-bandwidth-bound operation.
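That format check is itself a full linear scan of the index arrays. In Python terms, the idea is roughly the following (a hypothetical rendering; the real check is `csr_has_canonical_format` in C++, which also validates `indptr`):

```python
def has_canonical_format(n_row, indptr, indices):
    # Canonical CSR: within every row, column indices are strictly
    # increasing (sorted, no duplicates). Verifying this touches every
    # index once -- a full memory scan before any actual binop work.
    for row in range(n_row):
        for k in range(indptr[row] + 1, indptr[row + 1]):
            if indices[k - 1] >= indices[k]:
                return False
    return True
```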
This PR merges the format check with the canonical-form implementation; in other words, it merges `csr_has_canonical_format()` into `csr_binop_csr_canonical()`.

The new behavior is to optimistically execute the canonical implementation, with a fallback to the general one if the operands are not canonical, eliminating one full memory scan of both operands.
Impact
Expect a 25% to 45% speedup on large matrices.
See the attached synthetic `bench_sparse_add.py` script, which benchmarks elementwise add runtime on random and Poisson 2D matrices of increasing nnz. Try it yourself:

```
python dev.py python bench_sparse_add.py
```
bench_sparse_add.py.txt