ENH: sparse: single-pass CSR matrix binops #19765

Open · alugowski wants to merge 3 commits into main
Conversation

@alugowski (Contributor)

Reference issue

No issue, saw this optimization opportunity while working in this file.

What does this implement/fix?

The CSR elementwise binary operators have two implementations: one for canonical form (no duplicate entries, column indices in sorted order) and another for the general case.

The previous behavior first checks both operands and then selects the appropriate implementation. However, this check requires a full scan of both operands. The canonical-form binop itself is also a single scan of both operands, so the form check adds an extra scan, and binops are memory-bandwidth-bound operations.

This PR merges the format check with the canonical form implementation. In other words, merges csr_has_canonical_format() into csr_binop_csr_canonical().

The new behavior optimistically executes the canonical implementation and falls back to the general one if the operands turn out not to be canonical, eliminating one full memory scan of both operands.
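To make the strategy concrete, below is a minimal Python sketch of the idea. It is illustrative only: the real change lives in the C++ sparsetools routines (csr_binop_csr_canonical and csr_binop_csr_general), and the function names and densifying fall-back here are stand-ins, not the scipy implementation.

```python
import numpy as np
import scipy.sparse as sp


def binop_csr_assuming_canonical(A, B, op):
    """Single-pass elementwise binop over two CSR operands, assuming canonical
    form (sorted column indices, no duplicates per row). Returns None the moment
    either operand is found to violate that assumption."""
    n_rows, n_cols = A.shape
    indptr = np.zeros(n_rows + 1, dtype=np.int64)
    indices, data = [], []
    for i in range(n_rows):
        ja, a_end = A.indptr[i], A.indptr[i + 1]
        jb, b_end = B.indptr[i], B.indptr[i + 1]
        last_a = last_b = -1
        while ja < a_end or jb < b_end:
            ca = A.indices[ja] if ja < a_end else n_cols
            cb = B.indices[jb] if jb < b_end else n_cols
            # Canonical-form check folded into the merge itself: column indices
            # must be strictly increasing within each row of each operand.
            if (ja < a_end and ca <= last_a) or (jb < b_end and cb <= last_b):
                return None
            if ca < cb:
                col, val, last_a, ja = ca, op(A.data[ja], 0), ca, ja + 1
            elif cb < ca:
                col, val, last_b, jb = cb, op(0, B.data[jb]), cb, jb + 1
            else:
                col = ca
                val = op(A.data[ja], B.data[jb])
                last_a, last_b, ja, jb = ca, cb, ja + 1, jb + 1
            if val != 0:
                indices.append(col)
                data.append(val)
        indptr[i + 1] = len(indices)
    return sp.csr_array((data, indices, indptr), shape=A.shape)


def elementwise_binop(A, B, op=np.add):
    # Optimistic fast path; fall back to a general stand-in (here: densify)
    # only if an operand turned out not to be canonical.
    result = binop_csr_assuming_canonical(A, B, op)
    if result is None:
        result = sp.csr_array(op(A.toarray(), B.toarray()))
    return result
```

Calling elementwise_binop(X, Y, np.multiply) on canonical csr_array operands takes the single-pass path; hand it an operand with unsorted or duplicated column indices and it silently switches to the fall-back instead.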

Impact

Expect a 25% to 45% speedup on large matrices.

See the attached synthetic bench_sparse_add.py script, which benchmarks elementwise-add runtime on random and 2D Poisson matrices of increasing nnz:

| Benchmark | main (s) | this PR (s) | Speedup |
|---|---|---|---|
| random-50,000 | 0.00028791697695851326 | 0.00015637511387467384 | 1.84 |
| poisson-49,600 | 0.0001479999627918005 | 0.00010145897977054119 | 1.46 |
| random-500,000 | 0.001349416095763445 | 0.0009573330171406269 | 1.41 |
| poisson-498,016 | 0.0014221249148249626 | 0.0010420831385999918 | 1.36 |
| random-5,000,000 | 0.014241500059142709 | 0.011061792029067874 | 1.29 |
| poisson-4,996,000 | 0.015750708989799023 | 0.010916250059381127 | 1.44 |
| random-50,000,000 | 0.15272545884363353 | 0.11954350001178682 | 1.28 |
| poisson-49,978,572 | 0.16468741605058312 | 0.1213894160464406 | 1.36 |

Try it yourself: `python dev.py python bench_sparse_add.py`

bench_sparse_add.py.txt
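For readers who skip the attachment, a rough stand-in for this kind of benchmark is sketched below; the actual bench_sparse_add.py differs in its details (2D Poisson operands, larger sizes), so the numbers it produces are not comparable to the table above.

```python
# Rough stand-in for the attached benchmark (not the actual bench_sparse_add.py):
# time elementwise add on pairs of random CSR matrices of growing nnz.
import timeit

import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
for n, density in [(1_000, 0.05), (5_000, 0.02), (20_000, 0.01)]:
    A = sp.random(n, n, density=density, format="csr", random_state=rng)
    B = sp.random(n, n, density=density, format="csr", random_state=rng)
    t = min(timeit.repeat(lambda: A + B, number=10, repeat=5)) / 10
    print(f"random-{A.nnz + B.nnz:,}:\t{t:.6f} s per add")
```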

The CSR elementwise binary operators have two implementations: one for canonical and another for general formats. This merges the format check with the canonical format implementation. Now optimistically execute the canonical implementation with a fallback to the general if the operands are not canonical. This eliminates a full memory scan of each operand, which speeds up this memory-bound operation.
github-actions bot added the scipy.sparse, C/C++ (Items related to the internal C/C++ code base), and enhancement (A new feature or improvement) labels on Dec 27, 2023.
@VitalyChait (Contributor) left a comment:

lgtm

@rgommers (Member)

Thanks @alugowski!

First benchmark run (same as on gh-19766) seems to show that things are indeed significantly faster, except for the DIA format.

| Change   | Before [bee9b9ae] <main>   | After [f427cf01] <fastbinop>   |   Ratio | Benchmark (Parameter)                                      |
|----------|----------------------------|--------------------------------|---------|------------------------------------------------------------|
| +        | 47.1±0.7ms                 | 54.9±5ms                       |    1.17 | sparse.Arithmetic.time_arithmetic('dia', 'BB', '__sub__')  |
| +        | 67.1±1ms                   | 77.9±7ms                       |    1.16 | sparse.Arithmetic.time_arithmetic('dia', 'BB', 'multiply') |
| +        | 67.8±0.2ms                 | 77.3±7ms                       |    1.14 | sparse.Arithmetic.time_arithmetic('dia', 'BB', '__add__')  |
| -        | 72.0±7ms                   | 64.6±1ms                       |    0.9  | sparse.Arithmetic.time_arithmetic('csr', 'AB', '__sub__')  |
| -        | 162±7ms                    | 143±3ms                        |    0.88 | sparse.Arithmetic.time_arithmetic('coo', 'AB', 'multiply') |
| -        | 174±20ms                   | 145±1ms                        |    0.84 | sparse.Arithmetic.time_arithmetic('coo', 'AB', '__sub__')  |
| -        | 25.7±2ms                   | 21.1±0.09ms                    |    0.82 | sparse.Arithmetic.time_arithmetic('csr', 'AA', 'multiply') |
| -        | 23.3±2ms                   | 18.9±0.4ms                     |    0.81 | sparse.Arithmetic.time_arithmetic('csc', 'AA', '__sub__')  |
| -        | 76.3±8ms                   | 61.0±2ms                       |    0.8  | sparse.Arithmetic.time_arithmetic('csr', 'AB', 'multiply') |
| -        | 26.8±3ms                   | 18.3±0.2ms                     |    0.68 | sparse.Arithmetic.time_arithmetic('csr', 'AA', '__add__')  |

These benchmarks need a bit of care, because the result is now going to depend strongly on whether the result is in canonical form or not. Here, it turns out that A is canonical while B is not, due to it being constructed as A ** 2.
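As an aside for anyone reproducing this, canonical form can be inspected and restored explicitly. A generic illustration (not the benchmark's actual A and B):

```python
import numpy as np
import scipy.sparse as sp

# Build a CSR array whose row 0 has unsorted column indices, i.e. non-canonical.
data = np.array([1.0, 2.0, 3.0])
indices = np.array([2, 0, 1])      # unsorted within row 0
indptr = np.array([0, 3, 3])
X = sp.csr_array((data, indices, indptr), shape=(2, 3))

print(X.has_canonical_format)      # False
X.sum_duplicates()                 # sorts indices and merges duplicates in place
print(X.has_canonical_format)      # True
```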

If we insert something like this in the benchmark setup code to ensure both matrices are in canonical form:

        if hasattr(x, 'sum_duplicates'):
            x.sum_duplicates()
        if hasattr(self.y, 'sum_duplicates'):
            self.y.sum_duplicates()

we get:

| Change   | Before [bee9b9ae] <main>   | After [f427cf01] <fastbinop>   |   Ratio | Benchmark (Parameter)                                      |
|----------|----------------------------|--------------------------------|---------|------------------------------------------------------------|
| +        | 28.4±0.3ms                 | 40.1±0.7ms                     |    1.41 | sparse.Arithmetic.time_arithmetic('csr', 'BB', '__sub__')  |
| +        | 28.5±0.4ms                 | 39.1±0.3ms                     |    1.37 | sparse.Arithmetic.time_arithmetic('coo', 'BB', '__sub__')  |
| +        | 28.3±0.2ms                 | 38.9±0.3ms                     |    1.37 | sparse.Arithmetic.time_arithmetic('dia', 'BB', '__sub__')  |
| +        | 30.2±1ms                   | 39.5±0.4ms                     |    1.31 | sparse.Arithmetic.time_arithmetic('csc', 'BB', '__sub__')  |
| +        | 50.0±2ms                   | 64.6±3ms                       |    1.29 | sparse.Arithmetic.time_arithmetic('csr', 'AB', '__sub__')  |
| +        | 46.8±2ms                   | 60.0±2ms                       |    1.28 | sparse.Arithmetic.time_arithmetic('csr', 'AB', 'multiply') |
| +        | 51.4±0.1ms                 | 64.5±1ms                       |    1.26 | sparse.Arithmetic.time_arithmetic('csc', 'BA', '__sub__')  |
| +        | 51.3±3ms                   | 64.8±1ms                       |    1.26 | sparse.Arithmetic.time_arithmetic('csr', 'AB', '__add__')  |
| +        | 52.7±2ms                   | 66.0±1ms                       |    1.25 | sparse.Arithmetic.time_arithmetic('csc', 'AB', '__sub__')  |
| +        | 50.8±2ms                   | 63.7±2ms                       |    1.25 | sparse.Arithmetic.time_arithmetic('csr', 'BA', '__sub__')  |
| +        | 54.7±0.8ms                 | 67.7±0.9ms                     |    1.24 | sparse.Arithmetic.time_arithmetic('csc', 'AB', 'multiply') |
| +        | 62.5±1ms                   | 76.9±1ms                       |    1.23 | sparse.Arithmetic.time_arithmetic('coo', 'AB', '__add__')  |
| +        | 54.3±0.5ms                 | 66.8±0.9ms                     |    1.23 | sparse.Arithmetic.time_arithmetic('csc', 'BA', 'multiply') |
| +        | 48.2±2ms                   | 58.7±3ms                       |    1.22 | sparse.Arithmetic.time_arithmetic('csr', 'BA', 'multiply') |
| +        | 60.3±0.9ms                 | 73.0±2ms                       |    1.21 | sparse.Arithmetic.time_arithmetic('coo', 'AB', 'multiply') |
| +        | 52.3±3ms                   | 63.4±2ms                       |    1.21 | sparse.Arithmetic.time_arithmetic('csr', 'BA', '__add__')  |
| +        | 62.3±2ms                   | 74.6±2ms                       |    1.2  | sparse.Arithmetic.time_arithmetic('coo', 'AB', '__sub__')  |
| +        | 61.9±1ms                   | 73.6±2ms                       |    1.19 | sparse.Arithmetic.time_arithmetic('coo', 'BA', '__sub__')  |
| +        | 54.2±0.8ms                 | 64.5±1ms                       |    1.19 | sparse.Arithmetic.time_arithmetic('csc', 'AB', '__add__')  |
| +        | 54.0±0.6ms                 | 64.0±0.4ms                     |    1.19 | sparse.Arithmetic.time_arithmetic('csc', 'BA', '__add__')  |
| +        | 65.3±4ms                   | 75.8±2ms                       |    1.16 | sparse.Arithmetic.time_arithmetic('coo', 'BA', 'multiply') |
| +        | 210±6ms                    | 233±7ms                        |    1.11 | sparse.Arithmetic.time_arithmetic('dia', 'BA', '__sub__')  |
| +        | 67.4±2ms                   | 74.2±2ms                       |    1.1  | sparse.Arithmetic.time_arithmetic('coo', 'BA', '__add__')  |
| +        | 209±5ms                    | 222±8ms                        |    1.06 | sparse.Arithmetic.time_arithmetic('dia', 'AB', '__add__')  |
| +        | 213±6ms                    | 225±7ms                        |    1.06 | sparse.Arithmetic.time_arithmetic('dia', 'BA', 'multiply') |
| -        | 59.3±2ms                   | 53.9±0.6ms                     |    0.91 | sparse.Arithmetic.time_arithmetic('coo', 'BB', '__add__')  |
| -        | 60.4±0.9ms                 | 54.6±0.6ms                     |    0.9  | sparse.Arithmetic.time_arithmetic('csc', 'BB', '__add__')  |
| -        | 59.3±0.5ms                 | 53.4±0.1ms                     |    0.9  | sparse.Arithmetic.time_arithmetic('dia', 'BB', '__add__')  |
| -        | 53.1±2ms                   | 47.1±2ms                       |    0.89 | sparse.Arithmetic.time_arithmetic('coo', 'AA', '__add__')  |
| -        | 26.8±0.6ms                 | 23.8±0.2ms                     |    0.89 | sparse.Arithmetic.time_arithmetic('csc', 'AA', 'multiply') |
| -        | 26.9±0.3ms                 | 20.9±0.2ms                     |    0.78 | sparse.Arithmetic.time_arithmetic('csc', 'AA', '__add__')  |
| -        | 23.5±0.3ms                 | 18.3±0.2ms                     |    0.78 | sparse.Arithmetic.time_arithmetic('csr', 'AA', '__add__')  |

B should be in canonical form there, but the results hint that the new single-pass check for that is failing for __sub__ and multiply; at least, I can't otherwise explain why those results regress here. I'd also expect the __sub__ and __add__ results to be symmetric, so it looks like something isn't quite right yet.

There are also a number of -Wmaybe-uninitialized compiler warnings; it'd be good to address those, and perhaps that catches something.

Running the bench_sparse_add.py script from the PR description, the results improve, but not as much as in the PR description; more like in the 0% to 10% range for me:

# this PR:
random-50,000:  0.00027637499442789704
poisson-49,600:         0.00023411099391523749
random-500,000:         0.001635348002309911
poisson-498,016:        0.0017717820010147989
random-5,000,000:       0.023091215000022203
poisson-4,996,000:      0.02406818899908103
random-50,000,000:      0.22385444400424603
poisson-49,978,572:     0.23443531700468156

# main:
random-50,000:  0.00038524300907738507
poisson-49,600:         0.00021164999634493142
random-500,000:         0.00149274300201796
poisson-498,016:        0.0016441019979538396
random-5,000,000:       0.022221983992494643
poisson-4,996,000:      0.025533260995871387
random-50,000,000:      0.21116333699319512
poisson-49,978,572:     0.2499628399964422

# main run 2:
random-50,000:  0.00038719200529158115
poisson-49,600:         0.00022731799981556833
random-500,000:         0.001550414992379956
poisson-498,016:        0.0016773520037531853
random-5,000,000:       0.022551914007635787
poisson-4,996,000:      0.02617731600184925
random-50,000,000:      0.21719301299890503
poisson-49,978,572:     0.2543543719948502

# this PR, run 2:
random-50,000:  0.0002743800141615793
poisson-49,600:         0.00022587399871554226
random-500,000:         0.0016353649989468977
poisson-498,016:        0.0016516460018465295
random-5,000,000:       0.02321807699627243
poisson-4,996,000:      0.023670254988246597
random-50,000,000:      0.2263502830028301
poisson-49,978,572:     0.23783767299028113

I'm not yet sure what to make of all this. Maybe you can run the same asv benchmark @alugowski and see how that looks for you?

@rgommers (Member)

Repeating the benchmark script on a macOS arm64 (M1) machine, I do see similar results as in the PR description:

# this PR
random-50,000:  0.00017312500131083652
poisson-49,600:         0.00012250000145286322
random-500,000:         0.0010954169993055984
poisson-498,016:        0.0013311670045368373
random-5,000,000:       0.01243333300226368
poisson-4,996,000:      0.013911583999288268
random-50,000,000:      0.12745545800135005
poisson-49,978,572:     0.14511337500152877

# run 2:
random-50,000:  0.00017150000348920003
poisson-49,600:         0.0001229170011356473
random-500,000:         0.0010238329996354878
poisson-498,016:        0.001206166998599656
random-5,000,000:       0.013366833998588845
poisson-4,996,000:      0.013843332999385893
random-50,000,000:      0.12439233400073135
poisson-49,978,572:     0.1430313750024652


# main:
random-50,000:  0.00029024999821558595
poisson-49,600:         0.00013862500054528937
random-500,000:         0.0014803340018261224
poisson-498,016:        0.0016382500034524128
random-5,000,000:       0.015349457993579563
poisson-4,996,000:      0.015778542001498863
random-50,000,000:      0.15929004100325983
poisson-49,978,572:     0.16495433400268666

# run 2:
random-50,000:  0.00027474999660626054
poisson-49,600:         0.0001384170027449727
random-500,000:         0.0013510419958038256
poisson-498,016:        0.0014166249966365285
random-5,000,000:       0.016083499998785555
poisson-4,996,000:      0.01596837499528192
random-50,000,000:      0.15811200000462122
poisson-49,978,572:     0.16117887500149664

It looks like the trade-off here depends pretty strongly on CPU and memory architecture.

@perimosocordiae self-requested a review on March 4, 2024 at 19:34.
@perimosocordiae (Member)

I just pushed a commit to clean up the implementation a bit: I removed some extra variables and ensured the ones that remain are always initialized.

I haven't run the benchmarks locally yet, but I definitely want to understand why we'd see regressions for __sub__ but not __add__ here.

@perimosocordiae (Member)

ASV benchmarks without the changes to ensure canonical format ahead of time:

| Change | Before [8ce393f] | After [d9a04dc7] | Ratio | Benchmark (Parameter) |
|---|---|---|---|---|
| + | 3.48±0ms | 4.04±0ms | 1.16 | sparse.Arithmetic.time_arithmetic('dia', 'BB', 'add') |
| + | 3.58±0ms | 4.07±0ms | 1.14 | sparse.Arithmetic.time_arithmetic('dia', 'BB', 'multiply') |
| + | 4.97±0ms | 5.36±0ms | 1.08 | sparse.Arithmetic.time_arithmetic('csc', 'BA', 'add') |
| + | 13.6±0ms | 14.3±0ms | 1.05 | sparse.Arithmetic.time_arithmetic('csc', 'BA', 'mul') |
| - | 9.72±0ms | 9.19±0ms | 0.95 | sparse.Arithmetic.time_arithmetic('coo', 'AB', 'add') |
| - | 11.4±0ms | 10.7±0ms | 0.94 | sparse.Arithmetic.time_arithmetic('coo', 'AA', 'add') |
| - | 11.5±0ms | 10.7±0ms | 0.94 | sparse.Arithmetic.time_arithmetic('coo', 'AA', 'multiply') |
| - | 9.68±0ms | 9.13±0ms | 0.94 | sparse.Arithmetic.time_arithmetic('coo', 'AB', 'sub') |
| - | 6.65±0ms | 6.27±0ms | 0.94 | sparse.Arithmetic.time_arithmetic('csr', 'AA', 'mul') |
| - | 9.96±0ms | 9.30±0ms | 0.93 | sparse.Arithmetic.time_arithmetic('coo', 'BA', 'sub') |
| - | 4.61±0ms | 4.28±0ms | 0.93 | sparse.Arithmetic.time_arithmetic('csr', 'AB', 'add') |
| - | 10.4±0ms | 9.66±0ms | 0.93 | sparse.Arithmetic.time_arithmetic('dia', 'AB', 'multiply') |
| - | 17.6±0ms | 16.1±0ms | 0.91 | sparse.Arithmetic.time_arithmetic('coo', 'AA', 'mul') |
| - | 4.06±0ms | 3.64±0ms | 0.9 | sparse.Arithmetic.time_arithmetic('coo', 'BB', 'sub') |
| - | 4.70±0ms | 4.20±0ms | 0.89 | sparse.Arithmetic.time_arithmetic('csr', 'AB', 'sub') |
| - | 4.69±0ms | 4.20±0ms | 0.89 | sparse.Arithmetic.time_arithmetic('csr', 'BA', 'sub') |
| - | 3.57±0ms | 3.16±0ms | 0.89 | sparse.Arithmetic.time_arithmetic('dia', 'BB', 'sub') |
| - | 4.24±0ms | 3.69±0ms | 0.87 | sparse.Arithmetic.time_arithmetic('csc', 'BB', 'sub') |
| - | 3.63±0ms | 3.14±0ms | 0.86 | sparse.Arithmetic.time_arithmetic('csr', 'BB', 'sub') |
| - | 10.9±0ms | 9.19±0ms | 0.84 | sparse.Arithmetic.time_arithmetic('coo', 'AA', 'sub') |
| - | 2.03±0ms | 1.54±0ms | 0.76 | sparse.Arithmetic.time_arithmetic('csr', 'AA', 'multiply') |
| - | 998±0μs | 708±0μs | 0.71 | sparse.Arithmetic.time_arithmetic('csr', 'AA', 'sub') |

@perimosocordiae (Member) commented on May 15, 2024

The benchmark changes for DIA are almost certainly noise, because adding two DIA-format matrices together doesn't involve conversion to CSR or CSC, and doesn't use sparsetools.

EDIT: I was mistaken here. The B matrix is defined as A**2, which converts from DIA to CSR. The results are still noisy due to running ASV with the --quick flag set by default, though.

@perimosocordiae (Member)

Here are my ASV results after removing the --quick flag and increasing the number of rounds to 4:

| Change | Before [8ce393f] | After [d9a04dc7] | Ratio | Benchmark (Parameter) |
|---|---|---|---|---|
| + | 3.49±0.02ms | 4.08±0.06ms | 1.17 | sparse.Arithmetic.time_arithmetic('csr', 'AB', 'multiply') |
| - | 19.2±0.4ms | 18.2±0.4ms | 0.95 | sparse.Arithmetic.time_arithmetic('dia', 'AA', 'multiply') |
| - | 14.5±0.5ms | 13.7±0.4ms | 0.94 | sparse.Arithmetic.time_arithmetic('coo', 'AA', 'mul') |
| - | 6.21±0.03ms | 5.81±0.03ms | 0.94 | sparse.Arithmetic.time_arithmetic('coo', 'AB', 'add') |
| - | 1.34±0.02ms | 1.26±0.01ms | 0.94 | sparse.Arithmetic.time_arithmetic('csc', 'AA', 'sub') |
| - | 8.77±0.06ms | 8.28±0.06ms | 0.94 | sparse.Arithmetic.time_arithmetic('dia', 'AB', 'sub') |
| - | 1.00±0ms | 932±5μs | 0.93 | sparse.Arithmetic.time_arithmetic('csr', 'AA', 'add') |
| - | 3.25±0.02ms | 3.03±0.02ms | 0.93 | sparse.Arithmetic.time_arithmetic('csr', 'AB', 'add') |
| - | 8.87±0.05ms | 8.23±0.05ms | 0.93 | sparse.Arithmetic.time_arithmetic('dia', 'BA', 'sub') |
| - | 6.24±0.04ms | 5.76±0.02ms | 0.92 | sparse.Arithmetic.time_arithmetic('coo', 'AB', 'sub') |
| - | 3.26±0.01ms | 2.98±0.02ms | 0.91 | sparse.Arithmetic.time_arithmetic('csr', 'AB', 'sub') |
| - | 7.27±0.3ms | 6.56±0.2ms | 0.9 | sparse.Arithmetic.time_arithmetic('coo', 'AA', 'sub') |
| - | 6.37±0.03ms | 5.73±0.04ms | 0.9 | sparse.Arithmetic.time_arithmetic('coo', 'BA', 'sub') |
| - | 3.39±0.02ms | 2.98±0.02ms | 0.88 | sparse.Arithmetic.time_arithmetic('csc', 'BA', 'sub') |
| - | 3.41±0.02ms | 2.99±0.01ms | 0.88 | sparse.Arithmetic.time_arithmetic('csr', 'BA', 'sub') |
| - | 3.59±0.02ms | 3.09±0.01ms | 0.86 | sparse.Arithmetic.time_arithmetic('coo', 'BB', 'sub') |
| - | 3.60±0.05ms | 3.09±0.01ms | 0.86 | sparse.Arithmetic.time_arithmetic('csc', 'BB', 'sub') |
| - | 3.58±0.02ms | 3.09±0.02ms | 0.86 | sparse.Arithmetic.time_arithmetic('csr', 'BB', 'sub') |
| - | 3.58±0.02ms | 3.08±0.01ms | 0.86 | sparse.Arithmetic.time_arithmetic('dia', 'BB', 'sub') |
| - | 1.20±0.01ms | 940±5μs | 0.79 | sparse.Arithmetic.time_arithmetic('csr', 'AA', 'multiply') |
| - | 935±30μs | 614±4μs | 0.66 | sparse.Arithmetic.time_arithmetic('csr', 'AA', 'sub') |

Looks like the CSR-AB-multiply benchmark is the only real regression, so I'll take a look into why that's happening next.

@perimosocordiae (Member)

In the one remaining regression case, A has canonical format and B does not. My current hypothesis for the slowdown is that previously, we would do:

  1. check has_canonical_format(A) -> true in O(A.nnz) time
  2. check has_canonical_format(B) -> false in < O(B.nnz) time
  3. run csr_binop_csr_general(A, B)

Now, we do:

  1. run csr_binop_csr_canonical(A, B) -> false
  2. run csr_binop_csr_general(A, B)

The old checks (1) and (2) are actually rather fast: A.nnz < B.nnz, and the check on B bails out very quickly because its indices are obviously unsorted. They also have good cache locality, since we iterate linearly through one set of indices at a time.

The new codepath does more work before it can bail out, and it has interleaved accesses for both sets of indices and data arrays, which is a lot less cache-friendly.
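For reference, a rough Python rendering of what the old standalone check does; the real routine is the C++ csr_has_canonical_format in sparsetools, which additionally verifies that indptr is non-decreasing:

```python
def has_canonical_format(indptr, indices):
    """Linear scan over one operand's index arrays: column indices must be
    strictly increasing within each row. Bails out at the first violation."""
    for i in range(len(indptr) - 1):
        prev = -1
        for j in range(indptr[i], indptr[i + 1]):
            if indices[j] <= prev:   # unsorted or duplicate column index
                return False
            prev = indices[j]
    return True
```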

But why does this appear as a regression for elementwise multiply but not addition? Maybe because the extra multiplications we do during the speculative canonical-mode call are more costly?

@perimosocordiae (Member)

One other observation: the benchmarking in the PR description only uses operands in canonical format. I would expect that we'd see less of a speedup (or even a slowdown) if we measured with non-canonical matrices.

@lucascolley changed the title from "ENH: single-pass CSR sparse matrix binops" to "ENH: sparse: single-pass CSR matrix binops" on May 18, 2024.