ENH: sparse: single-pass CSR matrix binops #19765

Open · alugowski wants to merge 3 commits into main
Conversation

@alugowski (Contributor)

Reference issue

No issue, saw this optimization opportunity while working in this file.

What does this implement/fix?

The CSR elementwise binary operators have two implementations: one for canonical form (no duplicate entries, column indices in sorted order) and another for the general case.

The previous behavior first checks both operands and then selects the appropriate implementation. However, this check requires a full scan of both operands. The canonical-form binop itself is also a single scan of both operands, so the form check adds an extra scan, and binops are memory-bandwidth-bound operations.

This PR merges the format check with the canonical form implementation. In other words, merges csr_has_canonical_format() into csr_binop_csr_canonical().

The new behavior optimistically executes the canonical implementation and falls back to the general one if the operands turn out not to be canonical, eliminating one full memory scan of both operands.
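To make the strategy concrete, below is a minimal Python sketch of the idea. It is illustrative only: the real change lives in the C++ sparsetools routines (csr_binop_csr_canonical and csr_binop_csr_general), and the function names and densifying fall-back here are stand-ins, not the scipy implementation.

```python
import numpy as np
import scipy.sparse as sp


def binop_csr_assuming_canonical(A, B, op):
    """Single-pass elementwise binop over two CSR operands, assuming canonical
    form (sorted column indices, no duplicates per row). Returns None the moment
    either operand is found to violate that assumption."""
    n_rows, n_cols = A.shape
    indptr = np.zeros(n_rows + 1, dtype=np.int64)
    indices, data = [], []
    for i in range(n_rows):
        ja, a_end = A.indptr[i], A.indptr[i + 1]
        jb, b_end = B.indptr[i], B.indptr[i + 1]
        last_a = last_b = -1
        while ja < a_end or jb < b_end:
            ca = A.indices[ja] if ja < a_end else n_cols
            cb = B.indices[jb] if jb < b_end else n_cols
            # Canonical-form check folded into the merge itself: column indices
            # must be strictly increasing within each row of each operand.
            if (ja < a_end and ca <= last_a) or (jb < b_end and cb <= last_b):
                return None
            if ca < cb:
                col, val, last_a, ja = ca, op(A.data[ja], 0), ca, ja + 1
            elif cb < ca:
                col, val, last_b, jb = cb, op(0, B.data[jb]), cb, jb + 1
            else:
                col = ca
                val = op(A.data[ja], B.data[jb])
                last_a, last_b, ja, jb = ca, cb, ja + 1, jb + 1
            if val != 0:
                indices.append(col)
                data.append(val)
        indptr[i + 1] = len(indices)
    return sp.csr_array((data, indices, indptr), shape=A.shape)


def elementwise_binop(A, B, op=np.add):
    # Optimistic fast path; fall back to a general stand-in (here: densify)
    # only if an operand turned out not to be canonical.
    result = binop_csr_assuming_canonical(A, B, op)
    if result is None:
        result = sp.csr_array(op(A.toarray(), B.toarray()))
    return result
```

Calling elementwise_binop(X, Y, np.multiply) on canonical csr_array operands takes the single-pass path; hand it an operand with unsorted or duplicated column indices and it silently switches to the fall-back instead.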

Impact

Expect a 25% to 45% speedup on large matrices.

See the attached synthetic bench_sparse_add.py script, which benchmarks elementwise-add runtime on random and 2D Poisson matrices of increasing nnz:

| Benchmark | main (s) | this PR (s) | Speedup |
|---|---|---|---|
| random-50,000 | 0.00028791697695851326 | 0.00015637511387467384 | 1.84 |
| poisson-49,600 | 0.0001479999627918005 | 0.00010145897977054119 | 1.46 |
| random-500,000 | 0.001349416095763445 | 0.0009573330171406269 | 1.41 |
| poisson-498,016 | 0.0014221249148249626 | 0.0010420831385999918 | 1.36 |
| random-5,000,000 | 0.014241500059142709 | 0.011061792029067874 | 1.29 |
| poisson-4,996,000 | 0.015750708989799023 | 0.010916250059381127 | 1.44 |
| random-50,000,000 | 0.15272545884363353 | 0.11954350001178682 | 1.28 |
| poisson-49,978,572 | 0.16468741605058312 | 0.1213894160464406 | 1.36 |

Try it yourself: `python dev.py python bench_sparse_add.py`

bench_sparse_add.py.txt
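For readers who skip the attachment, a rough stand-in for this kind of benchmark is sketched below; the actual bench_sparse_add.py differs in its details (2D Poisson operands, larger sizes), so the numbers it produces are not comparable to the table above.

```python
# Rough stand-in for the attached benchmark (not the actual bench_sparse_add.py):
# time elementwise add on pairs of random CSR matrices of growing nnz.
import timeit

import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
for n, density in [(1_000, 0.05), (5_000, 0.02), (20_000, 0.01)]:
    A = sp.random(n, n, density=density, format="csr", random_state=rng)
    B = sp.random(n, n, density=density, format="csr", random_state=rng)
    t = min(timeit.repeat(lambda: A + B, number=10, repeat=5)) / 10
    print(f"random-{A.nnz + B.nnz:,}:\t{t:.6f} s per add")
```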

The CSR elementwise binary operators have two implementations: one for canonical and another for general formats. This merges the format check with the canonical format implementation. Now optimistically execute the canonical implementation with a fallback to the general if the operands are not canonical. This eliminates a full memory scan of each operand, which speeds up this memory-bound operation.
github-actions bot added the scipy.sparse, C/C++ (Items related to the internal C/C++ code base), and enhancement (A new feature or improvement) labels on Dec 27, 2023.
@VitalyChait (Contributor) left a comment:

lgtm

@rgommers (Member)

Thanks @alugowski!

First benchmark run (same as on gh-19766) seems to show that things are indeed significantly faster, except for the DIA format.

| Change   | Before [bee9b9ae] <main>   | After [f427cf01] <fastbinop>   |   Ratio | Benchmark (Parameter)                                      |
|----------|----------------------------|--------------------------------|---------|------------------------------------------------------------|
| +        | 47.1±0.7ms                 | 54.9±5ms                       |    1.17 | sparse.Arithmetic.time_arithmetic('dia', 'BB', '__sub__')  |
| +        | 67.1±1ms                   | 77.9±7ms                       |    1.16 | sparse.Arithmetic.time_arithmetic('dia', 'BB', 'multiply') |
| +        | 67.8±0.2ms                 | 77.3±7ms                       |    1.14 | sparse.Arithmetic.time_arithmetic('dia', 'BB', '__add__')  |
| -        | 72.0±7ms                   | 64.6±1ms                       |    0.9  | sparse.Arithmetic.time_arithmetic('csr', 'AB', '__sub__')  |
| -        | 162±7ms                    | 143±3ms                        |    0.88 | sparse.Arithmetic.time_arithmetic('coo', 'AB', 'multiply') |
| -        | 174±20ms                   | 145±1ms                        |    0.84 | sparse.Arithmetic.time_arithmetic('coo', 'AB', '__sub__')  |
| -        | 25.7±2ms                   | 21.1±0.09ms                    |    0.82 | sparse.Arithmetic.time_arithmetic('csr', 'AA', 'multiply') |
| -        | 23.3±2ms                   | 18.9±0.4ms                     |    0.81 | sparse.Arithmetic.time_arithmetic('csc', 'AA', '__sub__')  |
| -        | 76.3±8ms                   | 61.0±2ms                       |    0.8  | sparse.Arithmetic.time_arithmetic('csr', 'AB', 'multiply') |
| -        | 26.8±3ms                   | 18.3±0.2ms                     |    0.68 | sparse.Arithmetic.time_arithmetic('csr', 'AA', '__add__')  |

These benchmarks need a bit of care, because the result is now going to depend strongly on whether the result is in canonical form or not. Here, it turns out that A is canonical while B is not, due to it being constructed as A ** 2.
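As an aside for anyone reproducing this, canonical form can be inspected and restored explicitly. A generic illustration (not the benchmark's actual A and B):

```python
import numpy as np
import scipy.sparse as sp

# Build a CSR array whose row 0 has unsorted column indices, i.e. non-canonical.
data = np.array([1.0, 2.0, 3.0])
indices = np.array([2, 0, 1])      # unsorted within row 0
indptr = np.array([0, 3, 3])
X = sp.csr_array((data, indices, indptr), shape=(2, 3))

print(X.has_canonical_format)      # False
X.sum_duplicates()                 # sorts indices and merges duplicates in place
print(X.has_canonical_format)      # True
```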

If we insert something like this in the benchmark setup code to ensure both matrices are in canonical form:

        if hasattr(x, 'sum_duplicates'):
            x.sum_duplicates()
        if hasattr(self.y, 'sum_duplicates'):
            self.y.sum_duplicates()

we get:

| Change   | Before [bee9b9ae] <main>   | After [f427cf01] <fastbinop>   |   Ratio | Benchmark (Parameter)                                      |
|----------|----------------------------|--------------------------------|---------|------------------------------------------------------------|
| +        | 28.4±0.3ms                 | 40.1±0.7ms                     |    1.41 | sparse.Arithmetic.time_arithmetic('csr', 'BB', '__sub__')  |
| +        | 28.5±0.4ms                 | 39.1±0.3ms                     |    1.37 | sparse.Arithmetic.time_arithmetic('coo', 'BB', '__sub__')  |
| +        | 28.3±0.2ms                 | 38.9±0.3ms                     |    1.37 | sparse.Arithmetic.time_arithmetic('dia', 'BB', '__sub__')  |
| +        | 30.2±1ms                   | 39.5±0.4ms                     |    1.31 | sparse.Arithmetic.time_arithmetic('csc', 'BB', '__sub__')  |
| +        | 50.0±2ms                   | 64.6±3ms                       |    1.29 | sparse.Arithmetic.time_arithmetic('csr', 'AB', '__sub__')  |
| +        | 46.8±2ms                   | 60.0±2ms                       |    1.28 | sparse.Arithmetic.time_arithmetic('csr', 'AB', 'multiply') |
| +        | 51.4±0.1ms                 | 64.5±1ms                       |    1.26 | sparse.Arithmetic.time_arithmetic('csc', 'BA', '__sub__')  |
| +        | 51.3±3ms                   | 64.8±1ms                       |    1.26 | sparse.Arithmetic.time_arithmetic('csr', 'AB', '__add__')  |
| +        | 52.7±2ms                   | 66.0±1ms                       |    1.25 | sparse.Arithmetic.time_arithmetic('csc', 'AB', '__sub__')  |
| +        | 50.8±2ms                   | 63.7±2ms                       |    1.25 | sparse.Arithmetic.time_arithmetic('csr', 'BA', '__sub__')  |
| +        | 54.7±0.8ms                 | 67.7±0.9ms                     |    1.24 | sparse.Arithmetic.time_arithmetic('csc', 'AB', 'multiply') |
| +        | 62.5±1ms                   | 76.9±1ms                       |    1.23 | sparse.Arithmetic.time_arithmetic('coo', 'AB', '__add__')  |
| +        | 54.3±0.5ms                 | 66.8±0.9ms                     |    1.23 | sparse.Arithmetic.time_arithmetic('csc', 'BA', 'multiply') |
| +        | 48.2±2ms                   | 58.7±3ms                       |    1.22 | sparse.Arithmetic.time_arithmetic('csr', 'BA', 'multiply') |
| +        | 60.3±0.9ms                 | 73.0±2ms                       |    1.21 | sparse.Arithmetic.time_arithmetic('coo', 'AB', 'multiply') |
| +        | 52.3±3ms                   | 63.4±2ms                       |    1.21 | sparse.Arithmetic.time_arithmetic('csr', 'BA', '__add__')  |
| +        | 62.3±2ms                   | 74.6±2ms                       |    1.2  | sparse.Arithmetic.time_arithmetic('coo', 'AB', '__sub__')  |
| +        | 61.9±1ms                   | 73.6±2ms                       |    1.19 | sparse.Arithmetic.time_arithmetic('coo', 'BA', '__sub__')  |
| +        | 54.2±0.8ms                 | 64.5±1ms                       |    1.19 | sparse.Arithmetic.time_arithmetic('csc', 'AB', '__add__')  |
| +        | 54.0±0.6ms                 | 64.0±0.4ms                     |    1.19 | sparse.Arithmetic.time_arithmetic('csc', 'BA', '__add__')  |
| +        | 65.3±4ms                   | 75.8±2ms                       |    1.16 | sparse.Arithmetic.time_arithmetic('coo', 'BA', 'multiply') |
| +        | 210±6ms                    | 233±7ms                        |    1.11 | sparse.Arithmetic.time_arithmetic('dia', 'BA', '__sub__')  |
| +        | 67.4±2ms                   | 74.2±2ms                       |    1.1  | sparse.Arithmetic.time_arithmetic('coo', 'BA', '__add__')  |
| +        | 209±5ms                    | 222±8ms                        |    1.06 | sparse.Arithmetic.time_arithmetic('dia', 'AB', '__add__')  |
| +        | 213±6ms                    | 225±7ms                        |    1.06 | sparse.Arithmetic.time_arithmetic('dia', 'BA', 'multiply') |
| -        | 59.3±2ms                   | 53.9±0.6ms                     |    0.91 | sparse.Arithmetic.time_arithmetic('coo', 'BB', '__add__')  |
| -        | 60.4±0.9ms                 | 54.6±0.6ms                     |    0.9  | sparse.Arithmetic.time_arithmetic('csc', 'BB', '__add__')  |
| -        | 59.3±0.5ms                 | 53.4±0.1ms                     |    0.9  | sparse.Arithmetic.time_arithmetic('dia', 'BB', '__add__')  |
| -        | 53.1±2ms                   | 47.1±2ms                       |    0.89 | sparse.Arithmetic.time_arithmetic('coo', 'AA', '__add__')  |
| -        | 26.8±0.6ms                 | 23.8±0.2ms                     |    0.89 | sparse.Arithmetic.time_arithmetic('csc', 'AA', 'multiply') |
| -        | 26.9±0.3ms                 | 20.9±0.2ms                     |    0.78 | sparse.Arithmetic.time_arithmetic('csc', 'AA', '__add__')  |
| -        | 23.5±0.3ms                 | 18.3±0.2ms                     |    0.78 | sparse.Arithmetic.time_arithmetic('csr', 'AA', '__add__')  |

B should be in canonical form there, but the results hint that the new single-pass check for that is failing for __sub__ and multiply; at least, I can't otherwise explain why those results regress here. I'd also expect the __sub__ and __add__ results to be symmetric, so it looks like something isn't quite right yet.

There are also a number of -Wmaybe-uninitialized compiler warnings; it'd be good to address those, and perhaps that catches something.

Running the bench_sparse_add.py script from the PR description, the results improve, but not as much as in the PR description; more like in the 0% to 10% range for me:

# this PR:
random-50,000:  0.00027637499442789704
poisson-49,600:         0.00023411099391523749
random-500,000:         0.001635348002309911
poisson-498,016:        0.0017717820010147989
random-5,000,000:       0.023091215000022203
poisson-4,996,000:      0.02406818899908103
random-50,000,000:      0.22385444400424603
poisson-49,978,572:     0.23443531700468156

# main:
random-50,000:  0.00038524300907738507
poisson-49,600:         0.00021164999634493142
random-500,000:         0.00149274300201796
poisson-498,016:        0.0016441019979538396
random-5,000,000:       0.022221983992494643
poisson-4,996,000:      0.025533260995871387
random-50,000,000:      0.21116333699319512
poisson-49,978,572:     0.2499628399964422

# main run 2:
random-50,000:  0.00038719200529158115
poisson-49,600:         0.00022731799981556833
random-500,000:         0.001550414992379956
poisson-498,016:        0.0016773520037531853
random-5,000,000:       0.022551914007635787
poisson-4,996,000:      0.02617731600184925
random-50,000,000:      0.21719301299890503
poisson-49,978,572:     0.2543543719948502

# this PR, run 2:
random-50,000:  0.0002743800141615793
poisson-49,600:         0.00022587399871554226
random-500,000:         0.0016353649989468977
poisson-498,016:        0.0016516460018465295
random-5,000,000:       0.02321807699627243
poisson-4,996,000:      0.023670254988246597
random-50,000,000:      0.2263502830028301
poisson-49,978,572:     0.23783767299028113

I'm not yet sure what to make of all this. Maybe you can run the same asv benchmark @alugowski and see how that looks for you?

@rgommers (Member)

Repeating the benchmark script on a macOS arm64 (M1) machine, I do see similar results as in the PR description:

# this PR
random-50,000:  0.00017312500131083652
poisson-49,600:         0.00012250000145286322
random-500,000:         0.0010954169993055984
poisson-498,016:        0.0013311670045368373
random-5,000,000:       0.01243333300226368
poisson-4,996,000:      0.013911583999288268
random-50,000,000:      0.12745545800135005
poisson-49,978,572:     0.14511337500152877

# run 2:
random-50,000:  0.00017150000348920003
poisson-49,600:         0.0001229170011356473
random-500,000:         0.0010238329996354878
poisson-498,016:        0.001206166998599656
random-5,000,000:       0.013366833998588845
poisson-4,996,000:      0.013843332999385893
random-50,000,000:      0.12439233400073135
poisson-49,978,572:     0.1430313750024652


# main:
random-50,000:  0.00029024999821558595
poisson-49,600:         0.00013862500054528937
random-500,000:         0.0014803340018261224
poisson-498,016:        0.0016382500034524128
random-5,000,000:       0.015349457993579563
poisson-4,996,000:      0.015778542001498863
random-50,000,000:      0.15929004100325983
poisson-49,978,572:     0.16495433400268666

# run 2:
random-50,000:  0.00027474999660626054
poisson-49,600:         0.0001384170027449727
random-500,000:         0.0013510419958038256
poisson-498,016:        0.0014166249966365285
random-5,000,000:       0.016083499998785555
poisson-4,996,000:      0.01596837499528192
random-50,000,000:      0.15811200000462122
poisson-49,978,572:     0.16117887500149664

It looks like the trade-off here depends pretty strongly on CPU and memory architecture.

@perimosocordiae self-requested a review on March 4, 2024 at 19:34.
@perimosocordiae (Member)

I just pushed a commit to clean up the implementation a bit: I removed some extra variables and ensured the ones that remain are always initialized.

I haven't run the benchmarks locally yet, but I definitely want to understand why we'd see regressions for __sub__ but not __add__ here.

@perimosocordiae (Member)

ASV benchmarks without the changes to ensure canonical format ahead of time:

| Change | Before [8ce393f] | After [d9a04dc7] | Ratio | Benchmark (Parameter) |
|---|---|---|---|---|
| + | 3.48±0ms | 4.04±0ms | 1.16 | sparse.Arithmetic.time_arithmetic('dia', 'BB', 'add') |
| + | 3.58±0ms | 4.07±0ms | 1.14 | sparse.Arithmetic.time_arithmetic('dia', 'BB', 'multiply') |
| + | 4.97±0ms | 5.36±0ms | 1.08 | sparse.Arithmetic.time_arithmetic('csc', 'BA', 'add') |
| + | 13.6±0ms | 14.3±0ms | 1.05 | sparse.Arithmetic.time_arithmetic('csc', 'BA', 'mul') |
| - | 9.72±0ms | 9.19±0ms | 0.95 | sparse.Arithmetic.time_arithmetic('coo', 'AB', 'add') |
| - | 11.4±0ms | 10.7±0ms | 0.94 | sparse.Arithmetic.time_arithmetic('coo', 'AA', 'add') |
| - | 11.5±0ms | 10.7±0ms | 0.94 | sparse.Arithmetic.time_arithmetic('coo', 'AA', 'multiply') |
| - | 9.68±0ms | 9.13±0ms | 0.94 | sparse.Arithmetic.time_arithmetic('coo', 'AB', 'sub') |
| - | 6.65±0ms | 6.27±0ms | 0.94 | sparse.Arithmetic.time_arithmetic('csr', 'AA', 'mul') |
| - | 9.96±0ms | 9.30±0ms | 0.93 | sparse.Arithmetic.time_arithmetic('coo', 'BA', 'sub') |
| - | 4.61±0ms | 4.28±0ms | 0.93 | sparse.Arithmetic.time_arithmetic('csr', 'AB', 'add') |
| - | 10.4±0ms | 9.66±0ms | 0.93 | sparse.Arithmetic.time_arithmetic('dia', 'AB', 'multiply') |
| - | 17.6±0ms | 16.1±0ms | 0.91 | sparse.Arithmetic.time_arithmetic('coo', 'AA', 'mul') |
| - | 4.06±0ms | 3.64±0ms | 0.9 | sparse.Arithmetic.time_arithmetic('coo', 'BB', 'sub') |
| - | 4.70±0ms | 4.20±0ms | 0.89 | sparse.Arithmetic.time_arithmetic('csr', 'AB', 'sub') |
| - | 4.69±0ms | 4.20±0ms | 0.89 | sparse.Arithmetic.time_arithmetic('csr', 'BA', 'sub') |
| - | 3.57±0ms | 3.16±0ms | 0.89 | sparse.Arithmetic.time_arithmetic('dia', 'BB', 'sub') |
| - | 4.24±0ms | 3.69±0ms | 0.87 | sparse.Arithmetic.time_arithmetic('csc', 'BB', 'sub') |
| - | 3.63±0ms | 3.14±0ms | 0.86 | sparse.Arithmetic.time_arithmetic('csr', 'BB', 'sub') |
| - | 10.9±0ms | 9.19±0ms | 0.84 | sparse.Arithmetic.time_arithmetic('coo', 'AA', 'sub') |
| - | 2.03±0ms | 1.54±0ms | 0.76 | sparse.Arithmetic.time_arithmetic('csr', 'AA', 'multiply') |
| - | 998±0μs | 708±0μs | 0.71 | sparse.Arithmetic.time_arithmetic('csr', 'AA', 'sub') |

@perimosocordiae (Member) commented on May 15, 2024

The benchmark changes for DIA are almost certainly noise, because adding two DIA-format matrices together doesn't involve conversion to CSR or CSC, and doesn't use sparsetools.

EDIT: I was mistaken here. The B matrix is defined as A**2, which converts from DIA to CSR. The results are still noisy due to running ASV with the --quick flag set by default, though.

@perimosocordiae (Member)

Here are my ASV results after removing the --quick flag and increasing the number of rounds to 4:

| Change | Before [8ce393f] | After [d9a04dc7] | Ratio | Benchmark (Parameter) |
|---|---|---|---|---|
| + | 3.49±0.02ms | 4.08±0.06ms | 1.17 | sparse.Arithmetic.time_arithmetic('csr', 'AB', 'multiply') |
| - | 19.2±0.4ms | 18.2±0.4ms | 0.95 | sparse.Arithmetic.time_arithmetic('dia', 'AA', 'multiply') |
| - | 14.5±0.5ms | 13.7±0.4ms | 0.94 | sparse.Arithmetic.time_arithmetic('coo', 'AA', 'mul') |
| - | 6.21±0.03ms | 5.81±0.03ms | 0.94 | sparse.Arithmetic.time_arithmetic('coo', 'AB', 'add') |
| - | 1.34±0.02ms | 1.26±0.01ms | 0.94 | sparse.Arithmetic.time_arithmetic('csc', 'AA', 'sub') |
| - | 8.77±0.06ms | 8.28±0.06ms | 0.94 | sparse.Arithmetic.time_arithmetic('dia', 'AB', 'sub') |
| - | 1.00±0ms | 932±5μs | 0.93 | sparse.Arithmetic.time_arithmetic('csr', 'AA', 'add') |
| - | 3.25±0.02ms | 3.03±0.02ms | 0.93 | sparse.Arithmetic.time_arithmetic('csr', 'AB', 'add') |
| - | 8.87±0.05ms | 8.23±0.05ms | 0.93 | sparse.Arithmetic.time_arithmetic('dia', 'BA', 'sub') |
| - | 6.24±0.04ms | 5.76±0.02ms | 0.92 | sparse.Arithmetic.time_arithmetic('coo', 'AB', 'sub') |
| - | 3.26±0.01ms | 2.98±0.02ms | 0.91 | sparse.Arithmetic.time_arithmetic('csr', 'AB', 'sub') |
| - | 7.27±0.3ms | 6.56±0.2ms | 0.9 | sparse.Arithmetic.time_arithmetic('coo', 'AA', 'sub') |
| - | 6.37±0.03ms | 5.73±0.04ms | 0.9 | sparse.Arithmetic.time_arithmetic('coo', 'BA', 'sub') |
| - | 3.39±0.02ms | 2.98±0.02ms | 0.88 | sparse.Arithmetic.time_arithmetic('csc', 'BA', 'sub') |
| - | 3.41±0.02ms | 2.99±0.01ms | 0.88 | sparse.Arithmetic.time_arithmetic('csr', 'BA', 'sub') |
| - | 3.59±0.02ms | 3.09±0.01ms | 0.86 | sparse.Arithmetic.time_arithmetic('coo', 'BB', 'sub') |
| - | 3.60±0.05ms | 3.09±0.01ms | 0.86 | sparse.Arithmetic.time_arithmetic('csc', 'BB', 'sub') |
| - | 3.58±0.02ms | 3.09±0.02ms | 0.86 | sparse.Arithmetic.time_arithmetic('csr', 'BB', 'sub') |
| - | 3.58±0.02ms | 3.08±0.01ms | 0.86 | sparse.Arithmetic.time_arithmetic('dia', 'BB', 'sub') |
| - | 1.20±0.01ms | 940±5μs | 0.79 | sparse.Arithmetic.time_arithmetic('csr', 'AA', 'multiply') |
| - | 935±30μs | 614±4μs | 0.66 | sparse.Arithmetic.time_arithmetic('csr', 'AA', 'sub') |

Looks like the CSR-AB-multiply benchmark is the only real regression, so I'll take a look into why that's happening next.

@perimosocordiae (Member)

In the one remaining regression case, A has canonical format and B does not. My current hypothesis for the slowdown is that previously, we would do:

  1. check has_canonical_format(A) -> true in O(A.nnz) time
  2. check has_canonical_format(B) -> false in < O(B.nnz) time
  3. run csr_binop_csr_general(A, B)

Now, we do:

  1. run csr_binop_csr_canonical(A, B) -> false
  2. run csr_binop_csr_general(A, B)

The old checks (1) and (2) are actually rather fast: A.nnz < B.nnz, and the check on B bails out very quickly because its indices are obviously unsorted. They also have good cache locality, since we iterate linearly through one set of indices at a time.

The new codepath does more work before it can bail out, and it has interleaved accesses for both sets of indices and data arrays, which is a lot less cache-friendly.
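For reference, a rough Python rendering of what the old standalone check does; the real routine is the C++ csr_has_canonical_format in sparsetools, which additionally verifies that indptr is non-decreasing:

```python
def has_canonical_format(indptr, indices):
    """Linear scan over one operand's index arrays: column indices must be
    strictly increasing within each row. Bails out at the first violation."""
    for i in range(len(indptr) - 1):
        prev = -1
        for j in range(indptr[i], indptr[i + 1]):
            if indices[j] <= prev:   # unsorted or duplicate column index
                return False
            prev = indices[j]
    return True
```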

But why does this appear as a regression for elementwise multiply but not addition? Maybe because the extra multiplications we do during the speculative canonical-mode call are more costly?

@perimosocordiae (Member)

One other observation: the benchmarking in the PR description only uses operands in canonical format. I would expect that we'd see less of a speedup (or even a slowdown) if we measured with non-canonical matrices.

@lucascolley changed the title from "ENH: single-pass CSR sparse matrix binops" to "ENH: sparse: single-pass CSR matrix binops" on May 18, 2024.