Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH Add the fused CSR dense case for Euclidean Specializations #25044

Merged

Conversation

jjerphan
Copy link
Member

@jjerphan jjerphan commented Nov 25, 2022

Reference Issues/PRs

Towards #22587.
Follow up of #24556.

What does this implement/fix? Explain your changes.

The CSR-dense and the dense-CSR cases were chosen not to be supported for the Euclidean specialization of all PairwiseDistancesReductions (for more context see #23585 (comment)).

This PR implements SparseDenseMiddleTermComputer allows computing the middle term of the distance matrix decomposition for the Euclidean specializations, covering those two missing cases.

Hence this completes all the combinations for the Euclidean specialisations (:tada:).

Any other comments?

Different designs have been explored in other Pull Requests to factor some logic altogether or rethink DatasetsPairs w.r.t. MiddleTermComputer for the Euclidean specialisations.

In overall, this PR seems to have the best tradeoff regarding performance and duplication of code.

Benchmarks

This makes using PairwiseDistancesReductions on the CSR-dense and the dense-CSR for euclidean competitive w.r.t to the previous implementation relying on joblib.

One can get up to ×2 on a laptop.

Details
In [1]: from sklearn.neighbors import NearestNeighbors
   ...: from sklearn.datasets import make_classification
   ...: from scipy.sparse import csr_matrix
   ...: 
   ...: import sklearn

In [2]: X_train, _ = make_classification(n_samples=100_000, n_features=100)
   ...: X_test, _ = make_classification(n_samples=256, n_features=100)
   ...: 
   ...: X_test = csr_matrix(X_test)

In [3]: nn = NearestNeighbors().fit(X_train)

In [4]: %%timeit
   ...: nn.kneighbors(X_test)
   ...: 
   ...: 
899 ms ± 13.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: %%timeit
   ...: with sklearn.config_context(enable_cython_pairwise_dist=False):
   ...:     nn.kneighbors(X_test)
   ...: 
2.24 s ± 13.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: X_train, _ = make_classification(n_samples=256, n_features=100)
   ...: X_test, _ = make_classification(n_samples=100_000, n_features=100)
   ...: 
   ...: X_test = csr_matrix(X_test)

In [7]: nn = NearestNeighbors().fit(X_train)

In [8]: %%timeit
   ...: nn.kneighbors(X_test)
   ...: 
   ...: 
760 ms ± 21.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [9]: %%timeit
   ...: with sklearn.config_context(enable_cython_pairwise_dist=False):
   ...:     nn.kneighbors(X_test)
   ...: 
1.22 s ± 14.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

TODO:

  • Refactor the logic to:
    • Remove the cast on the entire dense array and data array and only upcast chunks using dedicated buffers (as done for DenseDenseMiddleTermComputer64
    • Redesign DatasetsPair w.r.t the new MiddleTermComputer to remove the duplicated logic and to clarify responsabilities (esp. squared euclidean norm computations)
  • Add more tests?

@jjerphan jjerphan force-pushed the enh/pdr-fused-sparse-dense-euclidean branch from a943f79 to 0243c1d Compare February 1, 2023 10:26
@jjerphan jjerphan marked this pull request as ready for review February 27, 2023 10:55
jjerphan and others added 2 commits February 27, 2023 12:09
This was already studied in:
scikit-learn#25449

Co-authored-by: Vincent M <maladiere.vincent@yahoo.fr>
@jjerphan
Copy link
Member Author

jjerphan commented Feb 27, 2023

Hi @OmarManzoor, @Vincent-Maladiere, @Micky774 and @adam2392: might some of you might be interested in reviewing this PR? 🙂

@OmarManzoor
Copy link
Contributor

OmarManzoor commented Feb 27, 2023

Hi @OmarManzoor, @Vincent-Maladiere and @adam2392: might some of you might be interested in reviewing this PR? 🙂

Sure! I might need some time to understand the context of the PR though.

@jjerphan
Copy link
Member Author

I am wondering about the kind of tests we could add. Should we add tests on combinations of sparse and dense datasets for all the interfaces? Should we add more of them?

@jjerphan
Copy link
Member Author

jjerphan commented Feb 27, 2023

I might need some time to understand the context of the PR though.

Yes, feel free to take your time and do not feel pressured (the class hierarchy is rather dense and its implementations Tempita-heavy).

This is required with Cython>=3.0.
@jjerphan
Copy link
Member Author

Should I write more about the design?

Copy link
Contributor

@OmarManzoor OmarManzoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Maybe we can enhance the tests by including the combinations within test_sqeuclidean_row_norms and test_pairwise_distances_argkmin.

jjerphan and others added 2 commits February 28, 2023 14:26
Co-authored-by: Omar Salman <omar.salman@arbisoft.com>
doc/whats_new/v1.3.rst Outdated Show resolved Hide resolved
doc/whats_new/v1.3.rst Show resolved Hide resolved
Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work!

and an extra +1 for the opportunity to remove the lines with the XOR operator which were not easy to reason about :)

doc/whats_new/v1.3.rst Outdated Show resolved Hide resolved
doc/whats_new/v1.3.rst Outdated Show resolved Hide resolved
@ogrisel
Copy link
Member

ogrisel commented Mar 1, 2023

Add more tests?

I think there are enough tests in this PR and the already existing tests in main.

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@jjerphan jjerphan added Waiting for Second Reviewer First reviewer is done, need a second one! and removed Waiting for Reviewer labels Mar 1, 2023
Signed-off-by: Julien Jerphanion <git@jjerphan.xyz>
@jjerphan jjerphan force-pushed the enh/pdr-fused-sparse-dense-euclidean branch from 6a3f58c to 45d425a Compare March 9, 2023 09:38
Copy link
Member

@lorentzenchr lorentzenchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
PS: I had to find something:smirk:

doc/whats_new/v1.3.rst Outdated Show resolved Hide resolved
Comment on lines -124 to -126
# uses the Squared Euclidean matrix decomposition, i.e.:
#
# ||X_c_i - Y_c_j||² = ||X_c_i||² - 2 X_c_i.Y_c_j^T + ||Y_c_j||²
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this "trick" still documented somewhere in the pairwise distances codebase?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is given here:

cdef class MiddleTermComputer{{name_suffix}}:
"""Helper class to compute a Euclidean distance matrix in chunks.
This is an abstract base class that is further specialized depending
on the type of data (dense or sparse).
`EuclideanDistance` subclasses relies on the squared Euclidean
distances between chunks of vectors X_c and Y_c using the
following decomposition for the (i,j) pair :
||X_c_i - Y_c_j||² = ||X_c_i||² - 2 X_c_i.Y_c_j^T + ||Y_c_j||²
This helper class is in charge of wrapping the common logic to compute
the middle term, i.e. `- 2 X_c_i.Y_c_j^T`.
"""

I am thinking of writing a couple of notes for the design of sklearn.metrics._pairwise_distances_reduction. What do you think of it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be great. Not too many details, but giving a good overview.

Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
@lorentzenchr
Copy link
Member

Which co-authors should I keep when merging?

@jjerphan
Copy link
Member Author

jjerphan commented Mar 10, 2023

Which co-authors should I keep when merging?

All, I guess? I let this choice to the maintainer merging.

@jjerphan jjerphan added Performance and removed Waiting for Second Reviewer First reviewer is done, need a second one! No Changelog Needed labels Mar 10, 2023
@lorentzenchr lorentzenchr enabled auto-merge (squash) March 10, 2023 12:24
@lorentzenchr lorentzenchr merged commit 67ea720 into scikit-learn:main Mar 10, 2023
@jjerphan jjerphan deleted the enh/pdr-fused-sparse-dense-euclidean branch March 10, 2023 12:32
@jjerphan
Copy link
Member Author

Thank you for your reviews, @OmarManzoor, @ogrisel, and @lorentzenchr.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants