FEA Fused sparse-dense support for PairwiseDistancesReduction #23585

Merged (83 commits) on Sep 20, 2022
Changes from 41 commits
b8bd875
MAINT Implement CSR support for all DistanceMetric
jjerphan Jun 11, 2022
7b07188
Merge branch 'main' into maint/dist-metrics-csr-support
jjerphan Jun 14, 2022
fb99680
TST Remove useless guard
jjerphan Jun 15, 2022
d39d2b2
TST Skip JaccardDistance on 32bit architecture
jjerphan Jun 15, 2022
011e2a2
MAINT Define dtype alias for sparse matrices indices
jjerphan Jun 16, 2022
a579630
MAINT Do not shadow dtype names in Tempita templating
jjerphan Jun 16, 2022
98e9d21
fixup! MAINT Define dtype alias for sparse matrices indices
jjerphan Jun 16, 2022
8aa4e44
TST Use cdist and pdist appropriately
jjerphan Jun 16, 2022
9edfa11
DOC Improve comments
jjerphan Jun 17, 2022
ee5c6bf
Fixups
jjerphan Jun 17, 2022
bf5eb59
MAINT Wrap of indptr values to support sparse-dense
jjerphan Jun 17, 2022
92b8a6c
Apply review comments
jjerphan Jun 17, 2022
dc6f8cf
More interesting boolean data for tests
ogrisel Jun 17, 2022
bb06f59
FIX Various corrections
jjerphan Jun 17, 2022
a5eb20d
FIX Make Jaccard, Hamming and Hashing robust to explicit zeros
jjerphan Jun 17, 2022
19edf11
FIX Make the other boolean DistanceMetric also robust to explicit zeros
jjerphan Jun 17, 2022
de86802
TST Remove xfail for Jaccard on 32bit arch.
jjerphan Jun 17, 2022
bb920cf
Cast to np.float64_t where appropriate
jjerphan Jun 17, 2022
b3759fe
Rename methods and correctly format their signatures
jjerphan Jun 20, 2022
7f89236
fixup! TST Remove xfail for Jaccard on 32bit arch.
jjerphan Jun 20, 2022
01a0c33
FEA CSR support for HaversineDistance
jjerphan Jun 20, 2022
7d8a717
Fix typo
jjerphan Jun 22, 2022
563e359
Do not upcast to 64bit yet keep the same precision
jjerphan Jun 22, 2022
f863a51
Do use the default rtol
jjerphan Jun 22, 2022
5ba0fbe
Set rtol explicitly in test_distance_metrics_dtype_consistency
ogrisel Jun 22, 2022
4f45839
Implement the sparse-dense and the dense-sparse case for c-contiguity
jjerphan Jun 23, 2022
3e3e888
Add validation on X and Y, accepting CSR as inputs
jjerphan Jun 23, 2022
ddc49d5
Remove left-overs
jjerphan Jun 23, 2022
a83887c
Merge branch 'main' into maint/dist-metrics-csr-support
jjerphan Jun 24, 2022
e8bb70a
Merge branch 'main' into maint/pdr-sparse-support
jjerphan Jul 2, 2022
dec0aa8
Add support for all combinations of {dense,sparse} datasets pairs
jjerphan Jul 2, 2022
63c6fe3
Const-qualify X and Y
jjerphan Jul 5, 2022
0bb368f
Merge branch 'main' into maint/pdr-sparse-support
jjerphan Jul 5, 2022
30e84af
Only pass Y_norm_squared for the dense-dense case
jjerphan Jul 5, 2022
780d7bb
Update comments
jjerphan Jul 6, 2022
72f4ae7
Pop unused keywords arguments
jjerphan Jul 19, 2022
a1ce042
Remove unused import
jjerphan Jul 19, 2022
713b932
DOC Add whats_new entry
jjerphan Jul 19, 2022
ac6208d
Merge branch 'main' into maint/pdr-sparse-support
jjerphan Jul 19, 2022
5570f71
fixup! Pop unused keywords arguments
jjerphan Jul 19, 2022
4c455ea
DOC Update comment and changelog
jjerphan Jul 20, 2022
0f0ea70
MAINT Test second alternative for sparse-dense support
jjerphan Jul 21, 2022
80b8c02
Merge branch 'main' into maint/pdr-sparse-support
jjerphan Aug 8, 2022
a3cf4d8
DOC Update comment and changelog
jjerphan Jul 20, 2022
e9da38b
Merge branch 'maint/pdr-sparse-support' into alt/feat/pdr-sparse-support
jjerphan Aug 9, 2022
e9ecbbc
DOC Improve comments and code self-documentation
jjerphan Aug 10, 2022
c8bacc6
Merge branch 'main' into maint/pdr-sparse-support
jjerphan Aug 10, 2022
180a54e
Merge branch 'maint/pdr-sparse-support' into alt/feat/pdr-sparse-support
jjerphan Aug 10, 2022
de52371
TST `dtype`-parametrize `test_format_agnosticism`
jjerphan Aug 11, 2022
3243153
fixup! TST `dtype`-parametrize `test_format_agnosticism`
jjerphan Aug 11, 2022
e521992
MNT Pushing data up instead of indices
thomasjpfan Aug 11, 2022
972fff9
DOC Improve comment
thomasjpfan Aug 11, 2022
3086c0b
REV Revert back to memoryviews for indices
thomasjpfan Aug 11, 2022
afa0c35
DOC Spelling
thomasjpfan Aug 11, 2022
bc78747
Merge pull request #16 from thomasjpfan/alt/feat/pdr-sparse-support
jjerphan Aug 11, 2022
bd48ef0
Merge pull request #15 from jjerphan/alt/feat/pdr-sparse-support
jjerphan Aug 11, 2022
be59297
TST Suggest logic adaptation for _pairwise_{dense_sparse,sparse_dense}
jjerphan Aug 22, 2022
f8ab496
Merge branch 'main' into maint/pdr-sparse-support
jjerphan Aug 25, 2022
ec1d4f9
DOC Add co-authors in `whats_new` entry
jjerphan Aug 25, 2022
fbf311e
Do not support CSR matrices without non-zero elements
jjerphan Aug 26, 2022
511c6e6
Merge branch 'main' into maint/pdr-sparse-support
jjerphan Aug 28, 2022
8fddffd
fixup! Merge branch 'main' into maint/pdr-sparse-support
jjerphan Aug 28, 2022
4b879f1
MAINT Do not pop Y_norm_squared when unused
jjerphan Aug 29, 2022
0b1ce13
Merge branch 'main' into maint/pdr-sparse-support
jjerphan Sep 9, 2022
ca49236
Explicitly do not support CSR matrices with int64 indices and indptr
jjerphan Sep 9, 2022
d7b3649
DOC Update and improve comment for the alternative CSR representation
jjerphan Sep 9, 2022
5e13663
fixup! DOC Update and improve comment for the alternative CSR represe…
jjerphan Sep 9, 2022
1de8acb
CI Retrigger CI due to faulty runs
jjerphan Sep 9, 2022
7766388
Merge branch 'main' into maint/pdr-sparse-support
jjerphan Sep 12, 2022
8f43a5a
DOC Update whats_new entry
jjerphan Sep 12, 2022
a229b35
Test and document Isomap on sparse data
jjerphan Sep 12, 2022
3e357a2
Test and document TSNE on sparse data
jjerphan Sep 12, 2022
9723115
Test and document pairwise_distances_argmin on sparse data
jjerphan Sep 12, 2022
faf704a
Test and document LocalOutlierFactor on sparse data
jjerphan Sep 12, 2022
c66bb82
DOC Add support for sparse data for NearestNeighbors, KNeighbors*, Ra…
jjerphan Sep 12, 2022
1eb5b2c
DOC Remove formatting change
jjerphan Sep 12, 2022
fcf15b6
TST Do not test on full cartesian product
jjerphan Sep 15, 2022
58453d7
fixup! TST Do not test on full cartesian product
jjerphan Sep 15, 2022
63fda8c
TST Add TODO for consistency checks on results for sparse and dense data
jjerphan Sep 15, 2022
1d7bcc7
MAINT Mark PairwiseDistancesReductions as unusable for some config.
jjerphan Sep 15, 2022
fec55bf
fixup! MAINT Mark PairwiseDistancesReductions as unusable for some co…
jjerphan Sep 15, 2022
d55bcec
TST Improve test_format_agnosticism
jjerphan Sep 20, 2022
c21187a
DOC Update comment regarding the use of pairwise_distances_chunked
jjerphan Sep 20, 2022
20 changes: 20 additions & 0 deletions doc/whats_new/v1.2.rst
@@ -49,6 +49,26 @@ Changes impacting all modules
second-pass algorithm.
:pr:`23197` by :user:`Meekail Zain <micky774>`

- |Enhancement| Support for combinations of dense and sparse datasets pairs
for all distance metrics has been added to the following functions and estimators:

- :func:`sklearn.metrics.pairwise_distances_argmin`
- :func:`sklearn.metrics.pairwise_distances_argmin_min`
- :class:`sklearn.cluster.AffinityPropagation`
- :class:`sklearn.cluster.Birch`
- :class:`sklearn.cluster.SpectralClustering`
- :class:`sklearn.neighbors.KNeighborsClassifier`
- :class:`sklearn.neighbors.KNeighborsRegressor`
- :class:`sklearn.neighbors.RadiusNeighborsClassifier`
- :class:`sklearn.neighbors.RadiusNeighborsRegressor`
- :class:`sklearn.neighbors.LocalOutlierFactor`
- :class:`sklearn.neighbors.NearestNeighbors`
- :class:`sklearn.manifold.Isomap`
- :class:`sklearn.manifold.TSNE`
- :func:`sklearn.manifold.trustworthiness`

:pr:`23585` by :user:`Julien Jerphanion <jjerphan>`

Changelog
---------

10 changes: 8 additions & 2 deletions sklearn/metrics/_pairwise_distances_reduction/_argkmin.pyx
@@ -79,8 +79,14 @@ cdef class PairwiseDistancesArgKmin64(PairwiseDistancesReduction64):
metric_kwargs=metric_kwargs,
)
else:
# Fall back on a generic implementation that handles most scipy
# metrics by computing the distances between 2 vectors at a time.
# Fall back on a generic implementation that handles all distance
# metrics by computing it between 2 vectors at a time.

# The extra `Y_norm_squared` argument for the back-end is only
# supported for the FastEuclidean variant.
if metric_kwargs is not None:
metric_kwargs.pop("Y_norm_squared", None)

pda = PairwiseDistancesArgKmin64(
datasets_pair=DatasetsPair.get_for(X, Y, metric, metric_kwargs),
k=k,
37 changes: 34 additions & 3 deletions sklearn/metrics/_pairwise_distances_reduction/_datasets_pair.pxd
@@ -1,9 +1,11 @@
from ...utils._typedefs cimport DTYPE_t, ITYPE_t
from ...utils._typedefs cimport DTYPE_t, ITYPE_t, SPARSE_INDEX_TYPE_t
from ...metrics._dist_metrics cimport DistanceMetric


cdef class DatasetsPair:
cdef DistanceMetric distance_metric
cdef:
DistanceMetric distance_metric
ITYPE_t n_features

cdef ITYPE_t n_samples_X(self) nogil

@@ -18,4 +20,33 @@ cdef class DenseDenseDatasetsPair(DatasetsPair):
cdef:
const DTYPE_t[:, ::1] X
const DTYPE_t[:, ::1] Y
ITYPE_t d


cdef class SparseSparseDatasetsPair(DatasetsPair):
cdef:
const DTYPE_t[:] X_data
const SPARSE_INDEX_TYPE_t[:] X_indices
const SPARSE_INDEX_TYPE_t[:] X_indptr

const DTYPE_t[:] Y_data
const SPARSE_INDEX_TYPE_t[:] Y_indices
const SPARSE_INDEX_TYPE_t[:] Y_indptr
Member:

Is the plan to Tempita all of these? Specifically, DTYPE_t to become float32 and float64 and SPARSE_INDEX_TYPE_t to become int32 and int64.

If so, does this extend to all combinations? For example, a sparse dataset with float32 data and int64 indices paired with a sparse dataset with float64 data and int32 indices. 🤯

Member Author:

Potentially yes, especially for DTYPE_t, which will remain fixed per datasets pair.

I don't yet see how to easily handle the various cases for SPARSE_INDEX_TYPE_t using Cython, and I fear the complex logic and manual templating…

Other languages would have made some of this logic easier, but going down this road should come after broader discussions IMO.
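
To make the concern about combinations concrete, here is a small counting sketch. The dtype names come from the question above; the premise that each (value dtype, index dtype) choice would need its own generated specialization is an assumption made only for illustration, not a statement about how Tempita would actually be used here.

```python
from itertools import product

value_dtypes = ("float32", "float64")  # candidates for DTYPE_t
index_dtypes = ("int32", "int64")      # candidates for SPARSE_INDEX_TYPE_t

# One sparse dataset is described by a (value dtype, index dtype) choice.
per_dataset = list(product(value_dtypes, index_dtypes))

# Sparse-sparse pair with fully independent choices for X and Y.
unconstrained = list(product(per_dataset, per_dataset))

# Same pair, but X and Y must share the value dtype;
# their index dtypes may still differ.
shared_value_dtype = [(x, y) for x, y in unconstrained if x[0] == y[0]]

print(len(per_dataset), len(unconstrained), len(shared_value_dtype))
```

Under these assumptions, fixing the value dtype per pair halves the number of sparse-sparse variants, and the remaining growth comes entirely from the index dtype.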



cdef class SparseDenseDatasetsPair(DatasetsPair):
cdef:
const DTYPE_t[:] X_data
const SPARSE_INDEX_TYPE_t[:] X_indices
const SPARSE_INDEX_TYPE_t[:] X_indptr

const DTYPE_t[:] Y_data
const SPARSE_INDEX_TYPE_t[:] Y_indices
ITYPE_t n_Y


cdef class DenseSparseDatasetsPair(DatasetsPair):
cdef:
# As distance metrics are commutative, we can simply rely
# on the implementation of SparseDenseDatasetsPair and
# swap arguments.
DatasetsPair datasets_pair
220 changes: 209 additions & 11 deletions sklearn/metrics/_pairwise_distances_reduction/_datasets_pair.pyx
@@ -2,11 +2,14 @@ import numpy as np
cimport numpy as cnp

from cython cimport final
from scipy.sparse import issparse
from scipy.sparse import issparse, csr_matrix

from ...utils._typedefs cimport DTYPE_t, ITYPE_t
from ...metrics._dist_metrics cimport DistanceMetric

from ...utils._typedefs import DTYPE, SPARSE_INDEX_TYPE


cnp.import_array()

cdef class DatasetsPair:
@@ -91,14 +94,32 @@ cdef class DatasetsPair:
distance_metric._validate_data(X)
distance_metric._validate_data(Y)

# TODO: dispatch to other dataset pairs for sparse support once available:
if issparse(X) or issparse(Y):
raise ValueError("Only dense datasets are supported for X and Y.")
X_is_sparse = issparse(X)
Y_is_sparse = issparse(Y)

if not X_is_sparse and not Y_is_sparse:
return DenseDenseDatasetsPair(X, Y, distance_metric)

return DenseDenseDatasetsPair(X, Y, distance_metric)
if X_is_sparse and Y_is_sparse:
return SparseSparseDatasetsPair(X, Y, distance_metric)

def __init__(self, DistanceMetric distance_metric):
if X_is_sparse and not Y_is_sparse:
return SparseDenseDatasetsPair(X, Y, distance_metric)

return DenseSparseDatasetsPair(X, Y, distance_metric)

@classmethod
def unpack_csr_matrix(cls, X: csr_matrix):
"""Ensure that the CSR matrix is indexed with SPARSE_INDEX_TYPE."""
# TODO: leave X.data unchanged once float32 is supported.
X_data = np.asarray(X.data, dtype=DTYPE)
X_indices = np.asarray(X.indices, dtype=SPARSE_INDEX_TYPE)
X_indptr = np.asarray(X.indptr, dtype=SPARSE_INDEX_TYPE)
return X_data, X_indices, X_indptr

def __init__(self, DistanceMetric distance_metric, ITYPE_t n_features):
self.distance_metric = distance_metric
self.n_features = n_features

cdef ITYPE_t n_samples_X(self) nogil:
"""Number of samples in X."""
@@ -140,12 +161,16 @@ cdef class DenseDenseDatasetsPair(DatasetsPair):
between two row vectors of (X, Y).
"""

def __init__(self, X, Y, DistanceMetric distance_metric):
super().__init__(distance_metric)
def __init__(
self,
const DTYPE_t[:, ::1] X,
const DTYPE_t[:, ::1] Y,
DistanceMetric distance_metric,
):
super().__init__(distance_metric, n_features=X.shape[1])
# Arrays have already been checked
self.X = X
self.Y = Y
self.d = X.shape[1]

@final
cdef ITYPE_t n_samples_X(self) nogil:
@@ -157,8 +182,181 @@ cdef class DenseDenseDatasetsPair(DatasetsPair):

@final
cdef DTYPE_t surrogate_dist(self, ITYPE_t i, ITYPE_t j) nogil:
return self.distance_metric.rdist(&self.X[i, 0], &self.Y[j, 0], self.d)
return self.distance_metric.rdist(&self.X[i, 0], &self.Y[j, 0], self.n_features)

@final
cdef DTYPE_t dist(self, ITYPE_t i, ITYPE_t j) nogil:
return self.distance_metric.dist(&self.X[i, 0], &self.Y[j, 0], self.n_features)


@final
cdef class SparseSparseDatasetsPair(DatasetsPair):
"""Compute distances between vectors of two CSR matrices.

Parameters
----------
X: sparse matrix of shape (n_samples_X, n_features)
Rows represent vectors. Must be in CSR format.

Y: sparse matrix of shape (n_samples_Y, n_features)
Rows represent vectors. Must be in CSR format.

distance_metric: DistanceMetric
The distance metric responsible for computing distances
between two vectors of (X, Y).
"""

def __init__(self, X, Y, DistanceMetric distance_metric):
super().__init__(distance_metric, n_features=X.shape[1])

self.X_data, self.X_indices, self.X_indptr = self.unpack_csr_matrix(X)
self.Y_data, self.Y_indices, self.Y_indptr = self.unpack_csr_matrix(Y)

@final
cdef ITYPE_t n_samples_X(self) nogil:
return self.X_indptr.shape[0] - 1

@final
cdef ITYPE_t n_samples_Y(self) nogil:
return self.Y_indptr.shape[0] - 1

@final
cdef DTYPE_t surrogate_dist(self, ITYPE_t i, ITYPE_t j) nogil:
return self.distance_metric.rdist_csr(
x1_data=self.X_data,
x1_indices=self.X_indices,
x2_data=self.Y_data,
x2_indices=self.Y_indices,
x1_start=self.X_indptr[i],
x1_end=self.X_indptr[i + 1],
x2_start=self.Y_indptr[j],
x2_end=self.Y_indptr[j + 1],
size=self.n_features,
)

@final
cdef DTYPE_t dist(self, ITYPE_t i, ITYPE_t j) nogil:
return self.distance_metric.dist_csr(
x1_data=self.X_data,
x1_indices=self.X_indices,
x2_data=self.Y_data,
x2_indices=self.Y_indices,
x1_start=self.X_indptr[i],
x1_end=self.X_indptr[i + 1],
x2_start=self.Y_indptr[j],
x2_end=self.Y_indptr[j + 1],
size=self.n_features,
)
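
As a reading aid for the `x1_start`/`x1_end` arguments above: in CSR, row `i` occupies positions `indptr[i]` up to (but excluding) `indptr[i + 1]` of `data` and `indices`, which is also why the number of rows is `indptr.shape[0] - 1`. The pure-Python sketch below is not the Cython `rdist_csr`. `sq_euclidean_csr` and `rdist` are hypothetical helpers that mirror its keyword arguments for the squared Euclidean case only, and they assume column indices are sorted within each row.

```python
def sq_euclidean_csr(x1_data, x1_indices, x2_data, x2_indices,
                     x1_start, x1_end, x2_start, x2_end):
    """Squared Euclidean distance between two CSR rows (sorted indices)."""
    i1, i2, total = x1_start, x2_start, 0.0
    # Merge the two sorted index ranges, as in a merge sort.
    while i1 < x1_end and i2 < x2_end:
        c1, c2 = x1_indices[i1], x2_indices[i2]
        if c1 == c2:      # both rows store a value in this column
            total += (x1_data[i1] - x2_data[i2]) ** 2
            i1 += 1
            i2 += 1
        elif c1 < c2:     # only the first row stores a value here
            total += x1_data[i1] ** 2
            i1 += 1
        else:             # only the second row stores a value here
            total += x2_data[i2] ** 2
            i2 += 1
    # Entries left over in either row face implicit zeros.
    total += sum(v * v for v in x1_data[i1:x1_end])
    total += sum(v * v for v in x2_data[i2:x2_end])
    return total


# X = [[1, 0, 2], [0, 0, 3]] and Y = [[0, 4, 1]] in CSR form.
X_data, X_indices, X_indptr = [1.0, 2.0, 3.0], [0, 2, 2], [0, 2, 3]
Y_data, Y_indices, Y_indptr = [4.0, 1.0], [1, 2], [0, 2]

n_samples_X = len(X_indptr) - 1  # same rule as n_samples_X() above


def rdist(i, j):
    """Slice row i of X and row j of Y through their indptr arrays."""
    return sq_euclidean_csr(
        X_data, X_indices, Y_data, Y_indices,
        X_indptr[i], X_indptr[i + 1], Y_indptr[j], Y_indptr[j + 1],
    )


print(n_samples_X, rdist(0, 0), rdist(1, 0))
```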


@final
cdef class SparseDenseDatasetsPair(DatasetsPair):
"""Compute distances between vectors of a CSR matrix and a dense array.

Parameters
----------
X: sparse matrix of shape (n_samples_X, n_features)
Rows represent vectors. Must be in CSR format.

Y: ndarray of shape (n_samples_Y, n_features)
Rows represent vectors. Must be C-contiguous.

distance_metric: DistanceMetric
The distance metric responsible for computing distances
between two vectors of (X, Y).
"""

def __init__(self, X, Y, DistanceMetric distance_metric):
super().__init__(distance_metric, n_features=X.shape[1])

self.X_data, self.X_indices, self.X_indptr = self.unpack_csr_matrix(X)

# Y array already has been checked here
self.n_Y = Y.shape[0]
self.Y_data = np.ravel(Y)

# Since Y vectors are dense, we can use a single array
# of indices of self.n_features elements instead of
# a self.n_Y × self.n_features matrix.
# The implementations of DistanceMetric.{dist_csr,rdist_csr}
# support this representation.
self.Y_indices = np.arange(self.n_features, dtype=SPARSE_INDEX_TYPE)

@final
cdef ITYPE_t n_samples_X(self) nogil:
return self.X_indptr.shape[0] - 1

@final
cdef ITYPE_t n_samples_Y(self) nogil:
return self.n_Y

@final
cdef DTYPE_t surrogate_dist(self, ITYPE_t i, ITYPE_t j) nogil:
return self.distance_metric.rdist_csr(
x1_data=self.X_data,
x1_indices=self.X_indices,
x2_data=self.Y_data,
x2_indices=self.Y_indices,
x1_start=self.X_indptr[i],
x1_end=self.X_indptr[i + 1],
x2_start=j * self.n_features,
x2_end=(j + 1) * self.n_features,
size=self.n_features,
)

@final
cdef DTYPE_t dist(self, ITYPE_t i, ITYPE_t j) nogil:
return self.distance_metric.dist_csr(
x1_data=self.X_data,
x1_indices=self.X_indices,
x2_data=self.Y_data,
x2_indices=self.Y_indices,
x1_start=self.X_indptr[i],
x1_end=self.X_indptr[i + 1],
x2_start=j * self.n_features,
x2_end=(j + 1) * self.n_features,
size=self.n_features,
)
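
The shared-indices representation used by `SparseDenseDatasetsPair` can be illustrated in pure Python. This is a sketch under stated assumptions: how `dist_csr`/`rdist_csr` map a position in the raveled `Y_data` back to the single `Y_indices` array is not visible in this diff, so the wrap-around lookup (`k % size`) below is an assumption, and `sq_euclidean_sparse_dense` is a hypothetical helper covering only the squared Euclidean case with sorted column indices in X.

```python
def sq_euclidean_sparse_dense(x1_data, x1_indices, x2_data, x2_indices,
                              x1_start, x1_end, x2_start, x2_end, size):
    """Squared Euclidean distance between a CSR row and a raveled dense row."""
    total = 0.0
    i1 = x1_start
    for k in range(x2_start, x2_end):
        col = x2_indices[k % size]  # assumed wrap-around lookup
        if i1 < x1_end and x1_indices[i1] == col:
            total += (x1_data[i1] - x2_data[k]) ** 2
            i1 += 1
        else:
            # X has an implicit zero in this column.
            total += x2_data[k] ** 2
    return total


# X = [[1, 0, 2], [0, 0, 3]] in CSR form; Y is dense.
X_data, X_indices, X_indptr = [1.0, 2.0, 3.0], [0, 2, 2], [0, 2, 3]
Y = [[0.0, 4.0, 1.0], [5.0, 0.0, 0.0]]

n_features = 3
Y_data = [v for row in Y for v in row]  # plays the role of np.ravel(Y)
Y_indices = list(range(n_features))     # one index array shared by all rows


def rdist(i, j):
    return sq_euclidean_sparse_dense(
        X_data, X_indices, Y_data, Y_indices,
        X_indptr[i], X_indptr[i + 1],
        j * n_features, (j + 1) * n_features,
        n_features,
    )


print([[rdist(i, j) for j in range(len(Y))] for i in range(2)])
```

The point of the representation is memory: `Y_indices` holds `n_features` entries rather than `n_Y * n_features`, while `Y_data` is just a flat view of the dense array.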


@final
cdef class DenseSparseDatasetsPair(DatasetsPair):
"""Compute distances between vectors of a dense array and a CSR matrix.

Parameters
----------
X: ndarray of shape (n_samples_X, n_features)
Rows represent vectors. Must be C-contiguous.

Y: sparse matrix of shape (n_samples_Y, n_features)
Rows represent vectors. Must be in CSR format.

distance_metric: DistanceMetric
The distance metric responsible for computing distances
between two vectors of (X, Y).
"""

def __init__(self, X, Y, DistanceMetric distance_metric):
super().__init__(distance_metric, n_features=X.shape[1])
# Swapping arguments on the constructor
self.datasets_pair = SparseDenseDatasetsPair(Y, X, distance_metric)

@final
cdef ITYPE_t n_samples_X(self) nogil:
# Swapping interface
return self.datasets_pair.n_samples_Y()

@final
cdef ITYPE_t n_samples_Y(self) nogil:
# Swapping interface
return self.datasets_pair.n_samples_X()

@final
cdef DTYPE_t surrogate_dist(self, ITYPE_t i, ITYPE_t j) nogil:
# Swapping arguments on the same interface
return self.datasets_pair.surrogate_dist(j, i)

@final
cdef DTYPE_t dist(self, ITYPE_t i, ITYPE_t j) nogil:
return self.distance_metric.dist(&self.X[i, 0], &self.Y[j, 0], self.d)
# Swapping arguments on the same interface
return self.datasets_pair.dist(j, i)
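
The argument swap in `DenseSparseDatasetsPair` relies on the metric being symmetric, i.e. d(x, y) = d(y, x). A minimal pure-Python sketch of the same adapter idea follows. Both classes are hypothetical stand-ins, not the Cython classes from this diff: sparse rows are modelled as `{column: value}` dicts and the metric is hard-coded to the Euclidean distance.

```python
class SparseDensePair:
    """Hypothetical stand-in: X rows are {col: value} dicts, Y rows are lists."""

    def __init__(self, X_sparse, Y_dense):
        self.X = X_sparse
        self.Y = Y_dense

    def n_samples_X(self):
        return len(self.X)

    def n_samples_Y(self):
        return len(self.Y)

    def dist(self, i, j):
        x, y = self.X[i], self.Y[j]
        return sum((x.get(c, 0.0) - v) ** 2 for c, v in enumerate(y)) ** 0.5


class DenseSparsePair:
    """Adapter: delegates to SparseDensePair with the roles of X and Y swapped."""

    def __init__(self, X_dense, Y_sparse):
        self._inner = SparseDensePair(Y_sparse, X_dense)

    def n_samples_X(self):
        return self._inner.n_samples_Y()

    def n_samples_Y(self):
        return self._inner.n_samples_X()

    def dist(self, i, j):
        # Valid only because the metric is symmetric.
        return self._inner.dist(j, i)


X_dense = [[0.0, 3.0], [1.0, 1.0], [4.0, 0.0]]
Y_sparse = [{0: 4.0}, {1: 3.0}]  # rows [4, 0] and [0, 3]

pair = DenseSparsePair(X_dense, Y_sparse)
print(pair.n_samples_X(), pair.n_samples_Y(), pair.dist(0, 0), pair.dist(0, 1))
```

The adapter keeps a single implementation of the sparse-dense arithmetic at the cost of one extra level of indirection per distance evaluation.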
3 changes: 0 additions & 3 deletions sklearn/metrics/_pairwise_distances_reduction/_dispatcher.py
@@ -3,7 +3,6 @@
import numpy as np

from typing import List
from scipy.sparse import issparse
from .._dist_metrics import BOOL_METRICS, METRIC_MAPPING

from ._base import _sqeuclidean_row_norms64
@@ -82,8 +81,6 @@ def is_usable_for(cls, X, Y, metric) -> bool:
dtypes_validity = X.dtype == Y.dtype == np.float64
return (
get_config().get("enable_cython_pairwise_dist", True)
and not issparse(X)
and not issparse(Y)
and dtypes_validity
and metric in cls.valid_metrics()
)