FEA Fused sparse-dense support for PairwiseDistancesReduction #23585

Merged: 83 commits, Sep 20, 2022
Changes from 75 commits
b8bd875 MAINT Implement CSR support for all DistanceMetric (jjerphan, Jun 11, 2022)
7b07188 Merge branch 'main' into maint/dist-metrics-csr-support (jjerphan, Jun 14, 2022)
fb99680 TST Remove useless guard (jjerphan, Jun 15, 2022)
d39d2b2 TST Skip JaccardDistance on 32bit architecture (jjerphan, Jun 15, 2022)
011e2a2 MAINT Define dtype alias for sparse matrices indices (jjerphan, Jun 16, 2022)
a579630 MAINT Do not shadow dtype names in Tempita templating (jjerphan, Jun 16, 2022)
98e9d21 fixup! MAINT Define dtype alias for sparse matrices indices (jjerphan, Jun 16, 2022)
8aa4e44 TST Use cdist and pdist appropriately (jjerphan, Jun 16, 2022)
9edfa11 DOC Improve comments (jjerphan, Jun 17, 2022)
ee5c6bf Fixups (jjerphan, Jun 17, 2022)
bf5eb59 MAINT Wrap of indptr values to support sparse-dense (jjerphan, Jun 17, 2022)
92b8a6c Apply review comments (jjerphan, Jun 17, 2022)
dc6f8cf More interesting boolean data for tests (ogrisel, Jun 17, 2022)
bb06f59 FIX Various corrections (jjerphan, Jun 17, 2022)
a5eb20d FIX Make Jaccard, Hamming and Hashing robust to explicit zeros (jjerphan, Jun 17, 2022)
19edf11 FIX Make the other boolean DistanceMetric also robust to explicit zeros (jjerphan, Jun 17, 2022)
de86802 TST Remove xfail for Jaccard on 32bit arch. (jjerphan, Jun 17, 2022)
bb920cf Cast to np.float64_t where appropriate (jjerphan, Jun 17, 2022)
b3759fe Rename methods and correctly format their signatures (jjerphan, Jun 20, 2022)
7f89236 fixup! TST Remove xfail for Jaccard on 32bit arch. (jjerphan, Jun 20, 2022)
01a0c33 FEA CSR support for HaversineDistance (jjerphan, Jun 20, 2022)
7d8a717 Fix typo (jjerphan, Jun 22, 2022)
563e359 Do not upcast to 64bit yet keep the same precision (jjerphan, Jun 22, 2022)
f863a51 Do use the default rtol (jjerphan, Jun 22, 2022)
5ba0fbe Set rtol explicitly in test_distance_metrics_dtype_consistency (ogrisel, Jun 22, 2022)
4f45839 Implement the sparse-dense and the dense-sparse case for c-contiguity (jjerphan, Jun 23, 2022)
3e3e888 Add validation on X and Y, accepting CSR as inputs (jjerphan, Jun 23, 2022)
ddc49d5 Remove left-overs (jjerphan, Jun 23, 2022)
a83887c Merge branch 'main' into maint/dist-metrics-csr-support (jjerphan, Jun 24, 2022)
e8bb70a Merge branch 'main' into maint/pdr-sparse-support (jjerphan, Jul 2, 2022)
dec0aa8 Add support for all combinations of {dense,sparse} datasets pairs (jjerphan, Jul 2, 2022)
63c6fe3 Const-qualify X and Y (jjerphan, Jul 5, 2022)
0bb368f Merge branch 'main' into maint/pdr-sparse-support (jjerphan, Jul 5, 2022)
30e84af Only pass Y_norm_squared for the dense-dense case (jjerphan, Jul 5, 2022)
780d7bb Update comments (jjerphan, Jul 6, 2022)
72f4ae7 Pop unused keywords arguments (jjerphan, Jul 19, 2022)
a1ce042 Remove unused import (jjerphan, Jul 19, 2022)
713b932 DOC Add whats_new entry (jjerphan, Jul 19, 2022)
ac6208d Merge branch 'main' into maint/pdr-sparse-support (jjerphan, Jul 19, 2022)
5570f71 fixup! Pop unused keywords arguments (jjerphan, Jul 19, 2022)
4c455ea DOC Update comment and changelog (jjerphan, Jul 20, 2022)
0f0ea70 MAINT Test second alternative for sparse-dense support (jjerphan, Jul 21, 2022)
80b8c02 Merge branch 'main' into maint/pdr-sparse-support (jjerphan, Aug 8, 2022)
a3cf4d8 DOC Update comment and changelog (jjerphan, Jul 20, 2022)
e9da38b Merge branch 'maint/pdr-sparse-support' into alt/feat/pdr-sparse-support (jjerphan, Aug 9, 2022)
e9ecbbc DOC Improve comments and code self-documentation (jjerphan, Aug 10, 2022)
c8bacc6 Merge branch 'main' into maint/pdr-sparse-support (jjerphan, Aug 10, 2022)
180a54e Merge branch 'maint/pdr-sparse-support' into alt/feat/pdr-sparse-support (jjerphan, Aug 10, 2022)
de52371 TST `dtype`-parametrize `test_format_agnosticism` (jjerphan, Aug 11, 2022)
3243153 fixup! TST `dtype`-parametrize `test_format_agnosticism` (jjerphan, Aug 11, 2022)
e521992 MNT Pushing data up instead of indices (thomasjpfan, Aug 11, 2022)
972fff9 DOC Improve comment (thomasjpfan, Aug 11, 2022)
3086c0b REV Revert back to memoryviews for indices (thomasjpfan, Aug 11, 2022)
afa0c35 DOC Spelling (thomasjpfan, Aug 11, 2022)
bc78747 Merge pull request #16 from thomasjpfan/alt/feat/pdr-sparse-support (jjerphan, Aug 11, 2022)
bd48ef0 Merge pull request #15 from jjerphan/alt/feat/pdr-sparse-support (jjerphan, Aug 11, 2022)
be59297 TST Suggest logic adaptation for _pairwise_{dense_sparse,sparse_dense} (jjerphan, Aug 22, 2022)
f8ab496 Merge branch 'main' into maint/pdr-sparse-support (jjerphan, Aug 25, 2022)
ec1d4f9 DOC Add co-authors in `whats_new` entry (jjerphan, Aug 25, 2022)
fbf311e Do not support CSR matrices without non-zero elements (jjerphan, Aug 26, 2022)
511c6e6 Merge branch 'main' into maint/pdr-sparse-support (jjerphan, Aug 28, 2022)
8fddffd fixup! Merge branch 'main' into maint/pdr-sparse-support (jjerphan, Aug 28, 2022)
4b879f1 MAINT Do not pop Y_norm_squared when unused (jjerphan, Aug 29, 2022)
0b1ce13 Merge branch 'main' into maint/pdr-sparse-support (jjerphan, Sep 9, 2022)
ca49236 Explicitly do not support CSR matrices with int64 indices and indptr (jjerphan, Sep 9, 2022)
d7b3649 DOC Update and improve comment for the alternative CSR representation (jjerphan, Sep 9, 2022)
5e13663 fixup! DOC Update and improve comment for the alternative CSR represe… (jjerphan, Sep 9, 2022)
1de8acb CI Retrigger CI due to faulty runs (jjerphan, Sep 9, 2022)
7766388 Merge branch 'main' into maint/pdr-sparse-support (jjerphan, Sep 12, 2022)
8f43a5a DOC Update whats_new entry (jjerphan, Sep 12, 2022)
a229b35 Test and document Isomap on sparse data (jjerphan, Sep 12, 2022)
3e357a2 Test and document TSNE on sparse data (jjerphan, Sep 12, 2022)
9723115 Test and document pairwise_distances_argmin on sparse data (jjerphan, Sep 12, 2022)
faf704a Test and document LocalOutlierFactor on sparse data (jjerphan, Sep 12, 2022)
c66bb82 DOC Add support for sparse data for NearestNeighbors, KNeighbors*, Ra… (jjerphan, Sep 12, 2022)
1eb5b2c DOC Remove formatting change (jjerphan, Sep 12, 2022)
fcf15b6 TST Do not test on full cartesian product (jjerphan, Sep 15, 2022)
58453d7 fixup! TST Do not test on full cartesian product (jjerphan, Sep 15, 2022)
63fda8c TST Add TODO for consistency checks on results for sparse and dense data (jjerphan, Sep 15, 2022)
1d7bcc7 MAINT Mark PairwiseDistancesReductions as unusable for some config. (jjerphan, Sep 15, 2022)
fec55bf fixup! MAINT Mark PairwiseDistancesReductions as unusable for some co… (jjerphan, Sep 15, 2022)
d55bcec TST Improve test_format_agnosticism (jjerphan, Sep 20, 2022)
c21187a DOC Update comment regarding the use of pairwise_distances_chunked (jjerphan, Sep 20, 2022)
25 changes: 24 additions & 1 deletion doc/whats_new/v1.2.rst
@@ -51,6 +51,29 @@ Changes impacting all modules
second-pass algorithm.
:pr:`23197` by :user:`Meekail Zain <micky774>`

+- |Enhancement| Support for combinations of dense and sparse datasets pairs
+  for all distance metrics and for float32 and float64 datasets has been added
+  or has seen its performance improved for the following estimators:
+
+  - :func:`sklearn.metrics.pairwise_distances_argmin`
+  - :func:`sklearn.metrics.pairwise_distances_argmin_min`
+  - :class:`sklearn.cluster.AffinityPropagation`
+  - :class:`sklearn.cluster.Birch`
+  - :class:`sklearn.cluster.SpectralClustering`
+  - :class:`sklearn.neighbors.KNeighborsClassifier`
+  - :class:`sklearn.neighbors.KNeighborsRegressor`
+  - :class:`sklearn.neighbors.RadiusNeighborsClassifier`
+  - :class:`sklearn.neighbors.RadiusNeighborsRegressor`
+  - :class:`sklearn.neighbors.LocalOutlierFactor`
+  - :class:`sklearn.neighbors.NearestNeighbors`
+  - :class:`sklearn.manifold.Isomap`
+  - :class:`sklearn.manifold.TSNE`
+  - :func:`sklearn.manifold.trustworthiness`
+
+  :pr:`23604` and :pr:`23585` by :user:`Julien Jerphanion <jjerphan>`,
+  :user:`Olivier Grisel <ogrisel>`, and `Thomas Fan`_.
+
+
Changelog
---------

@@ -305,7 +328,7 @@ Changelog
- |Fix| Allows `csr_matrix` as input for parameter: `y_true` of
the :func:`metrics.label_ranking_average_precision_score` metric.
:pr:`23442` by :user:`Sean Atukorala <ShehanAT>`

- |Fix| :func:`metrics.ndcg_score` will now trigger a warning when the `y_true`
value contains a negative value. Users may still use negative values, but the
result may not be between 0 and 1. Starting in v1.4, passing in negative
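
To illustrate the changelog entry above, here is a minimal sketch of a mixed sparse/dense call to `pairwise_distances_argmin` and `pairwise_distances_argmin_min`. The data, the shapes and the choice of the `"manhattan"` metric are assumptions made for illustration only; they are not taken from the pull request, and which internal code path handles the call depends on the installed scikit-learn version.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics import pairwise_distances_argmin, pairwise_distances_argmin_min

# Illustrative data (assumption): a mostly-zero X and a fully dense Y.
rng = np.random.RandomState(0)
X_dense = rng.rand(20, 5)
X_dense[X_dense < 0.6] = 0.0
Y_dense = rng.rand(8, 5)

# CSR is the sparse format the new dataset pairs are declared for.
X_sparse = csr_matrix(X_dense)

# Sparse X against dense Y: index of the closest row of Y for each row of X.
idx_mixed = pairwise_distances_argmin(X_sparse, Y_dense, metric="manhattan")

# Same query on the dense copy of X, used here as a reference.
idx_dense = pairwise_distances_argmin(X_dense, Y_dense, metric="manhattan")

# Variant that also returns the corresponding minimal distances.
idx_min, dist_min = pairwise_distances_argmin_min(
    X_sparse, Y_dense, metric="manhattan"
)
```

Both calls should agree on the nearest rows regardless of the storage format of `X`, which is the format-agnostic behaviour the entry describes.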
8 changes: 4 additions & 4 deletions sklearn/manifold/_isomap.py
@@ -330,9 +330,9 @@ def fit(self, X, y=None):

Parameters
----------
-X : {array-like, sparse graph, BallTree, KDTree, NearestNeighbors}
+X : {array-like, sparse matrix, BallTree, KDTree, NearestNeighbors}
Sample data, shape = (n_samples, n_features), in the form of a
-numpy array, sparse graph, precomputed tree, or NearestNeighbors
+numpy array, sparse matrix, precomputed tree, or NearestNeighbors
object.

y : Ignored
@@ -352,7 +352,7 @@ def fit_transform(self, X, y=None):

Parameters
----------
-X : {array-like, sparse graph, BallTree, KDTree}
+X : {array-like, sparse matrix, BallTree, KDTree}
Training vector, where `n_samples` is the number of samples
and `n_features` is the number of features.

@@ -381,7 +381,7 @@ def transform(self, X):

Parameters
----------
-X : array-like, shape (n_queries, n_features)
+X : {array-like, sparse matrix}, shape (n_queries, n_features)
If neighbors_algorithm='precomputed', X is assumed to be a
distance matrix or a sparse graph of shape
(n_queries, n_samples_fit).
11 changes: 7 additions & 4 deletions sklearn/manifold/_t_sne.py
@@ -461,11 +461,12 @@ def trustworthiness(X, X_embedded, *, n_neighbors=5, metric="euclidean"):

Parameters
----------
-X : ndarray of shape (n_samples, n_features) or (n_samples, n_samples)
+X : {array-like, sparse matrix} of shape (n_samples, n_features) or \
+    (n_samples, n_samples)
If the metric is 'precomputed' X must be a square distance
matrix. Otherwise it contains a sample per row.

-X_embedded : ndarray of shape (n_samples, n_components)
+X_embedded : {array-like, sparse matrix} of shape (n_samples, n_components)
Embedding of the training data in low-dimensional space.

n_neighbors : int, default=5
@@ -1095,7 +1096,8 @@ def fit_transform(self, X, y=None):

Parameters
----------
-X : ndarray of shape (n_samples, n_features) or (n_samples, n_samples)
+X : {array-like, sparse matrix} of shape (n_samples, n_features) or \
+    (n_samples, n_samples)
If the metric is 'precomputed' X must be a square distance
matrix. Otherwise it contains a sample per row. If the method
is 'exact', X may be a sparse matrix of type 'csr', 'csc'
@@ -1121,7 +1123,8 @@ def fit(self, X, y=None):

Parameters
----------
-X : ndarray of shape (n_samples, n_features) or (n_samples, n_samples)
+X : {array-like, sparse matrix} of shape (n_samples, n_features) or \
+    (n_samples, n_samples)
If the metric is 'precomputed' X must be a square distance
matrix. Otherwise it contains a sample per row. If the method
is 'exact', X may be a sparse matrix of type 'csr', 'csc'
22 changes: 11 additions & 11 deletions sklearn/manifold/tests/test_isomap.py
@@ -216,19 +216,19 @@ def test_isomap_clone_bug():
assert model.nbrs_.n_neighbors == n_neighbors


-def test_sparse_input():
+@pytest.mark.parametrize("eigen_solver", eigen_solvers)
+@pytest.mark.parametrize("path_method", path_methods)
+def test_sparse_input(eigen_solver, path_method):
     X = sparse_rand(100, 3, density=0.1, format="csr")

-    # Should not error
-    for eigen_solver in eigen_solvers:
-        for path_method in path_methods:
-            clf = manifold.Isomap(
-                n_components=2,
-                eigen_solver=eigen_solver,
-                path_method=path_method,
-                n_neighbors=8,
-            )
-            clf.fit(X)
+    clf = manifold.Isomap(
+        n_components=2,
+        eigen_solver=eigen_solver,
+        path_method=path_method,
+        n_neighbors=8,
+    )
+    clf.fit(X)
+    clf.transform(X)


def test_isomap_fit_precomputed_radius_graph():
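
The updated test above now calls `transform` on sparse input in addition to `fit`. A standalone sketch of the same usage follows; it assumes a scikit-learn release that includes this change (1.2 or later), and the denser random matrix used here is an illustrative substitute for the test's `sparse_rand(100, 3, density=0.1)` data.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.manifold import Isomap

# Illustrative data (assumption): 100 samples, 10 features, about half zeros.
rng = np.random.RandomState(0)
dense = rng.rand(100, 10)
dense[dense < 0.5] = 0.0
X = csr_matrix(dense)

# Same hyper-parameters as in the test; the solver and path method are left
# at their defaults here instead of being parametrized.
isomap = Isomap(n_components=2, n_neighbors=8)
isomap.fit(X)

# Projecting sparse queries is the call added to the test by this change.
X_embedded = isomap.transform(X)
```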
2 changes: 1 addition & 1 deletion sklearn/manifold/tests/test_t_sne.py
@@ -329,7 +329,7 @@ def test_optimization_minimizes_kl_divergence():


@pytest.mark.parametrize("method", ["exact", "barnes_hut"])
-def test_fit_csr_matrix(method):
+def test_fit_transform_csr_matrix(method):
# X can be a sparse matrix.
rng = check_random_state(0)
X = rng.randn(50, 2)
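
Only the first lines of this test are visible in the hunk. As a rough sketch of what fitting `TSNE` on a CSR matrix and scoring the embedding with `trustworthiness` can look like, the example below fills in the remaining steps with assumed values: the zeroed entries, `perplexity=10.0`, `init="random"` and `n_neighbors=1` are illustrative choices, not a copy of the elided test body. `init="random"` is used because PCA initialization is not available for sparse input.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.manifold import TSNE, trustworthiness

rng = np.random.RandomState(0)
X = rng.randn(50, 2)
# Zero out some entries so that the CSR matrix is actually sparse (assumption).
X[rng.randint(0, 50, 25), rng.randint(0, 2, 25)] = 0.0
X_csr = csr_matrix(X)

# Hyper-parameters are illustrative; both "exact" and "barnes_hut" are
# parametrized in the test, only the default "barnes_hut" is shown here.
tsne = TSNE(
    n_components=2,
    init="random",
    perplexity=10.0,
    method="barnes_hut",
    random_state=0,
)
X_embedded = tsne.fit_transform(X_csr)

# trustworthiness is documented above as accepting sparse X as well.
score = trustworthiness(X_csr, X_embedded, n_neighbors=1)
```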
@@ -5,8 +5,7 @@ implementation_specific_values = [
#
# name_suffix, DistanceMetric, INPUT_DTYPE_t, INPUT_DTYPE
#
-# We also use the float64 dtype and C-type names as defined in
-# `sklearn.utils._typedefs` to maintain consistency.
+# We use DistanceMetric for float64 for backward naming compatibility.
#
('64', 'DistanceMetric', 'DTYPE_t', 'DTYPE'),
('32', 'DistanceMetric32', 'cnp.float32_t', 'np.float32')
@@ -15,14 +14,16 @@ implementation_specific_values = [
}}
cimport numpy as cnp

-from ...utils._typedefs cimport DTYPE_t, ITYPE_t
+from ...utils._typedefs cimport DTYPE_t, ITYPE_t, SPARSE_INDEX_TYPE_t
from ...metrics._dist_metrics cimport DistanceMetric, DistanceMetric32

{{for name_suffix, DistanceMetric, INPUT_DTYPE_t, INPUT_DTYPE in implementation_specific_values}}


 cdef class DatasetsPair{{name_suffix}}:
-    cdef {{DistanceMetric}} distance_metric
+    cdef:
+        {{DistanceMetric}} distance_metric
+        ITYPE_t n_features

     cdef ITYPE_t n_samples_X(self) nogil

@@ -37,5 +38,35 @@ cdef class DenseDenseDatasetsPair{{name_suffix}}(DatasetsPair{{name_suffix}}):
     cdef:
         const {{INPUT_DTYPE_t}}[:, ::1] X
         const {{INPUT_DTYPE_t}}[:, ::1] Y
-        ITYPE_t d
+
+
+cdef class SparseSparseDatasetsPair{{name_suffix}}(DatasetsPair{{name_suffix}}):
+    cdef:
+        const {{INPUT_DTYPE_t}}[:] X_data
+        const SPARSE_INDEX_TYPE_t[:] X_indices
+        const SPARSE_INDEX_TYPE_t[:] X_indptr
+
+        const {{INPUT_DTYPE_t}}[:] Y_data
+        const SPARSE_INDEX_TYPE_t[:] Y_indices
+        const SPARSE_INDEX_TYPE_t[:] Y_indptr
+
+
+cdef class SparseDenseDatasetsPair{{name_suffix}}(DatasetsPair{{name_suffix}}):
+    cdef:
+        const {{INPUT_DTYPE_t}}[:] X_data
+        const SPARSE_INDEX_TYPE_t[:] X_indices
+        const SPARSE_INDEX_TYPE_t[:] X_indptr
+
+        const {{INPUT_DTYPE_t}}[:] Y_data
+        const SPARSE_INDEX_TYPE_t[:] Y_indices
+        ITYPE_t n_Y
+
+
+cdef class DenseSparseDatasetsPair{{name_suffix}}(DatasetsPair{{name_suffix}}):
+    cdef:
+        # As distance metrics are commutative, we can simply rely
+        # on the implementation of SparseDenseDatasetsPair and
+        # swap arguments.
+        DatasetsPair{{name_suffix}} datasets_pair

{{endfor}}
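
The comment in `DenseSparseDatasetsPair` says that the dense-sparse case can reuse the sparse-dense implementation by swapping arguments, since d(x, y) = d(y, x). The pure-Python sketch below illustrates that delegation idea only. The class names `SparseDensePair` and `DenseSparsePair`, the Euclidean metric and the plain dense `Y` array are hypothetical simplifications; the Cython declarations above store `Y_data`, `Y_indices` and `n_Y` instead, and this is not the pull request's implementation.

```python
import numpy as np
from scipy.sparse import csr_matrix


class SparseDensePair:
    """Hypothetical stand-in: X is stored as CSR arrays, Y as a dense 2D array."""

    def __init__(self, X_csr, Y):
        self.X_data = X_csr.data
        self.X_indices = X_csr.indices
        self.X_indptr = X_csr.indptr
        self.Y = np.asarray(Y, dtype=np.float64)
        self.n_features = self.Y.shape[1]

    def n_samples_X(self):
        return self.X_indptr.shape[0] - 1

    def n_samples_Y(self):
        return self.Y.shape[0]

    def dist(self, i, j):
        # Scatter the non-zeros of row i of X into a dense buffer, then take
        # the Euclidean distance to row j of Y.
        start, end = self.X_indptr[i], self.X_indptr[i + 1]
        x_row = np.zeros(self.n_features)
        x_row[self.X_indices[start:end]] = self.X_data[start:end]
        return float(np.sqrt(np.sum((x_row - self.Y[j]) ** 2)))


class DenseSparsePair:
    """Hypothetical wrapper: delegates to SparseDensePair with swapped roles."""

    def __init__(self, X, Y_csr):
        # The sparse operand becomes "X" of the wrapped pair.
        self.pair = SparseDensePair(Y_csr, X)

    def n_samples_X(self):
        return self.pair.n_samples_Y()

    def n_samples_Y(self):
        return self.pair.n_samples_X()

    def dist(self, i, j):
        # d(X[i], Y[j]) == d(Y[j], X[i]) because the metric is symmetric.
        return self.pair.dist(j, i)


X = np.array([[0.0, 1.0, 0.0], [2.0, 0.0, 3.0]])
Y = np.array([[1.0, 1.0, 1.0], [0.0, 0.0, 0.0], [4.0, 0.0, 0.0]])
pair = DenseSparsePair(X, csr_matrix(Y))
d_02 = pair.dist(0, 2)
```

The wrapper has to swap the sample counts as well as the indices passed to `dist`; otherwise callers iterating over `n_samples_X()` rows would index the wrong operand.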