Different clustering results when setting metrics = "cosine", using callable cosine function and direct cos distance input. #3021

zby3 · 2024-04-22T17:00:17Z

Please make sure these conditions are met

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of scanpy.
(optional) I have confirmed this bug exists on the main branch of scanpy.

What happened?

The following three ways of using the same cosine similarity with sc.pp.neighbors, each followed by Leiden clustering, give different results:

1. Set metric="cosine" in sc.pp.neighbors().
2. Use a self-written callable cosine similarity function in KNeighborsTransformer and pass it through the transformer option of sc.pp.neighbors().
3. Compute a cosine similarity matrix up front, then use it to set adata.obsp['connectivities'].

Option 1 generates 85 clusters, option 2 generates 170 clusters, and option 3 generates 183 clusters.

Minimal code sample

import scanpy as sc
from sklearn.neighbors import KNeighborsTransformer
import numpy as np
from numpy.linalg import norm
from sklearn.metrics.pairwise import cosine_similarity
adata = sc.datasets.pbmc68k_reduced()

# Option 1: built-in cosine metric
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=0, metric="cosine")
sc.tl.umap(adata, random_state=42)
sc.tl.leiden(adata, resolution=10)
clusters = np.array(adata.obs["leiden"]).astype(int)
print('num of clusters: ' + str(len(set(clusters))))

# Option 2: callable cosine metric passed through KNeighborsTransformer
def cos_distance(A, B):
    # Cosine of the angle between A and B, returned as a float.
    # Note: this value is the cosine similarity; the cosine distance
    # would be 1 - cosine.
    cosine = np.dot(A, B) / (norm(A) * norm(B))
    return cosine

transformer = KNeighborsTransformer(n_neighbors=15, metric=cos_distance)
sc.pp.neighbors(adata, transformer=transformer, n_pcs=0)
sc.tl.umap(adata, random_state=42)
sc.tl.leiden(adata, resolution=10)
clusters = np.array(adata.obs["leiden"]).astype(int)
print('num of clusters: ' + str(len(set(clusters))))

# Option 3: precomputed matrix
# Note: cosine_similarity returns similarities (1 on the diagonal),
# not distances, although the result is used as a distance matrix below.
dis_mat = cosine_similarity(adata.X)

tmp = sc.neighbors._common._get_indices_distances_from_dense_matrix(dis_mat, n_neighbors=15)
adata.obsp["connectivities"] = sc.neighbors._connectivity.umap(
    knn_indices=tmp[0],
    knn_dists=tmp[1],
    n_obs=dis_mat.shape[0],
    n_neighbors=15,
)
adata.uns["neighbors"] = {"connectivities_key": "connectivities", "params": {"method": None}}
sc.tl.umap(adata, random_state=42)
sc.tl.leiden(adata, resolution=10)
clusters = np.array(adata.obs["leiden"]).astype(int)
print('num of clusters: ' + str(len(set(clusters))))
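
One thing worth checking in the sample above: cos_distance and cosine_similarity both return a similarity (1 for identical directions), while a neighbor search normally treats the metric as a distance (smaller means closer). The following is a minimal numpy-only sketch of how that can invert the neighbor ranking. The helper names cosine_sim and nearest and the toy points are made up for illustration; they are not taken from scanpy or from pbmc68k_reduced.

```python
import numpy as np

def cosine_sim(a, b):
    # Same formula as cos_distance in the sample: this is a similarity.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(query, points, metric):
    # Index of the point with the smallest metric value, which is how a
    # distance-based neighbor search interprets the metric.
    return int(np.argmin([metric(query, p) for p in points]))

# Toy data (illustrative only).
query = np.array([1.0, 0.0])
points = np.array([
    [2.0, 0.1],   # almost the same direction as query
    [0.0, 1.0],   # orthogonal to query
    [-1.0, 0.0],  # opposite direction
])

by_similarity = nearest(query, points, cosine_sim)
by_distance = nearest(query, points, lambda a, b: 1.0 - cosine_sim(a, b))
print(by_similarity, by_distance)
```

With the similarity used as if it were a distance, the "nearest" point is the one pointing in the opposite direction; with 1 - similarity it is the nearly parallel one. If options 2 and 3 behave this way, their graphs would connect the least similar cells, which could explain why they disagree with option 1.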

Error output

num of clusters: 85
num of clusters: 170
num of clusters: 183
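
If a similarity/distance mix-up is behind the differing counts (an assumption, not verified against scanpy internals), options 2 and 3 would need distance-valued inputs. Below is a rough sketch of what those could look like, using numpy only. The names cosine_distance and pairwise_cosine_distance are hypothetical, and random data stands in for adata.X.

```python
import numpy as np

def cosine_distance(a, b):
    # 1 - cosine similarity: 0 for identical directions, 2 for opposite ones.
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pairwise_cosine_distance(X):
    # Dense (n_obs, n_obs) cosine distance matrix; assumes no all-zero rows.
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return 1.0 - unit @ unit.T

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))  # stand-in for adata.X

D = pairwise_cosine_distance(X)

# The callable and the precomputed matrix should agree entry by entry.
max_diff = max(
    abs(D[i, j] - cosine_distance(X[i], X[j]))
    for i in range(X.shape[0])
    for j in range(X.shape[0])
)
print(max_diff)
```

A function like cosine_distance could be passed as the metric to KNeighborsTransformer, and a matrix like D used in place of dis_mat; sklearn.metrics.pairwise.cosine_distances computes the same kind of matrix. Whether the three cluster counts then match has not been tested here, and other differences between the three code paths may remain.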

Versions

-----
anndata     0.10.6
scanpy      1.10.1
-----
IPython             8.22.2
PIL                 10.2.0
asttokens           NA
console_thrift      NA
cycler              0.12.1
cython_runtime      NA
dateutil            2.9.0
decorator           5.1.1
executing           2.0.1
h5py                3.10.0
igraph              0.11.4
jedi                0.19.1
joblib              1.3.2
kiwisolver          1.4.5
legacy_api_wrap     NA
leidenalg           0.10.2
llvmlite            0.42.0
matplotlib          3.8.3
mpl_toolkits        NA
natsort             8.4.0
numba               0.59.1
numpy               1.26.4
packaging           24.0
pandas              2.2.1
parso               0.8.3
pickleshare         0.7.5
pkg_resources       NA
prompt_toolkit      3.0.42
psutil              5.9.0
pure_eval           0.2.2
pydev_console       NA
pydev_ipython       NA
pydevconsole        NA
pydevd_file_utils   NA
pydevd_plugins      NA
pydevd_tracing      NA
pygments            2.17.2
pynndescent         0.5.11
pyparsing           3.1.2
pytz                2024.1
scipy               1.12.0
session_info        1.0.0
six                 1.16.0
sklearn             1.4.1.post1
stack_data          0.6.2
texttable           1.7.0
threadpoolctl       3.4.0
tqdm                4.66.2
traitlets           5.14.2
typing_extensions   NA
umap                0.5.5
wcwidth             0.2.13
-----
Python 3.11.7 (main, Dec 15 2023, 12:09:56) [Clang 14.0.6 ]
macOS-14.3.1-arm64-arm-64bit
-----
Session information updated at 2024-04-22 09:58


zby3 added the Bug 🐛 label Apr 22, 2024
