Access to Diffusion Map methods (and other embedding methods) as Scikit-Learn style API #3054

jolespin · 2024-05-14T00:00:15Z

What kind of feature would you like to request?

Additional function parameters / changed functionality / changed defaults?

Please describe your wishes

It would be extremely helpful if the manifold embedding tools had a scikit-learn-style API.

For example, https://pydiffmap.readthedocs.io/en/master/reference/diffusion_map.html

Having the .fit, .transform, and .fit_transform would make the robust implementations in the backend of ScanPy a lot more accessible for users. Right now, the usage feels a bit restrictive and I'm having difficulty leveraging the power of the methods if it's not part of some similar workflow that is in the tutorials.

I'm trying to use the code in the backend of ScanPy to implement this API myself, but ScanPy is an extremely confusing package for an outside developer. There are nested functions and tests for even simple steps (many of which handle edge cases, making the package robust).

More specifically, I'm trying to use the ScanPy implementation of Diffusion Maps as I would use those from pyDiffMap or the spectral clustering from Sklearn.

I would like to be able to fit a model with data, pickle it, then transform new samples based on the fitted model. This would provide a useful interface for users looking for a nonlinear alternative to PCA.
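To make the requested workflow concrete, here is a minimal NumPy-only sketch of what such an estimator could look like. It is not scanpy's or pyDiffMap's implementation: the class name `DiffusionMapSketch`, the dense Gaussian kernel, the median bandwidth heuristic, and the Nyström-style `transform` are all assumptions chosen for illustration.

```python
import numpy as np


def _sq_dists(A, B):
    """Squared Euclidean distances between rows of A and rows of B."""
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.maximum(d2, 0.0)  # clip tiny negatives caused by round-off


class DiffusionMapSketch:
    """Hypothetical scikit-learn-style diffusion map (illustration only)."""

    def __init__(self, n_components=2, epsilon=None):
        self.n_components = n_components
        self.epsilon = epsilon  # kernel bandwidth; None -> median heuristic (assumption)

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        d2 = _sq_dists(X, X)
        self.epsilon_ = float(np.median(d2)) if self.epsilon is None else float(self.epsilon)
        K = np.exp(-d2 / self.epsilon_)  # dense Gaussian kernel
        deg = K.sum(axis=1)
        # S = D^-1/2 K D^-1/2 is symmetric and shares eigenvalues with P = D^-1 K.
        S = K / np.sqrt(np.outer(deg, deg))
        evals, evecs = np.linalg.eigh(S)
        # Largest eigenvalues first, skipping the trivial eigenvalue 1.
        order = np.argsort(evals)[::-1][1 : self.n_components + 1]
        self.eigenvalues_ = evals[order]
        # Right eigenvectors of P are D^-1/2 times the eigenvectors of S.
        self.embedding_ = evecs[:, order] / np.sqrt(deg)[:, None]
        self.X_fit_ = X
        return self

    def transform(self, X_new):
        # Nystroem-style extension: average training coordinates with
        # row-normalized kernel weights, then divide by the eigenvalues.
        X_new = np.asarray(X_new, dtype=float)
        K = np.exp(-_sq_dists(X_new, self.X_fit_) / self.epsilon_)
        P = K / K.sum(axis=1, keepdims=True)
        return (P @ self.embedding_) / self.eigenvalues_

    def fit_transform(self, X):
        return self.fit(X).embedding_


# Usage on synthetic data
rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 3))
model = DiffusionMapSketch(n_components=2).fit(X_train)
Z_new = model.transform(rng.normal(size=(5, 3)))  # shape (5, 2)
```

Because the fitted state is just a few NumPy arrays, an instance of a class like this, defined in an importable module, can be saved and restored with the standard `pickle` module. The dense kernel scales quadratically with the number of observations, so a practical version would work from a sparse neighbor graph instead.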

flying-sheep · 2024-05-16T11:23:09Z

I’m not categorically against it, but could you describe what issues you encounter trying to use scanpy on the data you want to use it on? E.g. naively, I’d think you’d just wrap your matrix in an AnnData object, then run diffmap:

>>> import scanpy as sc
>>> adata = sc.AnnData(my_matrix)  # shape: (n_observations, n_variables)
>>> sc.tl.diffmap(adata)
ValueError: You need to run `pp.neighbors` first to compute a neighborhood graph.

Then you just follow that advice and look at the object’s repr afterwards to see what’s in there:

>>> sc.pp.neighbors(adata)
>>> sc.tl.diffmap(adata)
>>> adata
AnnData object ...
    uns: diffmap_evals
    obsm: X_diffmap

Alternatively you read the docs: The diffmap docs point out both how to use neighbors …

The width (“sigma”) of the connectivity kernel is implicitly determined by the number of neighbors used to compute the single-cell graph in neighbors(). To reproduce the original implementation using a Gaussian kernel, use method=='gauss' in neighbors(). To use an exponential kernel, use the default method=='umap'. Differences between these options shouldn’t usually be dramatic.

… and where the results are pushed:

… Sets the following fields:

adata.obsm['X_diffmap'] : numpy.ndarray (dtype float)

Diffusion map representation of data, which is the right eigen basis of the transition matrix with eigenvectors as columns.

adata.uns['diffmap_evals'] : numpy.ndarray (dtype float)

Array of size (number of eigen vectors). Eigenvalues of transition matrix.

so you just take them out again:

eigenvecs, eigenvals = adata.obsm['X_diffmap'], adata.uns['diffmap_evals']

jolespin · 2024-05-17T15:03:00Z

Thanks for the explanation and walkthrough on where everything is located and how to access it! This is actually very useful, and it makes the AnnData object easier to navigate. Having access to a scikit-learn-style API would be useful for combining scanpy with other sklearn-compatible methods. The biggest thing is the .transform method to project new samples into the diffusion space. I've been trying to figure out how to implement this on my own but I hit a snag: https://stackoverflow.com/questions/78486471/how-to-add-a-transform-method-to-project-new-observations-into-an-existing-spac

pyDiffMap has an implementation of Nystroem out-of-sample extensions, used to calculate the values of the diffusion coordinates at each given point. The backend implementations of the algorithms are different, so I'm not sure if I can just port this method over.
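For reference, a Nyström-style out-of-sample extension is commonly written as follows. This is the textbook form for a plain row-normalized kernel $k$ and is an assumption here; the normalizations used by pyDiffMap and scanpy may differ in detail. The $x_1, \dots, x_n$ are the training points, and $(\lambda_k, \psi_k)$ are eigenpairs of the transition matrix $P_{ij} = p(x_i, x_j)$.

$$
p(x, x_j) = \frac{k(x, x_j)}{\sum_{l=1}^{n} k(x, x_l)}, \qquad
\psi_k(x) = \frac{1}{\lambda_k} \sum_{j=1}^{n} p(x, x_j)\, \psi_k(x_j)
$$

At a training point $x = x_i$ the right-hand side reduces to $(P\psi_k)_i / \lambda_k = \psi_k(x_i)$, so the extension agrees with the fitted coordinates. Porting it to a different backend mainly requires evaluating the same kernel and normalization that were used during fitting.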

It would also be great if said sklearn-api would have an option for custom transformers. It looks like this was already implemented but having direct access to a standalone model object w/ this capability would be incredibly useful! Nothing like this exists for DiffusionMaps right now. I'm trying to implement it myself but I also hit a snag when trying to generalize the transformer objects to build connectivity graphs: https://stackoverflow.com/questions/78486997/how-to-reproduce-kneighbors-graphinclude-self-true-using-kneighborstransfor

Any help on this front would be amazing, especially if I could just use it directly w/ scanpy as this is my preferred analysis package (I actually started to deprecate my own software suite https://github.com/jolespin/soothsayer because scanpy worked so well).

I work quite a bit in both the microbial ecology realm and single cell transcriptomics using scanpy for both. I'm trying to make a push for the microbial ecology community to start using this software as the problems being solved are very very similar.

jolespin added the Enhancement ✨ and Triage 🩺 labels on May 14, 2024

flying-sheep added the Area – API and Needs info❔ labels and removed the Triage 🩺 label on May 16, 2024
