Access to Diffusion Map methods (and other embedding methods) as Scikit-Learn style API #3054

Open
jolespin opened this issue May 14, 2024 · 2 comments
Labels
Area – API API design Enhancement ✨ Needs info❔ More information needed

Comments

@jolespin

jolespin commented May 14, 2024

What kind of feature would you like to request?

Additional function parameters / changed functionality / changed defaults?

Please describe your wishes

It would be extremely helpful if the embedding manifold tools had a scikit-learn style API.

For example, https://pydiffmap.readthedocs.io/en/master/reference/diffusion_map.html

Having .fit, .transform, and .fit_transform methods would make the robust implementations in the backend of ScanPy a lot more accessible for users. Right now, the usage feels a bit restrictive, and I'm having difficulty leveraging the power of the methods outside of workflows similar to those in the tutorials.

I'm trying to use the code in the backend of ScanPy to implement this API myself, but ScanPy is an extremely confusing package for an outside developer. There are nested functions and tests for even simple steps (many of which handle edge cases, making the package robust).

More specifically, I'm trying to use the ScanPy implementation of Diffusion Maps as I would use those from pyDiffMap or the spectral clustering from Sklearn.

I would like to be able to fit a model with data, pickle it, and then transform new samples based on the fitted model. This would provide a useful interface for users looking for a nonlinear alternative to PCA.
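
To make the requested workflow concrete, here is a minimal sketch of the fit, pickle, transform pattern using scikit-learn's existing PCA estimator as the linear stand-in. The data is synthetic and the variable names are only illustrative; the request is for a diffusion map object that could be dropped into the same pattern.

```python
import pickle

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 10))  # synthetic stand-in for training samples
X_new = rng.normal(size=(5, 10))      # synthetic stand-in for unseen samples

# Fit the model on the training data only.
model = PCA(n_components=3).fit(X_train)

# Serialize the fitted model and restore it, as one would across sessions.
restored = pickle.loads(pickle.dumps(model))

# Project the unseen samples into the space learned from X_train.
Z_new = restored.transform(X_new)
print(Z_new.shape)
```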

@jolespin jolespin added Enhancement ✨ Triage 🩺 This issue needs to be triaged by a maintainer labels May 14, 2024
@flying-sheep flying-sheep added Area – API API design Needs info❔ More information needed and removed Triage 🩺 This issue needs to be triaged by a maintainer labels May 16, 2024
@flying-sheep
Member

flying-sheep commented May 16, 2024

I’m not categorically against it, but could you describe what issues you encounter trying to use scanpy on the data you want to use it on? E.g. naively, I’d think you’d just wrap your matrix in an AnnData object, then run diffmap:

>>> import scanpy as sc
>>> adata = sc.AnnData(my_matrix)  # shape: (n_observations, n_variables)
>>> sc.tl.diffmap(adata)
ValueError: You need to run `pp.neighbors` first to compute a neighborhood graph.

Then you just follow that advice and repr the object after to see what’s in there:

>>> sc.pp.neighbors(adata)
>>> sc.tl.diffmap(adata)
>>> adata
AnnData object ...
    uns: diffmap_evals
    obsm: X_diffmap

Alternatively, you can read the docs: the diffmap docs point out both how to use neighbors

The width (“sigma”) of the connectivity kernel is implicitly determined by the number of neighbors used to compute the single-cell graph in neighbors(). To reproduce the original implementation using a Gaussian kernel, use method=='gauss' in neighbors(). To use an exponential kernel, use the default method=='umap'. Differences between these options shouldn’t usually be dramatic.

… and where the results are pushed:

… Sets the following fields:

adata.obsm['X_diffmap'] : numpy.ndarray (dtype float)

Diffusion map representation of data, which is the right eigen basis of the transition matrix with eigenvectors as columns.

adata.uns['diffmap_evals'] : numpy.ndarray (dtype float)

Array of size (number of eigen vectors). Eigenvalues of transition matrix.

so you just take them out again:

eigenvecs, eigenvals = adata.obsm['X_diffmap'], adata.uns['diffmap_evals']

@jolespin
Author

Thanks for the explanation and walkthrough on where everything is located and how to access it! This is actually very useful. This makes it easier to navigate the AnnData object. Having access to the Scikit-Learn style API would be useful for incorporating with other sklearn compatible methods. The biggest thing is the .transform method to project new samples into the diffusion space. I've been trying to figure out how to implement this on my own but I hit a snag: https://stackoverflow.com/questions/78486471/how-to-add-a-transform-method-to-project-new-observations-into-an-existing-spac

pyDiffMap has an implementation of Nystroem out-of-sample extensions, used to calculate the values of the diffusion coordinates at each given point. The backend implementations of the algorithms are different, so I'm not sure if I can just port this method over.
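
For illustration, the Nystroem idea can be sketched in plain NumPy. This is neither ScanPy's neighbor-graph-based implementation nor pyDiffMap's code: it assumes a dense Gaussian kernel with a fixed bandwidth `sigma` and no density normalization, and the class name `DiffusionMapSketch` is hypothetical. A new point is embedded by averaging the fitted eigenvectors with its kernel-derived transition probabilities to the training points, then dividing by the eigenvalues.

```python
import numpy as np


def _sq_dists(A, B):
    """Pairwise squared Euclidean distances between rows of A and rows of B."""
    d2 = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.maximum(d2, 0.0)  # clip tiny negatives caused by round-off


class DiffusionMapSketch:
    """Hypothetical estimator; not part of scanpy, pyDiffMap, or scikit-learn."""

    def __init__(self, n_components=2, sigma=1.0):
        self.n_components = n_components
        self.sigma = sigma  # assumed fixed Gaussian bandwidth

    def _kernel(self, A, B):
        return np.exp(-_sq_dists(A, B) / (2.0 * self.sigma ** 2))

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        K = self._kernel(X, X)
        d = K.sum(axis=1)
        # Symmetric conjugate of the transition matrix P = D^-1 K.
        S = K / np.sqrt(np.outer(d, d))
        evals, evecs = np.linalg.eigh(S)
        # Sort descending and skip the trivial eigenvalue 1 (constant vector).
        order = np.argsort(evals)[::-1][1 : self.n_components + 1]
        self.evals_ = evals[order]
        # Right eigenvectors of P, one column per component.
        self.embedding_ = evecs[:, order] / np.sqrt(d)[:, None]
        self.X_fit_ = X
        return self

    def transform(self, X_new):
        # Nystroem extension: psi(y) = (1 / lambda) * sum_j p(y, x_j) * psi(x_j)
        K_new = self._kernel(np.asarray(X_new, dtype=float), self.X_fit_)
        P_new = K_new / K_new.sum(axis=1, keepdims=True)
        return (P_new @ self.embedding_) / self.evals_

    def fit_transform(self, X):
        return self.fit(X).embedding_


rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 3))  # synthetic data
X_new = rng.normal(size=(5, 3))

model = DiffusionMapSketch(n_components=2, sigma=1.0).fit(X_train)
print(model.transform(X_new).shape)
# Sanity check: extending the training points reproduces the fitted embedding.
print(np.allclose(model.transform(X_train), model.embedding_))
```

Because the training rows satisfy the eigenvector equation exactly, transforming them gives back the fitted coordinates, which is a convenient check when porting the idea to a different backend. Dividing by small eigenvalues amplifies noise, so higher components extend less reliably.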

It would also be great if said sklearn-api would have an option for custom transformers. It looks like this was already implemented but having direct access to a standalone model object w/ this capability would be incredibly useful! Nothing like this exists for DiffusionMaps right now. I'm trying to implement it myself but I also hit a snag when trying to generalize the transformer objects to build connectivity graphs: https://stackoverflow.com/questions/78486997/how-to-reproduce-kneighbors-graphinclude-self-true-using-kneighborstransfor
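
One possible way to line the two scikit-learn interfaces up is sketched below, on synthetic data without duplicate rows. It has not been checked against the exact setup in the linked question. It relies on `KNeighborsTransformer.fit_transform` querying the training points explicitly (so each sample is returned as its own neighbor) and on the documented behavior that one extra neighbor is computed when `mode="distance"`.

```python
import numpy as np
from sklearn.neighbors import KNeighborsTransformer, kneighbors_graph

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))  # synthetic data, no duplicate rows
k = 5

# Connectivity mode: both calls return k neighbors per row, self included.
ref_conn = kneighbors_graph(X, n_neighbors=k, mode="connectivity", include_self=True)
tr_conn = KNeighborsTransformer(n_neighbors=k, mode="connectivity").fit_transform(X)

# Distance mode: the transformer stores k + 1 neighbors per row (self plus k),
# so the reference graph is built with k + 1 neighbors to match.
ref_dist = kneighbors_graph(X, n_neighbors=k + 1, mode="distance", include_self=True)
tr_dist = KNeighborsTransformer(n_neighbors=k, mode="distance").fit_transform(X)

same_conn = np.array_equal(ref_conn.toarray(), tr_conn.toarray())
same_dist = np.allclose(ref_dist.toarray(), tr_dist.toarray())
print(same_conn, same_dist)
```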

Any help on this front would be amazing, especially if I could just use it directly w/ scanpy as this is my preferred analysis package (I actually started to deprecate my own software suite https://github.com/jolespin/soothsayer because scanpy worked so well).

I work quite a bit in both the microbial ecology realm and single cell transcriptomics using scanpy for both. I'm trying to make a push for the microbial ecology community to start using this software as the problems being solved are very very similar.
