Loophole with using scaled adata for sc.tl.score_genes_cell_cycle #3030

Open
kaushik-roy-physics opened this issue Apr 26, 2024 · 0 comments

I have a question about the recommendations in:

https://github.com/scverse/scanpy_usage/blob/master/180209_cell_cycle/cell_cycle.ipynb

to compute cell cycle scores using Scanpy. Specifically, it says that the adata should be scaled to zero mean and unit variance before calling sc.tl.score_genes_cell_cycle. But when I look at the source code, particularly the part pasted below, I notice that unless use_raw is set to True, obs_avg will be zero for every gene once the adata has been scaled (assuming gene_pool is not set).


_adata = adata.raw if use_raw else adata
_adata_subset = (
    _adata[:, gene_pool] if len(gene_pool) < len(_adata.var_names) else _adata
)
# average expression of genes
if issparse(_adata_subset.X):
    obs_avg = pd.Series(
        np.array(_sparse_nanmean(_adata_subset.X, axis=0)).flatten(),
        index=gene_pool,
    )
else:
    obs_avg = pd.Series(np.nanmean(_adata_subset.X, axis=0), index=gene_pool)

# Sometimes (and I don't know how) missing data may be there, with nansfor
obs_avg = obs_avg[np.isfinite(obs_avg)]
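
To make the concern concrete, here is a minimal NumPy-only sketch. It does not call Scanpy; the toy matrix, the gene levels, and the manual z-scoring are assumptions standing in for a scaled adata. It only shows that per-gene averages, which differ widely before scaling, all collapse to roughly zero afterwards.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 cells x 5 genes with very different expression levels.
gene_levels = np.array([0.5, 1.0, 5.0, 20.0, 100.0])
X = rng.poisson(lam=gene_levels, size=(200, 5)).astype(float)

# Per-gene average on unscaled data: this is informative about expression level.
raw_avg = np.nanmean(X, axis=0)

# Manual per-gene z-scoring (zero mean, unit variance), used here as a
# stand-in for "scaled" data; this is an assumption, not Scanpy's code.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Per-gene average on scaled data: numerically zero for every gene.
scaled_avg = np.nanmean(X_scaled, axis=0)

print("raw averages:   ", raw_avg)
print("scaled averages:", scaled_avg)
```

Any binning by expression level built on the second vector has nothing to distinguish one gene from another.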

How is the selection of control_genes meaningful in this scenario? With ctrl_size = min(len(s_genes), len(g2m_genes)), the control genes will basically be a randomly sampled subset of all genes, without the intended property of being similar in expression level to the marker gene list. I think we should instead compute obs_avg from the expression data in the raw adata, identify the control_genes using the binning procedure below, and then calculate s_score and g2m_score on the scaled data if that is necessary (taking the mean of the selected control genes from the scaled adata expression matrix).

I hope this makes sense. Any clarifications or confirmations would be really helpful.


n_items = int(np.round(len(obs_avg) / (n_bins - 1)))
obs_cut = obs_avg.rank(method="min") // n_items
control_genes = pd.Index([], dtype="string")

# now pick `ctrl_size` genes from every cut
for cut in np.unique(obs_cut.loc[gene_list]):
    r_genes: pd.Index[str] = obs_cut[obs_cut == cut].index
    if ctrl_size < len(r_genes):
        r_genes = r_genes.to_series().sample(ctrl_size).index
    control_genes = control_genes.union(r_genes.difference(gene_list))
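
The proposal above can be sketched as follows. This is only an illustration under stated assumptions: `pick_control_genes` is a hypothetical helper that mirrors the binning excerpt, the data are synthetic, manual z-scoring stands in for the scaled adata, and the score is assumed to be the mean of the marker genes minus the mean of the control genes. None of this is Scanpy's API.

```python
import numpy as np
import pandas as pd


def pick_control_genes(obs_avg, gene_list, n_bins=25, ctrl_size=50, seed=0):
    """Hypothetical helper: sample control genes from the expression bins
    that contain the genes in gene_list (mirrors the excerpt above)."""
    n_items = int(np.round(len(obs_avg) / (n_bins - 1)))
    obs_cut = obs_avg.rank(method="min") // n_items
    control = obs_avg.index[:0]  # empty index with the same dtype
    for cut in np.unique(obs_cut.loc[gene_list]):
        r_genes = obs_cut[obs_cut == cut].index
        if ctrl_size < len(r_genes):
            r_genes = r_genes.to_series().sample(ctrl_size, random_state=seed).index
        control = control.union(r_genes.difference(gene_list))
    return control


# Synthetic data (assumption): 300 cells x 500 genes, g0 lowest, g499 highest.
rng = np.random.default_rng(0)
n_cells, n_genes = 300, 500
genes = pd.Index([f"g{i}" for i in range(n_genes)])
levels = np.linspace(0.5, 50.0, n_genes)
X_raw = rng.poisson(lam=levels, size=(n_cells, n_genes)).astype(float)
X_scaled = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

marker_genes = list(genes[-10:])  # a highly expressed marker set

# Step 1: bin and pick controls using averages from the UNSCALED data.
raw_avg = pd.Series(X_raw.mean(axis=0), index=genes)
control = pick_control_genes(raw_avg, marker_genes, n_bins=25, ctrl_size=15)

# Step 2: compute the score on the SCALED data with those controls.
scaled = pd.DataFrame(X_scaled, columns=genes)
score = scaled[marker_genes].mean(axis=1) - scaled[control].mean(axis=1)

print(len(control), "control genes; lowest raw average:", raw_avg[control].min())
```

In this sketch the controls come from the same high-expression bins as the markers, which is the property that is lost when the binning itself is done on scaled averages.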