Loophole with using scaled adata for sc.tl.score_genes_cell_cycle #3030

kaushik-roy-physics · 2024-04-26T01:44:37Z

I have a question regarding the recommendations provided in :

https://github.com/scverse/scanpy_usage/blob/master/180209_cell_cycle/cell_cycle.ipynb

to compute cell cycle scores using Scanpy. Specifically it mentions that the adata should be scaled to have zero mean and unit variance before using sc.tl.score_genes_cell_cycle. But when I look at the source code, particularly the part pasted below, I notice that unless the 'use_raw' option is set to True, obs_avg will be zero for the adata after scaling assuming gene_pool is not set.

_adata = adata.raw if use_raw else adata
_adata_subset = (
_adata[:, gene_pool] if len(gene_pool) < len(_adata.var_names) else _adata
)
# average expression of genes
if issparse(_adata_subset.X):
obs_avg = pd.Series(
np.array(_sparse_nanmean(_adata_subset.X, axis=0)).flatten(),
index=gene_pool,
)
else:
obs_avg = pd.Series(np.nanmean(_adata_subset.X, axis=0), index=gene_pool)

# Sometimes (and I don't know how) missing data may be there, with nansfor
obs_avg = obs_avg[np.isfinite(obs_avg)]

How is the selection of control_genes meaningful in this scenario? We will basically have, control_genes=min(len(s_genes, g2m_genes)) which is a randomly sampled subset of the total genes but this won't have this feature of selecting genes that are similar in expression levels to the marker gene list. I think we should find control_genes using the expression data in the raw adata, identify the control_genes using the binning procedure below and then calculate s_scores and g2m_scores for these genes using the scaled data if that is necessary (mean of these randomly selected genes in the scaled adata expression matrix).

I hope this makes sense. Any clarifications or confirmations would be really helpful.

n_items = int(np.round(len(obs_avg) / (n_bins - 1)))
obs_cut = obs_avg.rank(method="min") // n_items
control_genes = pd.Index([], dtype="string")

# now pick `ctrl_size` genes from every cut
for cut in np.unique(obs_cut.loc[gene_list]):
    r_genes: pd.Index[str] = obs_cut[obs_cut == cut].index
    if ctrl_size < len(r_genes):
        r_genes = r_genes.to_series().sample(ctrl_size).index
    control_genes = control_genes.union(r_genes.difference(gene_list))

The text was updated successfully, but these errors were encountered:

kaushik-roy-physics changed the title ~~Using scaled adata for sc.tl.score_genes_cell_cycle~~ Loophole with using scaled adata for sc.tl.score_genes_cell_cycle Apr 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Loophole with using scaled adata for sc.tl.score_genes_cell_cycle #3030

Loophole with using scaled adata for sc.tl.score_genes_cell_cycle #3030

kaushik-roy-physics commented Apr 26, 2024

Loophole with using scaled adata for sc.tl.score_genes_cell_cycle #3030

Loophole with using scaled adata for sc.tl.score_genes_cell_cycle #3030

Comments

kaushik-roy-physics commented Apr 26, 2024