You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
to compute cell cycle scores using Scanpy. Specifically it mentions that the adata should be scaled to have zero mean and unit variance before using sc.tl.score_genes_cell_cycle. But when I look at the source code, particularly the part pasted below, I notice that unless the 'use_raw' option is set to True, obs_avg will be zero for the adata after scaling assuming gene_pool is not set.
_adata = adata.raw if use_raw else adata
_adata_subset = (
_adata[:, gene_pool] if len(gene_pool) < len(_adata.var_names) else _adata
)
# average expression of genes
if issparse(_adata_subset.X):
obs_avg = pd.Series(
np.array(_sparse_nanmean(_adata_subset.X, axis=0)).flatten(),
index=gene_pool,
)
else:
obs_avg = pd.Series(np.nanmean(_adata_subset.X, axis=0), index=gene_pool)
# Sometimes (and I don't know how) missing data may be there, with nansfor
obs_avg = obs_avg[np.isfinite(obs_avg)]
How is the selection of control_genes meaningful in this scenario? We will basically have, control_genes=min(len(s_genes, g2m_genes)) which is a randomly sampled subset of the total genes but this won't have this feature of selecting genes that are similar in expression levels to the marker gene list. I think we should find control_genes using the expression data in the raw adata, identify the control_genes using the binning procedure below and then calculate s_scores and g2m_scores for these genes using the scaled data if that is necessary (mean of these randomly selected genes in the scaled adata expression matrix).
I hope this makes sense. Any clarifications or confirmations would be really helpful.
# now pick `ctrl_size` genes from every cut
for cut in np.unique(obs_cut.loc[gene_list]):
r_genes: pd.Index[str] = obs_cut[obs_cut == cut].index
if ctrl_size < len(r_genes):
r_genes = r_genes.to_series().sample(ctrl_size).index
control_genes = control_genes.union(r_genes.difference(gene_list))
The text was updated successfully, but these errors were encountered:
kaushik-roy-physics
changed the title
Using scaled adata for sc.tl.score_genes_cell_cycle
Loophole with using scaled adata for sc.tl.score_genes_cell_cycle
Apr 26, 2024
I have a question regarding the recommendations provided in :
https://github.com/scverse/scanpy_usage/blob/master/180209_cell_cycle/cell_cycle.ipynb
to compute cell cycle scores using Scanpy. Specifically it mentions that the adata should be scaled to have zero mean and unit variance before using sc.tl.score_genes_cell_cycle. But when I look at the source code, particularly the part pasted below, I notice that unless the 'use_raw' option is set to True, obs_avg will be zero for the adata after scaling assuming gene_pool is not set.
_adata = adata.raw if use_raw else adata
_adata_subset = (
_adata[:, gene_pool] if len(gene_pool) < len(_adata.var_names) else _adata
)
# average expression of genes
if issparse(_adata_subset.X):
obs_avg = pd.Series(
np.array(_sparse_nanmean(_adata_subset.X, axis=0)).flatten(),
index=gene_pool,
)
else:
obs_avg = pd.Series(np.nanmean(_adata_subset.X, axis=0), index=gene_pool)
How is the selection of control_genes meaningful in this scenario? We will basically have, control_genes=min(len(s_genes, g2m_genes)) which is a randomly sampled subset of the total genes but this won't have this feature of selecting genes that are similar in expression levels to the marker gene list. I think we should find control_genes using the expression data in the raw adata, identify the control_genes using the binning procedure below and then calculate s_scores and g2m_scores for these genes using the scaled data if that is necessary (mean of these randomly selected genes in the scaled adata expression matrix).
I hope this makes sense. Any clarifications or confirmations would be really helpful.
n_items = int(np.round(len(obs_avg) / (n_bins - 1)))
obs_cut = obs_avg.rank(method="min") // n_items
control_genes = pd.Index([], dtype="string")
The text was updated successfully, but these errors were encountered: