Skip to content
This repository has been archived by the owner on Nov 8, 2021. It is now read-only.

Volcano plot for showing significant changes in reads per gene #35

Open
Gregory94 opened this issue Feb 23, 2021 · 7 comments
Open

Volcano plot for showing significant changes in reads per gene #35

Gregory94 opened this issue Feb 23, 2021 · 7 comments

Comments

@Gregory94
Copy link
Collaborator

Gregory94 commented Feb 23, 2021

To indicate genes that have a fitness change between different libraries, I am creating a volcano plot.
A dot represents a single gene. On the x-axis it shows the fold change in number of reads or insertions between two libraries and on the y-axis it shows the p-value (determined by an independent t-test). Note the the x-axis is in log2 scale and the y-axis is in -log10 scale.
So the interesting genes are the ones that are high on the y-axis and far away from 0 on the x-axis.
volcanoplot_reads
When I compare my results with those found by Agnes (Kornmann lab), the numbers don't seem to match. See the below figure which is using the same dataset.
volcanoplot_reads_Agnes
The method I use for determining the fold change is I sum over all reads of a specific gene in all datasets of a library (e.g. we have 2 wt datasets and 4 dNrp1 datasets) and then I normalize for the total number of insertions in the library. So I end up with two values, the normalized summed number of reads for wt and for dNrp1.
The p-value of the student t-test is determined by python using scipy.stats.ttest_ind(wt_datasets, dNrp1_datasets).

One thing in my graph is the cluster with large negative fold change. I checked the genes in this cluster and they are all genes that have 0 reads in all datasets except for one. This probably messes up the fold change calculation, but I am not sure yet about a way of dealing with this.
Another thing is that I don't know for each dataset the total number of insertions. Now I only use the insertion count in all genes (which is the only thing I have data from), but this ignores all insertions outside the genes which might explain some differences between my results and those from Agnes.

I use volcano.py for creating the above plot.

@leilaicruz
Copy link
Member

I sum over all reads of a specific gene in all datasets of a library (e.g. we have 2 wt datasets and 4 dNrp1 datasets) and then I normalize for the total number of insertions in the library.

I dont get this , shouldnt you also sum over the total number of reads from each library as the normalization for the sum of reads per gene across the same genotype library? I think if you sum reads , you shouldnt normalize with insertions.

One thing in my graph is the cluster with large negative fold change. I checked the genes in this cluster and they are all genes that have 0 reads in all datasets except for one. This probably messes up the fold change calculation, but I am not sure yet about a way of dealing with this.

those genes have zero reads in both , the reference and experimental library? what about the insertions? if you do a volcano for the insertions do you see the same?

Another thing is that I don't know for each dataset the total number of insertions. Now I only use the insertion count in all genes (which is the only thing I have data from), but this ignores all insertions outside the genes which might explain some differences between my results and those from Agnes.

so in that case , I think it shouldnt matter that much because you are only off by a constant . But the pattern of low vs high log FC genes should be the same

@leilaicruz leilaicruz added this to To do in SATAY-analysis-workflow-board via automation Feb 23, 2021
@leilaicruz leilaicruz moved this from To do to In progress in SATAY-analysis-workflow-board Feb 23, 2021
@leilaicruz
Copy link
Member

and also would you put a link to where we can see your code?

@Gregory94
Copy link
Collaborator Author

Sorry if my explanation wasn't entirely clear.
During normalization I do distinguish between reads and insertions, so reads are normalized with the total number of reads and insertions with the total number of insertions.

For the genes in the cluster with low fold change, these genes have, for example, zero counts in both wt datasets and in 3 dNrp1 datasets (there are 4 dNrp1 datasets). This is similar when I look at both the number of reads and the number of insertions. When I plot the number of insertions per gene, it also shows clustering of a few genes.

Regarding the the normalization with only the reads in the genes instead all the reads in the entire genome. I agree this shouldn't change the distribution, but it might be one of the things that explains why the values on the axes of my graph are different.

@leilaicruz
Copy link
Member

where is nrp1 localized when you compare all WT vs all dnrp1? this one should be highly significant with a high FC for the WT with respect the dnrp1.

@Gregory94
Copy link
Collaborator Author

Nrp1 does show a relatively large and significant fold change in both the reads and insertions.
volcanoplot_reads
volcanoplot_tn

@Gregory94
Copy link
Collaborator Author

I have checked the genes in the cluster at high fold changes, and it seems to be caused by situations where one library has insertions, but the other dataset doesn't. In this case the fold change is not correctly determined. Normally the fold change is determined by the ratio between the experimental dataset (dNrp1) and the reference dataset (wt). But, for example, if one dataset has 10 insertions and the other dataset has 0 insertions, then the ratio doesn't work anymore since either it will be 0 or there is a division by 0. To circumvent this issue I currently have that in these cases the fold change is equal to the number of insertions of the non-zero dataset (i.e. 10 in this example). But apparently this exaggerates the fold change compared to the rest of the data and technically this isn't correct as well. I have looked to solutions and one that I have found is where they say they add a very small number to the zero dataset to prevent the division with zero. Since they have very different data, I am not sure whether this is a valid assumption for our data as well. Also, it will be depending on what number you are choosing what the outcome will be. Also it was suggested to just leave those data points out, but I think that isn't a nice approach as well since these genes are relevant to us.

@leilaicruz
Copy link
Member

Ok great, so in your plot the points in the left are the ones who are significant for the WT library in comparison to the mutant library.

For the problematic points you could also exclude them for now ,from the analysis . I will soon hear from Agnes about how they built these plots and the raw sequencing data!

One note on the number of transposons , she said , that they mapped ~ 500000 Tn for the mutants and 400000 for the WT. So maybe that improves your normalization.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
Development

No branches or pull requests

2 participants