Volcano plot for showing significant changes in reads per gene #35

Gregory94 · 2021-02-23T10:21:01Z

To indicate genes that have a fitness change between different libraries, I am creating a volcano plot.
A dot represents a single gene. On the x-axis it shows the fold change in number of reads or insertions between two libraries and on the y-axis it shows the p-value (determined by an independent t-test). Note the the x-axis is in log2 scale and the y-axis is in -log10 scale.
So the interesting genes are the ones that are high on the y-axis and far away from 0 on the x-axis.

When I compare my results with those found by Agnes (Kornmann lab), the numbers don't seem to match. See the below figure which is using the same dataset.

The method I use for determining the fold change is I sum over all reads of a specific gene in all datasets of a library (e.g. we have 2 wt datasets and 4 dNrp1 datasets) and then I normalize for the total number of insertions in the library. So I end up with two values, the normalized summed number of reads for wt and for dNrp1.
The p-value of the student t-test is determined by python using scipy.stats.ttest_ind(wt_datasets, dNrp1_datasets).

One thing in my graph is the cluster with large negative fold change. I checked the genes in this cluster and they are all genes that have 0 reads in all datasets except for one. This probably messes up the fold change calculation, but I am not sure yet about a way of dealing with this.
Another thing is that I don't know for each dataset the total number of insertions. Now I only use the insertion count in all genes (which is the only thing I have data from), but this ignores all insertions outside the genes which might explain some differences between my results and those from Agnes.

I use volcano.py for creating the above plot.

The text was updated successfully, but these errors were encountered:

leilaicruz · 2021-02-23T13:50:04Z

I sum over all reads of a specific gene in all datasets of a library (e.g. we have 2 wt datasets and 4 dNrp1 datasets) and then I normalize for the total number of insertions in the library.

I dont get this , shouldnt you also sum over the total number of reads from each library as the normalization for the sum of reads per gene across the same genotype library? I think if you sum reads , you shouldnt normalize with insertions.

One thing in my graph is the cluster with large negative fold change. I checked the genes in this cluster and they are all genes that have 0 reads in all datasets except for one. This probably messes up the fold change calculation, but I am not sure yet about a way of dealing with this.

those genes have zero reads in both , the reference and experimental library? what about the insertions? if you do a volcano for the insertions do you see the same?

Another thing is that I don't know for each dataset the total number of insertions. Now I only use the insertion count in all genes (which is the only thing I have data from), but this ignores all insertions outside the genes which might explain some differences between my results and those from Agnes.

so in that case , I think it shouldnt matter that much because you are only off by a constant . But the pattern of low vs high log FC genes should be the same

leilaicruz · 2021-02-23T13:51:04Z

and also would you put a link to where we can see your code?

Gregory94 · 2021-02-23T14:52:12Z

Sorry if my explanation wasn't entirely clear.
During normalization I do distinguish between reads and insertions, so reads are normalized with the total number of reads and insertions with the total number of insertions.

For the genes in the cluster with low fold change, these genes have, for example, zero counts in both wt datasets and in 3 dNrp1 datasets (there are 4 dNrp1 datasets). This is similar when I look at both the number of reads and the number of insertions. When I plot the number of insertions per gene, it also shows clustering of a few genes.

Regarding the the normalization with only the reads in the genes instead all the reads in the entire genome. I agree this shouldn't change the distribution, but it might be one of the things that explains why the values on the axes of my graph are different.

leilaicruz · 2021-02-23T15:10:16Z

where is nrp1 localized when you compare all WT vs all dnrp1? this one should be highly significant with a high FC for the WT with respect the dnrp1.

Gregory94 · 2021-02-23T15:58:06Z

Nrp1 does show a relatively large and significant fold change in both the reads and insertions.

Gregory94 · 2021-02-24T12:14:25Z

I have checked the genes in the cluster at high fold changes, and it seems to be caused by situations where one library has insertions, but the other dataset doesn't. In this case the fold change is not correctly determined. Normally the fold change is determined by the ratio between the experimental dataset (dNrp1) and the reference dataset (wt). But, for example, if one dataset has 10 insertions and the other dataset has 0 insertions, then the ratio doesn't work anymore since either it will be 0 or there is a division by 0. To circumvent this issue I currently have that in these cases the fold change is equal to the number of insertions of the non-zero dataset (i.e. 10 in this example). But apparently this exaggerates the fold change compared to the rest of the data and technically this isn't correct as well. I have looked to solutions and one that I have found is where they say they add a very small number to the zero dataset to prevent the division with zero. Since they have very different data, I am not sure whether this is a valid assumption for our data as well. Also, it will be depending on what number you are choosing what the outcome will be. Also it was suggested to just leave those data points out, but I think that isn't a nice approach as well since these genes are relevant to us.

leilaicruz · 2021-02-24T12:51:06Z

Ok great, so in your plot the points in the left are the ones who are significant for the WT library in comparison to the mutant library.

For the problematic points you could also exclude them for now ,from the analysis . I will soon hear from Agnes about how they built these plots and the raw sequencing data!

One note on the number of transposons , she said , that they mapped ~ 500000 Tn for the mutants and 400000 for the WT. So maybe that improves your normalization.

Gregory94 self-assigned this Feb 23, 2021

Gregory94 added the data exploration label Feb 23, 2021

leilaicruz added this to To do in SATAY-analysis-workflow-board via automation Feb 23, 2021

leilaicruz moved this from To do to In progress in SATAY-analysis-workflow-board Feb 23, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Volcano plot for showing significant changes in reads per gene #35

Volcano plot for showing significant changes in reads per gene #35

Gregory94 commented Feb 23, 2021 •

edited

leilaicruz commented Feb 23, 2021

leilaicruz commented Feb 23, 2021

Gregory94 commented Feb 23, 2021

leilaicruz commented Feb 23, 2021

Gregory94 commented Feb 23, 2021

Gregory94 commented Feb 24, 2021

leilaicruz commented Feb 24, 2021

Volcano plot for showing significant changes in reads per gene #35

Volcano plot for showing significant changes in reads per gene #35

Comments

Gregory94 commented Feb 23, 2021 • edited

leilaicruz commented Feb 23, 2021

leilaicruz commented Feb 23, 2021

Gregory94 commented Feb 23, 2021

leilaicruz commented Feb 23, 2021

Gregory94 commented Feb 23, 2021

Gregory94 commented Feb 24, 2021

leilaicruz commented Feb 24, 2021

Gregory94 commented Feb 23, 2021 •

edited