Different ways of computing the fitness of an ORF from the normalized reads #25

leilaicruz · 2020-10-09T09:48:55Z

We can discuss here different approaches for obtaining relative fitness values from the arbitrary units of normalized reads per ORF.

leilaicruz · 2020-10-09T09:54:16Z

Case # 1: Normalization of the fitness based on the normalized reads per bp in the central 80% of the gene. Here I normalize with the HO locus: the reads per bp of each gene are divided by the reads per bp of HO.

Using this measure of fitness for every ORF, I get the following for the WT and the mutant (dpl1d) from Benoit:

In particular:

So this means that, using this measure, the fitness of the bem1 deletion is higher than that of the bem3 deletion (in Benoit's growth conditions), which makes me suspect that this is not an ideal measure of fitness.

Gregory94 · 2020-10-09T10:00:40Z

With this normalization, do you account for the fact that the number of insertions is not constant across chromosomes? And also for the differences in the number of reads between different SATAY experiments?

leilaicruz · 2020-10-09T10:02:06Z

Case # 2:

In this case I at least get that the dbem1 fitness is lower than the dbem3 fitness, which is what I naively expect.

When I try to predict the genes most correlated with dpl1, to get the positive and negative interactors, I get this:

The gene names with a "-new" label are the predicted genes that do not match the current annotations in SGD. The ones without the label match current knowledge.

The script can be found here: https://github.com/Gregory94/LaanLab-SATAY-DataAnalysis/blob/dev_Leila/Python_scripts/corrrelation-based-networks/fitness-WT-mutant-from-reads.py

leilaicruz · 2020-10-09T10:03:48Z

> With this normalization, do you account for the fact that the number of insertions is not constant across chromosomes? And also for the differences in the number of reads between different SATAY experiments?

I just took the values from your 'reads_per_bp_80%' column for each gene and divided them by the value for HO, as a first attempt.
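A minimal sketch of that first attempt, assuming the per-gene values are available as a mapping from gene name to its 'reads_per_bp_80%' value. The function name and the numbers are made up for illustration; only the column name and the division by HO come from the description above.

```python
def normalize_to_reference(reads_per_bp, reference="HO"):
    """Divide every gene's reads-per-bp value by that of the reference locus."""
    ref_value = reads_per_bp[reference]
    if ref_value <= 0:
        raise ValueError(f"reference locus {reference!r} has no reads")
    return {gene: value / ref_value for gene, value in reads_per_bp.items()}


# Hypothetical 'reads_per_bp_80%' values, not real SATAY data.
example = {"HO": 2.0, "BEM1": 0.5, "BEM3": 3.0}
relative_fitness = normalize_to_reference(example)
```

By construction the reference locus gets a value of 1, and this sketch does not correct for differences in insertion density between chromosomes or in read depth between experiments.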

I opened this issue to discuss other possible ideas to implement.

Wteunisse · 2020-10-16T08:55:15Z

In the meeting with Werner, Liedewij suggested using an exponentially decaying weighted average over the 10 kb of non-coding regions around a gene. This gives more weight to non-coding regions close to the gene, as they are more likely to be affected by the same low-frequency effects than non-coding regions further away.
Furthermore, it was suggested to apply this averaging (or moving average) only to non-coding regions, and within a gene to simply add up all the reads/transposons and divide by the gene length. Otherwise, the regions around a gene would have a large impact on the calculated fitness of that gene, which we don't expect.
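The suggestion above could be sketched roughly as follows. This is only an illustration: the function names, the `decay_length` parameter and its default value are assumptions (the thread only fixes the 10 kb window), and the input is assumed to contain insertion positions that are already known to be non-coding, apart from those inside the gene of interest.

```python
import math


def decayed_noncoding_average(positions, reads, gene_start, gene_end,
                              window=10_000, decay_length=2_500.0):
    """Weighted mean of reads at non-coding positions within `window` bp of a gene.

    Weights fall off as exp(-distance / decay_length), where distance is
    measured to the nearest boundary of the gene.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for pos, n_reads in zip(positions, reads):
        if gene_start <= pos <= gene_end:
            continue  # inside the gene: not part of the non-coding background
        distance = gene_start - pos if pos < gene_start else pos - gene_end
        if distance > window:
            continue  # too far away to be considered
        weight = math.exp(-distance / decay_length)
        weighted_sum += weight * n_reads
        weight_total += weight
    return weighted_sum / weight_total if weight_total else float("nan")


def reads_per_bp_in_gene(positions, reads, gene_start, gene_end):
    """Sum of all reads inside the gene divided by the gene length (no weighting)."""
    total = sum(n for pos, n in zip(positions, reads)
                if gene_start <= pos <= gene_end)
    return total / (gene_end - gene_start + 1)
```

Dividing `reads_per_bp_in_gene` by a background derived from `decayed_noncoding_average` would then be one possible way to normalize a gene against its local surroundings; how the two should be combined is not settled in this thread.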

leilaicruz · 2020-10-16T09:23:36Z

Notes from Werner to calculate the fitness from insertions/reads data

Suppose (to a first approximation) that every possible transposon insertion (labelled here a, b, c, d, …, n) has the same initial frequency and is found in N_0 cells in the population. Each cell with that insertion then grows according to its fitness ω_n, commonly defined as inversely proportional to its doubling time τ_n.
After some time T, you stop the experiment and harvest the cells. At this point, the abundance of cells with insertion n, N_(n,T), is given by:

You observe N_(n,T), you control T, and you will have to estimate N_0 based on known fitness in the case of non-coding region insertions / HO locus insertions.
Of course, N_0 may not be a single, uniform value for every region of the DNA: it probably has Poisson-distributed noise, there may be additional Poisson noise in the conversion from transposons to reads, and there could be differences in lag time after transformation that influence your setup (but maybe you know that this is not the case). So you can extend this simple fitness calculation much further, if needed. But as you say, you are still in the testing phase in terms of defining fitness.
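As a rough sketch of how these notes could be turned into a calculation: the expression for N_(n,T) is not reproduced in this thread, so the code below assumes plain exponential growth, N_(n,T) = N_0 · 2^(ω_n · T), with ω_n = 1/τ_n (proportionality constant set to 1). The function names and all numbers are hypothetical, and none of the noise sources or lag times mentioned above are modelled.

```python
import math


def estimate_n0(n_ref_at_T, omega_ref, T):
    """Back-calculate N_0 from a reference insertion (e.g. HO) of known fitness."""
    return n_ref_at_T / 2 ** (omega_ref * T)


def fitness_from_abundance(n_at_T, n0, T):
    """Invert the assumed N_(n,T) = N_0 * 2**(omega_n * T) for omega_n."""
    if n_at_T <= 0 or n0 <= 0:
        return float("nan")  # nothing observed: fitness cannot be estimated this way
    return math.log2(n_at_T / n0) / T


# Hypothetical numbers: harvest after T = 900 min, reference doubling time 90 min.
T = 900.0
omega_ref = 1.0 / 90.0
n0 = estimate_n0(n_ref_at_T=1024.0, omega_ref=omega_ref, T=T)
omega_mutant = fitness_from_abundance(n_at_T=32.0, n0=n0, T=T)
relative_omega = omega_mutant / omega_ref
```

In this made-up example the reference doubles ten times and the mutant five times during the experiment, so the mutant ends up with about half the reference fitness.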

leilaicruz · 2020-10-16T09:31:01Z

I guess the question about the negative binomial has to do with the fact that the "abundance of cells with insertion n at t=T", i.e. N_(n,T), is what we want to know, but what we observe is the number of reads of insertion n, which cannot be directly coupled to the number of cells with that insertion?

Wteunisse · 2020-10-16T09:37:51Z

In response to Werner's suggestion: I made a growth model based purely on exponential growth. For me, the goal was to calculate the optimal growth time before harvesting. I based it on the doubling times measured in biotek experiments on WT and bem1del cells, the measured lag times, and the observation that the exponential growth phase lasts until an OD of about 0.3.

Note that there is a huge variation in both the measured doubling times and the lag times, so this is just a very rough estimate (the two lines show + and - one std of the measured doubling time). In this model, I assumed that 1 in every 2175 cells is wildtype and the others are bem1del; this value is the genome length divided by the total length of the genes in which a transposon insertion would lead to a wildtype phenotype.
As Werner pointed out, in the end it is very difficult to compare read counts because of the difference in lag time (here estimated to be 600 min for WT and 1100 min for bem1del cells). As the lag time is not known for our SATAY data, it becomes difficult to compare values.

As can be seen from this figure, it might be very difficult to obtain a good resolution between neutral mutations in the bem1del and lethal mutations, as WT cells very quickly sweep the population.
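A minimal sketch of a two-population model of this kind, to show how the WT fraction can be followed over time. The lag times (600 and 1100 min) and the initial 1-in-2175 ratio are taken from the comment above; the doubling times used as defaults are placeholders, not the measured values, and saturation at an OD of about 0.3 is not modelled.

```python
def population(t, n0, doubling_time, lag_time):
    """Cells at time t (min): constant during the lag, exponential afterwards."""
    if t <= lag_time:
        return n0
    return n0 * 2 ** ((t - lag_time) / doubling_time)


def wt_fraction(t, wt_doubling=90.0, mut_doubling=130.0,
                wt_lag=600.0, mut_lag=1100.0, cells_per_wt=2175):
    """Fraction of WT cells at time t, starting from 1 WT cell per `cells_per_wt` cells.

    The doubling-time defaults are hypothetical placeholders.
    """
    wt = population(t, 1.0, wt_doubling, wt_lag)
    mutant = population(t, cells_per_wt - 1.0, mut_doubling, mut_lag)
    return wt / (wt + mutant)


# WT fraction at a few time points (minutes after the start of the culture).
fractions = {t: wt_fraction(t) for t in (0, 1000, 2000, 3000)}
```

With a shorter lag and a shorter doubling time for WT, the fraction in `fractions` rises steadily from 1/2175, which is the sweep described above.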

Gregory94 · 2020-10-29T14:17:18Z

I came across this article (https://academic.oup.com/nar/article/45/11/e93/3044354) that discusses how to get genetic interactions from transposon sequencing data. It uses Bayesian statistics to determine whether a change in read count is significant and can be attributed to a change in fitness rather than to stochasticity and/or noise. For this they also take a negative binomial distribution, which they estimate using normal distributions. I don't understand the details yet, but maybe it is interesting.
They have also made a python script available (https://orca1.tamu.edu/essentiality/GI/) in which you can enter data (in the same format as our data) and it will determine significant changes in read counts. But they have a slightly different experimental setup, so I am not sure whether it would be useful for us.

leilaicruz · 2020-10-29T14:42:09Z

That is great! We should take a look at it, for sure. I knew that someone would already have had the same idea, hahaha, but I hadn't found it yet. Thanks, Greg!

…

On Thu, Oct 29, 2020, 15:17 Gregory94 ***@***.***> wrote:

> I came across this article <https://academic.oup.com/nar/article/45/11/e93/3044354> that discusses how to get genetic interactions from transposon sequencing data. It uses Bayesian statistics to determine whether a change in read count is significant and can be attributed to a change in fitness rather than to stochasticity and/or noise. For this they also take a negative binomial distribution, which they estimate using normal distributions. I don't understand the details yet, but maybe it is interesting.
> They have also made a python script <https://orca1.tamu.edu/essentiality/GI/> available in which you can enter data (in the same format as our data) and it will determine significant changes in read counts. But they have a slightly different experimental setup, so I am not sure whether it would be useful for us.

EKingma · 2020-11-01T17:32:40Z

> I came across this article that discusses how to get genetic interactions from transposon sequencing data. It uses Bayesian statistics to determine whether a change in read count is significant and can be attributed to a change in fitness rather than to stochasticity and/or noise. For this they also take a negative binomial distribution, which they estimate using normal distributions. I don't understand the details yet, but maybe it is interesting.
> They have also made a python script available in which you can enter data (in the same format as our data) and it will determine significant changes in read counts. But they have a slightly different experimental setup, so I am not sure whether it would be useful for us.

Interesting! Their method is indeed a bit different from ours, but it might be an idea to store a sample of the induction culture (just before starting the reseed) @Wteunisse

leilaicruz · 2020-11-06T08:39:59Z

@EKingma, your comment follows from what they describe on page 2, right?

Gregory94 · 2020-11-12T13:18:26Z

This is another paper that also takes a sample before reseeding to compare it with the cells after selection. Their approach is somewhat more similar to what we are currently trying to do for determining genetic interactions. The paper from DeJesus et al. 2017 (the one I previously mentioned) compares its results with this paper and is not very enthusiastic about it, yet I think it might still be interesting.

leilaicruz created this issue from a note in SATAY-analysis-workflow-board (In progress) Oct 9, 2020

leilaicruz self-assigned this Oct 9, 2020

leilaicruz added the data processing label Oct 9, 2020

leilaicruz added the help wanted label Oct 16, 2020
