This repository has been archived by the owner on Nov 8, 2021. It is now read-only.

Different ways of computing the fitness of an ORF from the normalized reads #25

Open
leilaicruz opened this issue Oct 9, 2020 · 13 comments
Labels
data processing help wanted Extra attention is needed

Comments

@leilaicruz
Member

leilaicruz commented Oct 9, 2020

We can discuss here different approaches for getting relative fitness values from the normalized reads per ORF, which are in arbitrary units.

@leilaicruz leilaicruz created this issue from a note in SATAY-analysis-workflow-board (In progress) Oct 9, 2020
@leilaicruz leilaicruz self-assigned this Oct 9, 2020
@leilaicruz
Member Author

leilaicruz commented Oct 9, 2020

Case #1: Fitness based on the normalized reads per bp in the central 80% of the gene. Here I use the HO locus as the normalization reference:
image

Using this measure of fitness for every ORF, I get the following for the WT and the mutant (dpl1d) from Benoit:
image

In particular:
image

So with this measure, the fitness of the bem1 deletion is higher than that of the bem3 deletion (in Benoit's growth conditions), which makes me suspect that this is not an ideal measure of fitness.

@Gregory94
Collaborator

Using this normalization, do you account for the fact that the number of insertions is not constant across chromosomes? And for the differences in the number of reads between different SATAY experiments?

@leilaicruz
Member Author

Case #2:
image

image

In this case I at least get that the fitness of dbem1 is lower than that of dbem3, which is what I would naively expect.

When I try to predict the genes most correlated with dpl1, to get the positive and negative interactors, I get this:

image

The gene names with a "-new" label are the genes I predict that do not match the current annotations in SGD. The ones without the label match current knowledge.

The script can be found here: https://github.com/Gregory94/LaanLab-SATAY-DataAnalysis/blob/dev_Leila/Python_scripts/corrrelation-based-networks/fitness-WT-mutant-from-reads.py

@leilaicruz
Member Author

Using this normalization, do you account for the fact that the number of insertions is not constant across chromosomes? And for the differences in the number of reads between different SATAY experiments?

No, as a first attempt I just took the value in your 'reads_per_bp_80%' column for each gene and divided it by the value for HO.
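
A minimal sketch of that first attempt, for reference. The gene names and numbers are made up for illustration, and `relative_fitness` is a hypothetical helper, not the function in the linked script; it only shows the division by the HO value and does not correct for per-chromosome insertion density or per-experiment read depth.

```python
def relative_fitness(reads_per_bp, reference="HO"):
    """Divide each gene's reads per bp by the value of the reference locus."""
    ref_value = reads_per_bp[reference]
    if ref_value <= 0:
        raise ValueError(f"reference locus {reference!r} has no reads to normalize by")
    return {gene: value / ref_value for gene, value in reads_per_bp.items()}


# Hypothetical values standing in for the 'reads_per_bp_80%' column.
reads_per_bp_80 = {"HO": 2.0, "GENE_A": 0.5, "GENE_B": 1.5}

fitness = relative_fitness(reads_per_bp_80)
for gene, value in fitness.items():
    print(gene, value)
```

By construction the reference locus gets a value of 1, so every other gene is expressed relative to HO.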

I opened this issue to discuss other possible ideas to implement.

@Wteunisse
Collaborator

In the meeting with Werner, Liedewij suggested using an exponentially decaying weighted average over the 10 kb of non-coding regions around a gene. This gives more weight to non-coding regions close to the gene, as they are more likely to be affected by the same low-frequency effects than non-coding regions further away.
Furthermore, it was suggested to apply this (moving) average only over non-coding regions, and within a gene to simply add up all the reads/transposons and divide by the gene length. Otherwise, the regions around a gene would have a large impact on its calculated fitness, which we don't expect.
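
A rough sketch of how this could look, to make the suggestion concrete. Everything here is an assumption for illustration: the function names, the per-position input layout, and especially `decay_length` (the distance at which the weight drops to 1/e), which was not specified in the meeting.

```python
import math


def decayed_noncoding_average(positions, reads, is_coding, gene_start, gene_end,
                              window=10_000, decay_length=2_500.0):
    """Weighted mean of reads at non-coding positions within `window` bp of a gene.

    Each position gets weight exp(-distance / decay_length), where distance is
    measured to the nearest gene boundary, so nearby positions count more.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for pos, n_reads, coding in zip(positions, reads, is_coding):
        if coding:
            continue  # only non-coding positions enter the average
        if pos < gene_start:
            distance = gene_start - pos
        elif pos > gene_end:
            distance = pos - gene_end
        else:
            continue  # position lies inside the gene itself
        if distance > window:
            continue
        weight = math.exp(-distance / decay_length)
        weighted_sum += weight * n_reads
        weight_total += weight
    if weight_total == 0.0:
        return float("nan")  # no usable non-coding positions nearby
    return weighted_sum / weight_total


def reads_per_bp_in_gene(positions, reads, gene_start, gene_end):
    """Plain sum of reads inside the gene divided by the gene length (no averaging)."""
    total = sum(r for p, r in zip(positions, reads) if gene_start <= p <= gene_end)
    return total / (gene_end - gene_start + 1)
```

The ratio of `reads_per_bp_in_gene` to `decayed_noncoding_average` would then be one candidate for a locally normalized fitness value, but that last step is a design choice still open in this thread.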

@leilaicruz
Member Author

Notes from Werner to calculate the fitness from insertions/reads data

Suppose (to first approximation) that every possible transposon insertion (labelled here a, b, c, d, …, n) has the same initial frequency and is present in N_0 cells in the population. Each cell with that insertion then grows according to its fitness ω_n, commonly defined as inversely proportional to its doubling time τ_n.
After some time T, you stop the experiment and harvest the cells. At this point, the abundance of cells with insertion n, N_(n,T), is given by:

image

You observe N_(n,T), you control T, and you will have to estimate N_0 from insertions with known fitness, such as insertions in non-coding regions or in the HO locus.
Of course, N_0 may not be a single, uniform value for every region of the DNA. It probably has Poisson-distributed noise, there may be additional Poisson noise in the conversion from transposons to reads, and differences in lag time after transformation may play a role in your setup (but maybe you know that this is not the case). So you can expand this simple fitness calculation much further if needed. But as you say, you are still in the testing phase in terms of defining fitness.
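
The equation image in these notes is not reproduced in this text. As a sketch only, assume plain exponential growth, N_(n,T) = N_0 · exp(ω_n · T) with ω_n = ln(2) / τ_n; this form is an assumption consistent with the description, not necessarily the exact formula in the notes. Under that assumption, N_0 can be estimated from a reference insertion with known fitness and ω_n follows from the observed abundance. The function names and all numbers below are hypothetical.

```python
import math


def estimate_n0(n_ref_T, omega_ref, T):
    """Back-calculate N_0 from a reference insertion (e.g. HO) with known fitness."""
    return n_ref_T * math.exp(-omega_ref * T)


def fitness_from_abundance(n_T, n0, T):
    """Solve N_T = N_0 * exp(omega * T) for omega (assumed growth model)."""
    if n_T <= 0 or n0 <= 0 or T <= 0:
        raise ValueError("abundances and T must be positive")
    return math.log(n_T / n0) / T


# Hypothetical example: reference doubling time 90 min, harvest after 900 min.
T = 900.0
omega_ref = math.log(2) / 90.0
n0 = estimate_n0(5120.0, omega_ref, T)

# Hypothetical insertion observed at a lower abundance than the reference.
omega_mutant = fitness_from_abundance(160.0, n0, T)
doubling_time_mutant = math.log(2) / omega_mutant
print(n0, omega_mutant, doubling_time_mutant)
```

This ignores the noise sources and lag-time differences listed above, and it treats read counts as if they were cell counts, which is exactly the point questioned in the next comment.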

@leilaicruz
Member Author

I guess the question about the negative binomial has to do with the fact that the abundance of cells with insertion n at t = T, i.e. N_(n,T), is what we want to know, but what we observe is the number of reads for insertion n, which cannot be directly coupled to the number of cells carrying that insertion?

@Wteunisse
Collaborator

In response to Werner's suggestion: I made a growth model based purely on exponential growth. I originally made it to calculate the optimal growth time before harvesting. It is based on the doubling times measured in BioTek experiments on WT and bem1del cells, on measured lag times, and on the observation that the exponential growth phase lasted until an OD of about 0.3.
growthmodel_bem1del
Note that there is a huge variation in both the measured doubling times and the lag times, so this is just a very rough estimate (the two lines show + and - one standard deviation of the measured doubling time). In this model, I assumed that 1 in every 2175 cells is wild type and the others are bem1del. This value is calculated as the genome length divided by the total length of the genes in which a transposon insertion would lead to a wild-type phenotype.
As Werner pointed out, in the end it is very difficult to compare read counts because of the difference in lag time (here estimated at 600 min for WT and 1100 min for bem1del cells). As the lag time is not known for our SATAY data, it becomes difficult to compare values.

As can be seen from this figure, it might be very difficult to obtain a good resolution between neutral mutations in the bem1del and lethal mutations, as WT cells very quickly sweep the population.
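
For readers without the figure, a stripped-down sketch of such a two-population model. The initial WT fraction (1 in 2175) and the lag times (600 and 1100 min) are the values quoted above; the doubling times of 90 and 150 min are placeholders, not the measured values, and the model ignores saturation beyond the exponential phase.

```python
def population(n0, t, lag, doubling_time):
    """Cells at time t (min): no growth during the lag, then exponential growth."""
    growth_time = max(0.0, t - lag)
    return n0 * 2.0 ** (growth_time / doubling_time)


def wt_fraction(t, wt_doubling=90.0, mut_doubling=150.0,
                wt_lag=600.0, mut_lag=1100.0, wt_initial_fraction=1 / 2175):
    """Fraction of WT-like cells at time t in a mixed WT / bem1del culture.

    Doubling times are hypothetical placeholders; lag times and the initial
    fraction follow the values quoted in this comment.
    """
    wt = population(wt_initial_fraction, t, wt_lag, wt_doubling)
    mut = population(1.0 - wt_initial_fraction, t, mut_lag, mut_doubling)
    return wt / (wt + mut)


for t in range(0, 3001, 500):
    print(t, wt_fraction(t))
```

With any parameter set in which WT has both the shorter lag and the shorter doubling time, the WT fraction can only grow over time, which is the sweep effect described above.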

@leilaicruz leilaicruz added the help wanted Extra attention is needed label Oct 16, 2020
@Gregory94
Collaborator

I came across this article that discusses how to get genetic interactions from transposon sequencing data. It uses Bayesian statistics to determine whether a change in read count is significant and can be attributed to a change in fitness rather than to stochasticity and/or noise. For this they also use a negative binomial distribution, which they estimate using normal distributions. I don't understand the details yet, but maybe it is interesting.
They have also made a Python script available in which you can enter data (in the same format as our data) and it will determine significant changes in read counts. But they have a slightly different experimental setup, so I am not sure if it would be useful for us.

@leilaicruz
Member Author

leilaicruz commented Oct 29, 2020 via email

@EKingma
Collaborator

EKingma commented Nov 1, 2020

I came across this article that discusses how to get genetic interactions from transposon sequencing data. It uses Bayesian statistics to determine whether a change in read count is significant and can be attributed to a change in fitness rather than to stochasticity and/or noise. For this they also use a negative binomial distribution, which they estimate using normal distributions. I don't understand the details yet, but maybe it is interesting.
They have also made a Python script available in which you can enter data (in the same format as our data) and it will determine significant changes in read counts. But they have a slightly different experimental setup, so I am not sure if it would be useful for us.

Interesting! Their method is indeed a bit different from ours, but it might be an idea to store a sample of the induction culture (just before starting the reseed) @Wteunisse

@leilaicruz
Member Author

@EKingma, your comment follows from what they describe on page 2, right?
image

@Gregory94
Collaborator

This is another paper that also takes a sample before reseeding to compare it with the cells after selection. Their approach to determining genetic interactions is somewhat more similar to what we are currently trying to do. The paper by DeJesus et al. 2017 (the one I mentioned previously) compares its results with this paper, and the authors are not very enthusiastic about it, yet I think it might still be interesting.
