This repository has been archived by the owner on Nov 8, 2021. It is now read-only.

Normalization approaches to take meaningful values of reads per transposon insertion throughout the chromosomes #21

Open
Gregory94 opened this issue Sep 3, 2020 · 11 comments

Comments

@Gregory94
Collaborator

Gregory94 commented Sep 3, 2020

To estimate a fitness level for each insertion, the data needs to be normalized to compensate for the fact that the likelihood of a transposon insertion is not constant throughout the chromosomes.
After normalization, the neutral regions are expected to have a similar number of reads. Using this result, a fitness level can be defined, ideally between 0 and 1, where 1 corresponds to the cells with the highest number of reads.

@Gregory94 Gregory94 created this issue from a note in SATAY-analysis-workflow-board (In progress) Sep 3, 2020
@Gregory94 Gregory94 self-assigned this Sep 3, 2020
@Gregory94
Collaborator Author

Gregory94 commented Sep 3, 2020

Currently I'm developing a tool that determines this local background by searching, for each CDS, for the local neutral regions (i.e. regions that do not seem to code for anything special, such as genes, telomeres, ARS etc.). The current Python script can be found here.
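
As a rough illustration of what such a search could look like (this is not the linked script), the sketch below treats every stretch of a chromosome that is not covered by an annotated feature as a candidate neutral region. The function name `neutral_regions`, the 1-based inclusive coordinates and the example features are all assumptions made for this example.

```python
def neutral_regions(features, chrom_length):
    """Return the (start, end) gaps not covered by any feature.

    Assumption: coordinates are 1-based and inclusive, and `features`
    holds every annotated element (genes, telomeres, ARS, ...).
    """
    gaps = []
    cursor = 1  # first position not yet covered by a feature
    for start, end in sorted(features):
        if start > cursor:
            gaps.append((cursor, start - 1))
        # Overlapping features must not move the cursor backwards.
        cursor = max(cursor, end + 1)
    if cursor <= chrom_length:
        gaps.append((cursor, chrom_length))
    return gaps


# Made-up features on a made-up chromosome of 1000 bp.
features = [(100, 200), (150, 300), (500, 600)]
gaps = neutral_regions(features, chrom_length=1000)
print(gaps)
```

For a given CDS, the "local" neutral regions would then be the gaps closest to that CDS; how many gaps to take and how far away they may be is left open here.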

@Gregory94 Gregory94 changed the title from "Normalizing number of insertions and reads per CDS using the number of insertions in local neutral DNA." to "Estimating fitness level for each insertion based on the number of reads" Sep 18, 2020
@Gregory94
Collaborator Author

Gregory94 commented Sep 18, 2020

I made a first attempt at estimating the fitness of the cells based on the number of reads. I went for the simplest approach, which is dividing a chromosome into sections (which I called windows) and normalizing the data within each window (see attached file). I did the normalization like this because the likelihood of an insertion is not constant throughout a chromosome, so some regions (typically around the centromeres) tend to bear more transposons. After normalization I got a distribution that looked flatter (e.g. the average went from 0.69+/-1.53 reads per basepair per region before normalization to 0.66+/-1.33 after normalization). Finally, I divided all reads by the largest bar (assuming these cells had the highest fitness), which gives a value between 0 and 1 that I then define as the fitness level.
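
A minimal sketch of this kind of window-based approach is given below. The comment above does not say exactly how the data was normalized inside a window, so dividing each value by the mean of its window is an assumption, as are the function names and the example numbers.

```python
from statistics import mean


def normalize_by_window(reads, window_size):
    """Divide each value by the mean of its window (assumed scheme)."""
    normalized = []
    for start in range(0, len(reads), window_size):
        window = reads[start:start + window_size]
        window_mean = mean(window)
        # A window without any reads is left at zero to avoid dividing by zero.
        normalized.extend(
            r / window_mean if window_mean > 0 else 0.0 for r in window
        )
    return normalized


def fitness_from_reads(normalized):
    """Scale values to the range [0, 1] by dividing by the largest value."""
    largest = max(normalized, default=0)
    return [r / largest if largest > 0 else 0.0 for r in normalized]


# Made-up read counts per bar; the second window has ten times more reads.
reads = [2, 4, 6, 20, 40, 60]
normalized = normalize_by_window(reads, window_size=3)
fitness = fitness_from_reads(normalized)
print(normalized)
print(fitness)
```

In this toy example both windows end up with the same normalized profile, which is the flattening effect the windows are meant to give. It also shows the weakness: the outcome depends entirely on where the window borders are placed.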

I think this approach might not be so good after all, since it is unclear how the windows for normalization should be defined.
Also, there are typically outliers (i.e. a few bars with many more reads than the other bars), even after normalization.
Also, I would expect the number of reads in the noncoding regions to be more similar to each other than they are now, which suggests the normalization is not entirely accurate.
So I am going to look at other approaches as well.

Fitness_from_reads1.pdf

@leilaicruz
Member

Great, Greg!
The next step is to validate this fitness measure with the available information on positive and negative interactors of dpl1, using the correlation matrix of positive and negative scores (score = ab - a*b).
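
For reference, the score can be written out as below. This assumes that a and b stand for the fitness of the two single mutants and ab for the fitness of the double mutant, which the formula above does not spell out. The function name and the fitness values are made up.

```python
def interaction_score(fitness_a, fitness_b, fitness_ab):
    """score = ab - a*b: observed double-mutant fitness minus the
    product of the single-mutant fitness values (assumed meaning)."""
    return fitness_ab - fitness_a * fitness_b


# Made-up fitness values on a 0..1 scale.
examples = {
    # The double mutant does worse than a*b: negative score.
    "negative": interaction_score(0.8, 0.5, 0.2),
    # The double mutant does better than a*b: positive score.
    "positive": interaction_score(0.8, 0.5, 0.6),
}
print(examples)
```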

@Gregory94
Collaborator Author

The normalization of the reads has been updated to use a method introduced in the paper by Gallagher et al. (2011). This is done in the code genomicfeatures_dataframe_with_normalization. The normalization equation is given in the attached figure.
[Image: Read_Normalization_Equation]
For the normalization of the interchromosomal regions, only the noncoding regions are considered. Attached are two plots of the same chromosome (chromosome V from WT1 from Benoit): the first is not normalized and the second is normalized using this code.
[Image: chrV_nonnorm]
[Image: chrV_norm]

This code creates a dataframe that contains all the important information: the feature name and type (e.g. gene, telomere, centromere, etc.), essentiality, genomic position, length, number of insertions, number of reads, normalized number of reads, number of insertions and reads in the central 80% of the genes, etc.

It still needs to be investigated how to determine a relevant window size, and the code needs to be properly tested and verified.

@leilaicruz
Member

To complete the fitness calculation we would also need the normalized reads of the mutant: dpl1d.

@leilaicruz
Member

leilaicruz commented Oct 5, 2020

@Gregory94
Collaborator Author

I added the gene aliases to the dataframe dna_df2 (see genomicfeatures_dataframe_with_normalization.py). It now includes the systematic gene names (e.g. YLR229C) and the standard name(s) (e.g. CDC42).

@leilaicruz leilaicruz changed the title from "Estimating fitness level for each insertion based on the number of reads" to "Normalization approaches to take meaningful values of reads per transposon insertion throughout the chromosomes" Oct 9, 2020
@Wteunisse
Collaborator

An easy addition to this, as discussed in the meeting with Werner, is to divide the number of reads by the number of transposons, given that multiple transposons and their reads are added up as a result of the bin size or a moving average. This should help us get rid of the peak around the centromere, because this peak is only due to an increased number of transposons and not due to more reads per transposon.
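
A small sketch of this idea with fixed bins is shown below. The function name, the 0-based positions and the example numbers are assumptions for illustration, and bins without any transposon are returned as `None` rather than as zero reads.

```python
def reads_per_insertion(positions, reads, bin_size, chrom_length):
    """Sum reads and count transposons per bin, then divide the two.

    Assumption: positions are 0-based and smaller than chrom_length.
    """
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    reads_in_bin = [0] * n_bins
    insertions_in_bin = [0] * n_bins
    for pos, n_reads in zip(positions, reads):
        idx = pos // bin_size
        reads_in_bin[idx] += n_reads
        insertions_in_bin[idx] += 1
    # A bin without transposons has no defined value, so it is marked None.
    return [
        r / n if n > 0 else None
        for r, n in zip(reads_in_bin, insertions_in_bin)
    ]


# Made-up data: the first bin has many transposons (as near a centromere).
positions = [10, 20, 30, 150, 250, 260]
reads = [5, 15, 10, 8, 3, 9]
per_insertion = reads_per_insertion(positions, reads, bin_size=100, chrom_length=400)
print(per_insertion)
```

The first bin holds the most reads in total (30), but per transposon it is in the same range as the other bins, which is the effect expected around the centromere.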

@Wteunisse
Collaborator

Leila commented here that we could check whether this is useful by calculating the variation across the supposedly neutral regions over the chromosome, since after this normalization we would expect little variation in the number of reads over all the neutral regions.
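
One possible way to put a number on this check is the coefficient of variation (standard deviation divided by the mean) over the neutral regions, before and after dividing by the number of transposons. The choice of this statistic, the function name and the numbers below are assumptions, not something that was agreed on in the thread.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation relative to the mean."""
    avg = mean(values)
    return stdev(values) / avg if avg > 0 else float("nan")


# Made-up reads and transposon counts for four neutral regions.
reads_neutral = [30, 90, 12, 60]
insertions_neutral = [3, 9, 1, 6]
reads_per_insertion_neutral = [
    r / n for r, n in zip(reads_neutral, insertions_neutral)
]

cv_before = coefficient_of_variation(reads_neutral)
cv_after = coefficient_of_variation(reads_per_insertion_neutral)
print(cv_before, cv_after)
```

If the normalization helps, `cv_after` should come out clearly smaller than `cv_before`, as it does for these made-up numbers.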

@Gregory94
Collaborator Author

Yes, I have already added this to the dataframe variable (see also the Python notebook of the same code) under the name Nreadsperinsrt, and I also did this for the truncated gene, where I ignore the first and last 100 bp of the genes. I took the number of transposons per genomic feature (i.e. gene, telomere, noncoding region, etc.). I haven't done any statistics on this data yet.
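
For readers who do not want to open the notebook, the truncated-gene variant could look roughly like the sketch below. This is not the code that fills Nreadsperinsrt; the function name, the inclusive coordinates and the example values are assumptions.

```python
def reads_per_insertion_truncated(gene_start, gene_end, positions, reads, trim=100):
    """Reads per transposon inside a gene, ignoring `trim` bp at both ends.

    Assumption: gene_start and gene_end are inclusive coordinates.
    """
    first_kept = gene_start + trim
    last_kept = gene_end - trim
    kept = [
        n_reads
        for pos, n_reads in zip(positions, reads)
        if first_kept <= pos <= last_kept
    ]
    # Genes without transposons in the kept part have no defined value.
    if not kept:
        return None
    return sum(kept) / len(kept)


# Made-up gene from 1000 to 2000 with four transposons; the two near the
# gene ends carry many reads and are ignored.
value = reads_per_insertion_truncated(
    1000, 2000, positions=[1050, 1200, 1500, 1950], reads=[40, 6, 10, 50]
)
print(value)
```

Genes shorter than twice the trim length return `None` in this sketch, since nothing is left after truncation.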

@Gregory94
Collaborator Author

See also this question at the satay forum about normalization.


3 participants