Normalization approaches to take meaningful values of reads per transposon insertion throughout the chromosomes #21

Gregory94 · 2020-09-03T11:55:21Z

To estimate the fitness level for each insertion, the data needs to be normalized to compensate for the fact that the likelihood of a transposon insertion is not constant throughout the chromosomes.
After normalization, the neutral regions are expected to have a similar number of reads. Using this result, a fitness level can be defined, ideally between 0 and 1, where 1 corresponds to the cells with the highest number of reads.

Gregory94 · 2020-09-03T12:05:13Z

Currently I'm developing a tool that determines this local background by searching, for each CDS, for the local neutral regions (i.e. regions that do not contain any known features such as genes, telomeres, ARS etc.). The current python script can be found here.

Gregory94 · 2020-09-18T09:14:06Z

I made a first attempt at estimating the fitness of the cells based on the number of reads. I went for the simplest approach, which is dividing a chromosome into sections (which I called windows) and normalizing the data within each window (see attached file). I did the normalization like this because the likelihood of an insertion is not constant throughout the chromosome, so some regions (typically around the centromeres) bear more transposons. After normalization I got a distribution that looked flatter (e.g. the average went from 0.69+/-1.53 reads per basepair per region before normalization to 0.66+/-1.33 after normalization). Finally, I divided all reads by the largest bar (assuming these cells had the highest fitness), which gives a value between 0 and 1 that I then define as the fitness level.
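To make the idea concrete, here is a minimal sketch of the window approach in plain Python. The exact normalization used in the script is not spelled out in this comment, so dividing by the window mean is only an assumption for illustration, and the function names, the window size and the read counts are all made up.

```python
from statistics import mean


def window_normalize(reads, window_size):
    """Divide each read count by the mean of its window.

    Assumption: 'normalizing within a window' means dividing by the
    window mean. The actual script may use a different scheme.
    """
    normalized = []
    for start in range(0, len(reads), window_size):
        window = reads[start:start + window_size]
        window_mean = mean(window)
        if window_mean == 0:
            # A window without reads stays at zero (avoids division by zero).
            normalized.extend(0.0 for _ in window)
        else:
            normalized.extend(r / window_mean for r in window)
    return normalized


def fitness_from_reads(normalized):
    """Scale the values so that the largest bar gets a fitness of 1."""
    largest = max(normalized)
    if largest == 0:
        return [0.0 for _ in normalized]
    return [value / largest for value in normalized]


# Made-up read counts for six regions; the last three mimic a region
# with a higher insertion likelihood.
reads_per_region = [2, 4, 6, 20, 40, 60]
normalized = window_normalize(reads_per_region, window_size=3)
fitness = fitness_from_reads(normalized)
print(normalized)
print(fitness)
```

With this choice both windows end up with the same normalized profile, which also shows the weak point of the approach: the result depends entirely on where the window boundaries are placed.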

I think this approach might not be so good after all, since it is unclear how to define the windows for normalization.
Also, there are typically outliers (i.e. a few bars with many more reads than the other bars), even after normalization.
Also, I would expect the number of reads in the noncoding regions to be more similar to each other than they are now, which suggests the normalization is not entirely accurate.
So I am going to look at other approaches as well.

Fitness_from_reads1.pdf

leilaicruz · 2020-09-18T12:34:01Z

Great Greg!
The next step is to validate this fitness measure with the available information on positive and negative interactors of dpl1, using the correlation matrix of positive and negative scores (score=ab-a*b).
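As a small sketch of the score above: this assumes that `ab` is the fitness of the double mutant and `a` and `b` are the fitness values of the two single mutants. The function names and the `tolerance` cutoff are hypothetical and only there for illustration.

```python
def interaction_score(fitness_ab, fitness_a, fitness_b):
    """score = ab - a*b, with ab assumed to be the double-mutant fitness."""
    return fitness_ab - fitness_a * fitness_b


def classify_interaction(score, tolerance=0.05):
    """Label a score as positive, negative or neutral.

    The tolerance is a made-up cutoff, not a value from this thread.
    """
    if score > tolerance:
        return "positive"
    if score < -tolerance:
        return "negative"
    return "neutral"


# Made-up fitness values between 0 and 1.
score_pos = interaction_score(0.9, 0.8, 0.9)
score_neg = interaction_score(0.3, 0.8, 0.9)
print(score_pos, classify_interaction(score_pos))
print(score_neg, classify_interaction(score_neg))
```

A positive score means the double mutant does better than expected from the two single mutants, a negative score means it does worse.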

Gregory94 · 2020-10-02T08:12:28Z

Normalization of the reads is updated with a method introduced in the paper by Gallagher et al. 2011. This is done in the code genomicfeatures_dataframe_with_normalization. The normalization equation is given in the attached figure.

For the normalization of the interchromosomal regions, only the noncoding regions are considered. Attached are two plots of the same chromosome (chromosome V from WT1 from Benoit): the first is not normalized and the second is normalized using this code.

This code creates a dataframe that contains all important information: the feature name and type (e.g. gene, telomere, centromere etc.), essentiality, genomic position, length, number of insertions, number of reads, normalized number of reads, number of insertions and reads in the central 80% of the genes, etc.

Currently it still needs to be investigated how to determine a relevant window size, and the code needs to be properly tested and verified.

leilaicruz · 2020-10-03T16:27:38Z

To complete the fitness calculation we would also need the normalized reads of the mutant: dpl1d

leilaicruz · 2020-10-05T10:57:13Z

Normalized reads from the mutant (https://github.com/Gregory94/LaanLab-SATAY-DataAnalysis/tree/dev_Leila/Python_scripts/fitness_from%20SATAY_codes)
It would be nice to add the standard name next to the feature name, so we can easily identify ORFs.

Gregory94 · 2020-10-05T13:45:55Z

I added the gene aliases to the dataframe dna_df2 (see genomicfeatures_dataframe_with_normalization.py). The dataframe now includes the systematic gene names (e.g. YLR229C) and the standard name(s) (e.g. CDC42).

Wteunisse · 2020-10-16T08:45:41Z

An easy addition to this, as discussed in the meeting with Werner, is to divide the number of reads by the number of transposons, given that multiple transposons and their reads are added up as a result of the bin size or a moving average. This should help us get rid of the peak around the centromere, because this peak is only due to an increased number of transposons and not due to more reads per transposon.
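A minimal sketch of this division per bin, in plain Python. The function name and the counts are made up, and returning `None` for bins without transposons is just one possible choice.

```python
def reads_per_insertion(reads_per_bin, insertions_per_bin):
    """Divide the summed reads in each bin by the number of transposons in it."""
    if len(reads_per_bin) != len(insertions_per_bin):
        raise ValueError("both lists need one entry per bin")
    result = []
    for reads, insertions in zip(reads_per_bin, insertions_per_bin):
        if insertions == 0:
            # Reads per transposon is undefined for a bin without transposons.
            result.append(None)
        else:
            result.append(reads / insertions)
    return result


# Made-up bins; the third bin mimics the region around the centromere,
# with more transposons but the same number of reads per transposon.
insertions = [2, 3, 10, 3, 0]
reads = [20, 30, 100, 30, 0]
print(reads_per_insertion(reads, insertions))
```

In this toy example the centromere bin stands out in the raw reads (100 against 20 to 30) but not in the reads per transposon.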

Wteunisse · 2020-10-16T08:49:30Z

Leila commented here that we could check whether this is useful by calculating the variation across the supposed neutral regions over the chromosome, as after this normalization we would expect little variation in the number of reads over all the neutral regions.
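One way to put a number on this check is the coefficient of variation (standard deviation divided by the mean) over the neutral regions, before and after the normalization. This is only a sketch of that idea: the choice of the coefficient of variation is an assumption, and the values below are made up.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    average = mean(values)
    if average == 0:
        raise ValueError("mean is zero, coefficient of variation is undefined")
    return pstdev(values) / average


# Made-up values for four neutral regions on one chromosome.
raw_reads = [20, 30, 100, 30]
reads_per_transposon = [10.0, 9.0, 11.0, 10.0]

cv_before = coefficient_of_variation(raw_reads)
cv_after = coefficient_of_variation(reads_per_transposon)
print(cv_before, cv_after)
```

If the normalization works as intended, the value after normalization should be clearly lower than the value before.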

Gregory94 · 2020-10-16T08:55:18Z

Yes, I have already added this to the dataframe variable (see also the python notebook of the same code) under the name Nreadsperinsrt, and I also did this for the truncated genes, where I ignore the first and last 100 bp of each gene. I took the number of transposons per genomic feature (i.e. gene, telomere, noncoding region etc.). I haven't done any statistics on this data yet.
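For illustration, this is roughly what the truncated version comes down to. It is a standalone sketch and not the code from the script: the function name, the input layout (a list of position and read count pairs) and the inclusive boundaries are assumptions.

```python
def reads_per_insertion_truncated(gene_start, gene_end, insertions, trim_bp=100):
    """Reads per insertion for a gene, ignoring the first and last trim_bp.

    insertions is a list of (position, reads) pairs. The boundaries of the
    truncated gene are taken as inclusive, which is an assumption.
    """
    lower = gene_start + trim_bp
    upper = gene_end - trim_bp
    kept = [reads for position, reads in insertions if lower <= position <= upper]
    if not kept:
        # No insertions left after truncation (or the gene is too short).
        return None
    return sum(kept) / len(kept)


# Made-up gene from 1000 to 2000 bp with four insertions.
example_insertions = [(1050, 50), (1200, 10), (1500, 20), (1950, 40)]
truncated_value = reads_per_insertion_truncated(1000, 2000, example_insertions)
print(truncated_value)
```

Here the insertions at 1050 and 1950 fall in the first and last 100 bp and are dropped, so only the two insertions in the middle count.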

Gregory94 · 2020-10-23T13:38:39Z

See also this question at the satay forum about normalization.

Gregory94 created this issue from a note in SATAY-analysis-workflow-board (In progress) Sep 3, 2020

Gregory94 self-assigned this Sep 3, 2020

Gregory94 added the data exploration and data processing labels Sep 3, 2020

Gregory94 changed the title ~~Normalizing number of insertions and reads per CDS using the number of insertions in local neutral DNA.~~ Estimating fitness level for each insertion based on the number of reads Sep 18, 2020

leilaicruz changed the title ~~Estimating fitness level for each insertion based on the number of reads~~ Normalization approaches to take meaningful values of reads per transposon insertion throughout the chromosomes Oct 9, 2020

leilaicruz assigned Wteunisse Oct 9, 2020

