Normalization approaches to obtain meaningful values of reads per transposon insertion throughout the chromosomes #21
Comments
Currently I'm developing a tool that determines this local background by searching, for each CDS, for the local neutral regions (i.e. regions that do not seem to code for anything special such as genes, telomeres, ARS etc.). The current Python script can be found here.
I made a first attempt at estimating the fitness of the cells based on the number of reads. I went for the simplest approach: dividing a chromosome into sections (which I called windows) and normalizing the data within each window (see attached file). I normalized this way because the likelihood of an insertion is not constant throughout the chromosome, so some regions (typically around the centromeres) tend to bear more transposons. After normalization I got a distribution that looked flatter (e.g. the average went from 0.69 ± 1.53 reads per base pair per region before normalization to 0.66 ± 1.33 after). Finally, I divided all reads by the largest bar (assuming these cells had the highest fitness), which gives a value between 0 and 1 that I then define as the fitness level. I think this approach might not be so good after all, since it is unclear how the windows for normalization should be defined.
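To make the window approach concrete, here is a minimal sketch. It assumes that "normalizing within a window" means dividing each value by the mean of its own window; the comment above does not say which statistic was actually used, so this is only one possible reading. The function name `window_normalize` and its parameters are hypothetical.

```python
import numpy as np

def window_normalize(reads, window_size):
    """Normalize reads per window, then scale to a 0..1 fitness level.

    Assumption: each window is divided by its own mean read count.
    """
    reads = np.asarray(reads, dtype=float)
    normalized = np.zeros_like(reads)

    # Walk over the chromosome in consecutive, non-overlapping windows.
    for start in range(0, len(reads), window_size):
        window = reads[start:start + window_size]
        mean = window.mean()
        if mean > 0:  # windows without any reads stay at zero
            normalized[start:start + window_size] = window / mean

    # Divide by the largest value so that the result lies between 0 and 1.
    peak = normalized.max()
    fitness = normalized / peak if peak > 0 else normalized
    return normalized, fitness

# Toy example: the second window has ten times more reads than the first,
# but the same relative profile, so both end up with identical values.
normalized, fitness = window_normalize([1, 3, 10, 30, 0, 0], window_size=2)
```

The result depends directly on `window_size`, which is exactly the weak point mentioned above.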
Great, Greg!
Normalization of the reads is updated with a method introduced in the paper by Gallagher et al. 2011. This is done in the code genomicfeatures_dataframe_with_normalization. The normalization equation is given in the attached figure. This code creates a dataframe that contains all important information: the feature name and type (e.g. gene, telomere, centromere etc.), essentiality, genomic position, length, number of insertions, number of reads, normalized number of reads, number of insertions and reads in the central 80% of the genes, etc. It still needs to be investigated how to determine a relevant window size, and the code needs to be properly tested and verified.
To complete the fitness calculation we would also need the normalized reads of the mutant dpl1d.
I added the gene aliases to the dataframe dna_df2 (see genomicfeatures_dataframe_with_normalization.py). It now includes the systematic gene names (e.g. YLR229C) and the standard name(s) (e.g. CDC42).
An easy addition to this, as discussed in the meeting with Werner, is to divide the read number by the number of transposons, given that we have multiple transposons and reads added up as a result of the bin size or a moving average. This should help us to get rid of the peak around the centromere because this peak is only due to an increased number of transposons and not due to more reads per transposon. |
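A minimal sketch of this reads-per-transposon idea, assuming the reads and insertions have already been summed per bin. The function name `reads_per_insertion` and the choice to return NaN for bins without insertions are assumptions made for illustration, not something taken from the existing code.

```python
import numpy as np

def reads_per_insertion(reads_per_bin, insertions_per_bin):
    """Divide the summed reads in each bin by the number of insertions in it.

    Bins without any insertion get NaN, since reads per transposon is
    undefined there (assumption: such bins should be ignored downstream).
    """
    reads = np.asarray(reads_per_bin, dtype=float)
    insertions = np.asarray(insertions_per_bin, dtype=float)

    result = np.full(reads.shape, np.nan)
    # Only divide where at least one insertion exists; other entries keep NaN.
    np.divide(reads, insertions, out=result, where=insertions > 0)
    return result

# Toy example: the second bin (e.g. near a centromere) holds four times more
# insertions and reads, but the reads per insertion are the same.
ratio = reads_per_insertion([100, 400, 0], [10, 40, 0])
```

In this toy case the peak in the raw reads disappears after the division, which is the effect hoped for around the centromeres.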
Leila commented here that we could check if this is useful by calculating the variation in the supposed neutral regions over the chromosome, as after this normalization we would expect little variation in the number of reads over all the neutral regions. |
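One way to run this check is to compare the coefficient of variation (standard deviation divided by mean) of the neutral regions before and after normalization. The sketch below assumes there is one value per neutral region; the function name `coefficient_of_variation` and the example numbers are made up for illustration.

```python
import numpy as np

def coefficient_of_variation(values):
    """Return std/mean of the given values, ignoring NaN entries."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]
    if values.size == 0:
        return float("nan")
    mean = values.mean()
    if mean == 0:
        return float("nan")
    return values.std() / mean

# Hypothetical reads in four neutral regions, before and after normalization.
before = [10, 40, 12, 38]
after = [10, 10, 12, 11]

cv_before = coefficient_of_variation(before)
cv_after = coefficient_of_variation(after)
# A useful normalization should give cv_after clearly below cv_before.
```

Because the coefficient of variation is scale-free, it can be compared between raw and normalized reads even when their absolute values differ.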
yes, I have already added this to the dataframe variable (see also the python notebook of the same code) using the name |
See also this question at the satay forum about normalization. |
To estimate the fitness level for each insertion, the data needs to be normalized to compensate for the fact that the likelihood of transposon insertions is not constant throughout the chromosomes.
After normalization, the neutral regions are expected to have a similar number of reads. Using this result, a fitness level can be defined, ideally between 0 and 1, where 1 is defined by the cells that have the highest number of reads.