TdG09 program: identifying changes in selective constraints

This program is an implementation of the model described in:

Tamuri AU, dos Reis M, Hay AJ, Goldstein RA (2009) Identifying Changes in Selective Constraints: Host Shifts in Influenza. PLoS Comput Biol 5(11): e1000564. doi:10.1371/journal.pcbi.1000564

This phylogenetic model uses site-specific amino acid frequencies to distinguish patterns of substitution between two (or more) lineages or groups of taxa. We first estimate the site-specific amino acid frequencies at a given location in a protein alignment, assuming that the pattern of substitution is homogeneous across all branches. We then estimate multiple sets of site-specific frequencies, allowing them to differ among branches, which gives a non-homogeneous model of evolutionary change. Using statistical tests, we then assess whether the non-homogeneous model provides a significantly better fit to the data than the homogeneous model.

Tutorial

Install Java: The program requires a recent version of the Java Runtime (JRE), which, if not already installed, can be downloaded from Oracle. Linux packages are usually available in the distribution's repository (e.g. sudo apt-get install openjdk-7-jre for Debian, Ubuntu etc. distributions).

Download the program: The latest version of the program is available for download from the repository. The download includes a compiled binary as well as a build.xml to compile from sources using ant. Unzip the download and check that the program works:

```
$ ls -F
README.md  build.xml  dist/  etc/  lib/  src/

$ java -cp dist/tdg09.jar tdg09.Analyse
Usage: java -cp tdg09.jar tdg09.Analyse [options]
 Options:
 * -alignment
      Alignment in PHYLIP format
 * -groups
      Group labels to partition tree e.g. Av Hu
   -threads
      Number of threads to use
      Default: 1
 * -tree
      Tree in NEWICK format
```

Preparing your data: The program requires a protein sequence alignment in PHYLIP format and the corresponding tree in Newick format. The tree branch lengths should be optimised for an amino acid model, such as WAG, using a program such as RAxML or PAML. Most importantly, every sequence name must be prefixed by a two-letter identifier that indicates the lineage/grouping of that sequence. For example, to identify changes in selective constraints between avian and human flu viral proteins, we can prefix every sequence name with 'Av' or 'Hu' to indicate its lineage:
```
$ cat etc/H1.faa
434 566
Hu_HA_AAX56530_H1N1      MKAKLLVLLCAFTATYADTI...
Hu_HA_AAY78939_H1N2      MKVKLLILLCTFTATYADTI...
Av_HA_ABB19507_H1N6      MEAKLFVLFCTFTVLKADTI...
Av_HA_ABB19518_H1N1      MEAKLFVLFCTFTALKADTI...
...
```

Both the sequence alignment and the tree must follow this convention.

```
$ cat etc/H1.tree
((Av_HA_ABB19607_H1N1:0.0852880,Av_HA_ABG88212_H1N1:0.1036700):
0.0248320,((Av_HA_ABB19618_H1N1:...
```

This prefix convention is the only way the tdg09 program determines which of the different non-homogeneous models a particular tree branch should use. Example data sets of flu viral proteins (used in Tamuri et al. 2009) are included in the etc/ directory.
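Because the grouping depends entirely on these prefixes, it can be worth checking them before a run. The following Python sketch is not part of TdG09: the function names are hypothetical, it assumes a PHYLIP file with one sequence per line and a Newick tree without quoted names, and the leaf `Hu_example` is made up to keep the tree short. The other names are taken from the listings above, with sequences truncated.

```python
import re

def names_from_phylip(text):
    """Return sequence names, assuming one 'name sequence' pair per line."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return [ln.split()[0] for ln in lines[1:]]  # lines[0] is the "count length" header

def names_from_newick(text):
    """Return leaf names: tokens that directly follow '(' or ','."""
    return re.findall(r"[(,]\s*([^(),:;\s]+)", text)

def missing_prefix(names, groups):
    """Return the names that do not start with any of the group labels."""
    return [n for n in names if not n.startswith(tuple(groups))]

alignment = """4 20
Hu_HA_AAX56530_H1N1      MKAKLLVLLCAFTATYADTI
Hu_HA_AAY78939_H1N2      MKVKLLILLCTFTATYADTI
Av_HA_ABB19507_H1N6      MEAKLFVLFCTFTVLKADTI
Av_HA_ABB19518_H1N1      MEAKLFVLFCTFTALKADTI
"""
# Hu_example is a hypothetical leaf added for illustration.
tree = ("((Av_HA_ABB19607_H1N1:0.0852880,Av_HA_ABG88212_H1N1:0.1036700)"
        ":0.0248320,Hu_example:0.1);")

groups = ["Av", "Hu"]
print(missing_prefix(names_from_phylip(alignment), groups))  # → []
print(missing_prefix(names_from_newick(tree), groups))       # → []
```

Any name printed by `missing_prefix` would need to be renamed in both the alignment and the tree before running the analysis.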

Running the program: The command-line options for the program are:
- -alignment : the sequence alignment file in PHYLIP format e.g. etc/H1.faa
- -tree : the tree file in Newick format e.g. etc/H1.tree
- -groups : the two-letter identifiers used to partition the sequences e.g. Av Hu
- -threads : the number of CPU cores/threads to use e.g. 2

The program prints its messages to standard output, so this should be captured using > or tee. We are now ready to run the program:
```
$ java -cp dist/tdg09.jar tdg09.Analyse -alignment etc/H1.faa \
-tree etc/H1.tree -groups Av Hu -threads 2 > H1_out.txt
```
or if you have 'tee' installed:
```
$ java -cp dist/tdg09.jar tdg09.Analyse -alignment etc/H1.faa \
-tree etc/H1.tree -groups Av Hu -threads 2 | tee H1_out.txt
```
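When several data sets need to be analysed, the same command can be assembled from a script. This is a minimal Python sketch, not part of TdG09; `build_command` is a hypothetical helper that uses only the options listed above and assumes the jar sits at `dist/tdg09.jar`.

```python
import shlex

def build_command(alignment, tree, groups, threads=1, jar="dist/tdg09.jar"):
    """Assemble the documented tdg09.Analyse command line as an argument list."""
    return (["java", "-cp", jar, "tdg09.Analyse",
             "-alignment", alignment, "-tree", tree, "-groups"]
            + list(groups) + ["-threads", str(threads)])

cmd = build_command("etc/H1.faa", "etc/H1.tree", ["Av", "Hu"], threads=2)
print(shlex.join(cmd))

# To run it and capture standard output (requires Java and the jar):
#   import subprocess
#   with open("H1_out.txt", "w") as out:
#       subprocess.run(cmd, stdout=out, check=True)
```

Passing the arguments as a list avoids shell quoting problems if a file path contains spaces.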

Inspect the results: In this example, program output is captured in H1_out.txt:

```
$ cat H1_out.txt
StartTime: 2013-03-19 13:44:10.129
WorkingDirectory: /Users/Tester/Documents/tdg09
Options: -alignment etc/H1.faa -tree etc/H1.tree -groups Av Hu -threads 2

TreeFile: /Users/Tester/Documents/tdg09/etc/H1.tree
AlignmentFile: /Users/Tester/Documents/tdg09/etc/H1.faa

Alignment:
  SequenceCount: 434
  SiteCount: 566

Groups: [Av, Hu]

# The internal nodes of the tree are not labelled. Labelling...
# Node 432 from [Av, Hu] resolved
# Node 432 from [Av, Hu] resolved
# Assuming that root of tree is in group [Av]
# Switching from group [Av] to [Hu] at branch 432..431

LabelledTree: >
    (((((((((((Av_HA_ABB19607_H1N1:0.0852880,Av_HA_ABG88212_H1N1:0.1036700)
    Av:0.0248320,((Av_HA_ABB19618_H1N1:0.0910450,Av_HA_ABG88201_H1N1:0.0810480)
    ...
    Hu:0.0618410)Hu:0.0978830)Hu:0.0436020)Hu:0.0482480)Hu:0.0255390)Hu:0.0362960)
    Hu:0.0696670)Hu:0.0620790)Hu:0.0268645);
```

The output contains a "LabelledTree" that shows the inferred lineage for each ancestral node. This should be checked in a tree viewing program (such as Dendroscope) to make sure that the lineages are correct. If not, they can be modified and the analysis can be rerun with the new, custom-labelled, tree. The output continues:

```    
# 2013-03-19 22:44:10.33 - site 1 complete.
# 2013-03-19 22:44:11.366 - site 3 complete.
# 2013-03-19 22:44:11.366 - site 4 complete.    

...

LrtResults:
#   Site,  delta lnL,  dof, LRT,       FDR
- [  204,  21.850690,  3,   0.0000000, 0.0000003 ]
- [  169,  11.528011,  1,   0.0000016, 0.0001542 ]
- [  289,  10.733096,  2,   0.0000218, 0.0008550 ]
- [  252,  10.927903,  2,   0.0000180, 0.0008796 ]
- [    9,   8.225516,  1,   0.0000499, 0.0008895 ]
- [  300,  15.027774,  5,   0.0000144, 0.0009396 ]
- [   62,   8.261257,  1,   0.0000481, 0.0009423 ]
- [  303,   8.261836,  1,   0.0000480, 0.0010463 ]
- [  239,   9.647869,  2,   0.0000646, 0.0010545 ]
- [  315,   9.483944,  2,   0.0000761, 0.0011468 ]
- [  253,   8.262439,  1,   0.0000480, 0.0011764 ]

...

FullResults:
# Site, WAG+ssF params, WAG+ssF lnL, WAG+lssF params, WAG+lssF params, delta lnL, dof, LRT, FDR
- [    1,  NA,         NA, NA,          NA,         NA, NA,        NA,        NA ]
- [    2,  3,  -21.488149,  5,  -13.378828,   8.109321,  2, 0.0003007, 0.0025627 ]
- [    3,  2,  -48.561153,  3,  -47.763333,   0.797820,  1, 0.2065223, 0.3489515 ]
- [    4,  2,  -24.137478,  3,  -23.407924,   0.729554,  1, 0.2270721, 0.3708845 ]
- [    5,  2,  -12.611075,  3,  -12.456865,   0.154209,  1, 0.5786522, 0.6593943 ]
- [    6,  3,  -23.856049,  5,  -22.761440,   1.094609,  2, 0.3346706, 0.5084918 ]
- [    7,  2,  -44.839152,  3,  -43.761797,   1.077355,  1, 0.1421333, 0.2509740 ]
- [    8,  NA,         NA, NA,          NA,         NA, NA,        NA,        NA ]   

...

```
Of interest are the "LrtResults" and "FullResults" tables. 

The LrtResults table lists the polymorphic sites (those at which the non-homogeneous model was estimated) and orders them by false discovery rate (FDR), a correction to the likelihood ratio test P-value that accounts for multiple hypothesis testing. At a given FDR cutoff (e.g. 0.05), the sites below the cutoff are those at which the non-homogeneous model provides a statistically significant improvement over the homogeneous model, indicating that the patterns of substitution differ between the groups/lineages.
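To make the FDR column concrete, here is a textbook Benjamini-Hochberg calculation as a minimal Python sketch. Whether TdG09 uses exactly this procedure is an assumption; the source does not say. The function names and the four p-values are made up for illustration. For the rounded rows shown above, the FDR column appears consistent with the plain ratio p × m / rank with m = 196 tested sites, which is what `bh_ratios` computes, but that is an observation from the sample output rather than a confirmed description of the program.

```python
def bh_ratios(pvalues):
    """Return p * m / rank for each p-value (rank 1 = smallest), in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    ratios = [0.0] * m
    for rank, i in enumerate(order, start=1):
        ratios[i] = pvalues[i] * m / rank
    return ratios

def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (monotone in p, capped at 1)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    ratios = bh_ratios(pvalues)
    adjusted = [0.0] * m
    running_min = 1.0
    for i in reversed(order):  # walk from the largest p-value down
        running_min = min(running_min, ratios[i])
        adjusted[i] = running_min
    return adjusted

pvals = [0.001, 0.02, 0.021, 0.5]  # made-up values, not program output
print(bh_ratios(pvals))   # ratios need not increase with p
print(bh_adjust(pvals))   # adjusted values never decrease as p grows
```

The two functions differ only in the running-minimum step, which explains why a table sorted by the plain ratio is not always sorted by P-value.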

The FullResults table lists further results for all sites. These include the number of parameters and the log-likelihood for the WAG+ssF model (site-specific frequencies, the *homogeneous model*) and for the WAG+lssF model (lineage- and site-specific frequencies, the *non-homogeneous model*). In the sample header, the second "WAG+lssF params" column holds the WAG+lssF log-likelihood. Conserved locations are not analysed, so their entries are 'NA'. Finally, after the FullResults table, the output file contains details of the per-location results, such as the amino acid frequencies estimated by the ssF and lssF models.
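The columns of a FullResults row are related by simple arithmetic: delta lnL is the difference between the two log-likelihoods, and dof is the difference between the two parameter counts. The LRT values in the sample rows also match a chi-square upper-tail probability of 2 × delta lnL with dof degrees of freedom. The Python sketch below reproduces those numbers. It is inferred from the sample values and is not taken from the program's source; `chi2_sf` and `lrt_from_row` are hypothetical names, and the sketch assumes dof is a positive integer.

```python
import math

def chi2_sf(x, dof):
    """Upper-tail probability of a chi-square variable with integer dof >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed forms for dof 1 and 2, then step up two degrees of freedom at a time.
    if dof % 2 == 1:
        q, k = math.erfc(math.sqrt(half)), 1
    else:
        q, k = math.exp(-half), 2
    while k < dof:
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q

def lrt_from_row(ssf_params, ssf_lnl, lssf_params, lssf_lnl):
    """Return (delta lnL, dof, P-value) from the four model columns of a row."""
    delta = lssf_lnl - ssf_lnl
    dof = lssf_params - ssf_params
    return delta, dof, chi2_sf(2.0 * delta, dof)

# Sites 2 and 3 from the FullResults sample above.
print(lrt_from_row(3, -21.488149, 5, -13.378828))  # P-value close to 0.0003007
print(lrt_from_row(2, -48.561153, 3, -47.763333))  # P-value close to 0.2065223
```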

Analysing the results: The output file is in YAML format, which means it can be read by any other programming language that has a YAML parsing library. Here we show an example of analysing the results using the programming language R. The code is available in the src/R/example.R file. Start the R console, install and load the yaml library, then load the output file using the yaml.load_file function:

```
$ R
R version 2.15.3 (2013-03-01) -- "Security Blanket"
...
> install.packages("yaml")
Installing package(s) into ‘/Users/Tester/Library/R/2.15/library’
(as ‘lib’ is unspecified)
...
> library(yaml)
> out <- yaml.load_file(input='/Users/Tester/Documents/tdg09/H1_out.txt')
> summary(out)
                   Length Class  Mode
StartTime            1    -none- character
WorkingDirectory     1    -none- character
Options              1    -none- character
TreeFile             1    -none- character
AlignmentFile        1    -none- character
Alignment            2    -none- list
Groups               2    -none- character
LabelledTree         1    -none- character
LrtResults         196    -none- list
FullResults        566    -none- list
ConservedPositions   2    -none- list
SiteResults        566    -none- list
EndTime              1    -none- character
```

To make the R list objects easier to work with, we convert the LrtResults and FullResults objects into data.frames:

```
> lrt_results <- as.data.frame(matrix(unlist(out$LrtResults), ncol=5, byrow=TRUE))
> names(lrt_results) <- c("site", "deltaLnL", "dof", "lrt", "fdr")
> head(lrt_results)
  site  deltaLnL dof      lrt       fdr
1  204 21.850690   3 0.00e+00 0.0000003
2  169 11.528011   1 1.60e-06 0.0001542
3  289 10.733096   2 2.18e-05 0.0008550
4  252 10.927903   2 1.80e-05 0.0008796
5    9  8.225516   1 4.99e-05 0.0008895
6  300 15.027774   5 1.44e-05 0.0009396
> sum(lrt_results$fdr <= 0.05) # how many sites identified with FDR <= 0.05?
[1] 55
> full_results <- as.data.frame(matrix(unlist(out$FullResults), ncol=9, byrow=TRUE))
> names(full_results) <- c("site", "ssfParams", "ssfLnL", "lssfParams", "lssfLnL", "deltaLnL", "dof", "lrt", "fdr")
> head(full_results)
  site ssfParams     ssfLnL lssfParams    lssfLnL deltaLnL dof       lrt       fdr
1    1        NA         NA         NA         NA       NA  NA        NA        NA
2    2         3 -21.488149          5 -13.378828 8.109321   2 0.0003007 0.0025627
3    3         2 -48.561153          3 -47.763333  0.79782   1 0.2065223 0.3489515
4    4         2 -24.137478          3 -23.407924 0.729554   1 0.2270721 0.3708845
5    5         2 -12.611075          3 -12.456865 0.154209   1 0.5786522 0.6593943
6    6         3 -23.856049          5  -22.76144 1.094609   2 0.3346706 0.5084918
```
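The same tables can be read from other languages. With a YAML library installed, loading the whole file is the better route. As a minimal alternative, the Python sketch below uses only the standard library. It is not a general YAML parser: `read_table` is a hypothetical helper that assumes rows are written exactly as in the sample output (`- [ a, b, ... ]` under a `Name:` line).

```python
def read_table(text, name):
    """Collect the '- [ ... ]' rows that follow the line '<name>:'."""
    rows, inside = [], False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == name + ":":
            inside = True
        elif inside and stripped.startswith("- ["):
            cells = stripped[3:].rstrip("]").split(",")
            # 'NA' marks conserved sites that were not analysed.
            rows.append([None if c.strip() == "NA" else float(c) for c in cells])
        elif inside and stripped and not stripped.startswith("#"):
            break  # the next top-level key ends the table
    return rows

# A few rows copied from the sample output above.
output = """LrtResults:
#   Site,  delta lnL,  dof, LRT,       FDR
- [  204,  21.850690,  3,   0.0000000, 0.0000003 ]
- [  169,  11.528011,  1,   0.0000016, 0.0001542 ]
- [  289,  10.733096,  2,   0.0000218, 0.0008550 ]

FullResults:
- [    1,  NA,         NA, NA,          NA,         NA, NA,        NA,        NA ]
- [    2,  3,  -21.488149,  5,  -13.378828,   8.109321,  2, 0.0003007, 0.0025627 ]
"""

lrt = read_table(output, "LrtResults")
significant = [int(row[0]) for row in lrt if row[4] <= 0.05]
print(significant)  # → [204, 169, 289]
```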

To produce a plot of FDR values by site:

```
> # as.character() handles the FDR column whether it was read as a factor or as character
> fdr <- as.numeric(as.character(full_results$fdr)) # FDR column should be numeric
Warning message:
NAs introduced by coercion
> fdr[is.na(fdr)] <- 1.0 # conserved locations implicitly have no evidence of non-homogeneity
> sites <- out$Alignment$SiteCount
> plot_ranges <- split(seq(1, sites), cut(seq(1, sites), 5)) # split sites in plot into 5 rows
> par(mfrow=c(5,1), mar=c(2.0,0.5,0.5,0.5))
for (p in 1:5) {
    plot(1 - fdr,
    xlim=c(plot_ranges[[p]][1], tail(plot_ranges[[p]], n=1)),
    type='h', lwd=1, main='', xlab='', ylab='', yaxt='n', col="#1B9E77")
    lines(which(fdr <= 0.20), 1 - fdr[fdr <= 0.20], col="#D95F02", type='h')
    abline(h=0.95, lty='dashed')
    points(which(fdr <= 0.05), 1 - fdr[fdr <= 0.05], pch=20, col="#DE2D26")
}
```

This produces the plot shown below. Bars drawn in orange indicate locations with FDR <= 0.20, and locations with FDR <= 0.05 have a red dot at their value (we drew the plot with "1 - fdr" values so that smaller FDRs are taller). The plot shows a cluster of identified sites at locations 200-210.

The R code in src/R/example.R also provides an example of using the phangorn and ape packages to simulate data under the homogeneous model for a particular site. We can then analyse the synthetic data with the TdG09 software; this parametric bootstrap gives the significance of the non-homogeneous model on the original data. For example, the Monte Carlo distribution of Δ for 1000 parametric bootstrap replicates for flu protein HA (H1) site 204 shows that the homogeneous model can be rejected in favour of the non-homogeneous model with P-value < 0.001.
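A bootstrap P-value of this kind is the share of replicate Δ values that are at least as large as the observed one. The Python sketch below shows the arithmetic. It assumes the common convention of adding one to both numerator and denominator, which the source does not specify, and the replicate values are made-up stand-ins rather than TdG09 results.

```python
def monte_carlo_p(observed, replicates):
    """P-value estimate: share of replicate statistics at least as large as observed."""
    exceed = sum(1 for value in replicates if value >= observed)
    return (exceed + 1) / (len(replicates) + 1)

# Stand-in for 1000 replicate statistics (0.00, 0.01, ..., 9.99); real values
# would come from running TdG09 on each simulated data set.
null_deltas = [i / 100 for i in range(1000)]

observed_delta = 21.850690  # delta lnL reported for site 204 above
print(monte_carlo_p(observed_delta, null_deltas))  # 1/1001, just under 0.001
```

With 1000 replicates and no replicate reaching the observed value, the estimate is 1/1001, which is in line with reporting P-value < 0.001.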
