MASQUE : Metagenomic Analysis with a Quantitative pipeline

Amine Ghozlane (amine.ghozlane@pasteur.fr) (@xealf8)

Introduction

The aim of this project is to provide an easy cluster interface for targeted metagenomic analysis. MASQUE allows you:

  • to analyse 16S/18S/23S/28S/ITS data. It builds a count matrix, an annotation table and a phylogeny of the OTUs.
  • to perform an up-to-date analysis that follows the scientific literature. The parameters have already been tested on numerous projects.

Process

We follow the recommendations described by Robert C. Edgar in the UPARSE supplementary paper.
In short, MASQUE processes the data in 7 main steps:

  1. Read quality control
  2. Dereplication
  3. Chimera filtering
  4. Clustering
  5. Realignment/mapping
  6. Taxonomical annotation of the OTU
  7. Quality check of every step
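
To make step 2 concrete, here is a minimal sketch of full-length dereplication: identical sequences are collapsed and their abundance is recorded. It is a toy illustration built on standard Unix tools, not the way MASQUE implements this step; the function name derep and the input sequences are made up for the example.

```bash
#!/usr/bin/env bash
# Toy full-length dereplication (illustration only, not MASQUE's implementation).
# Reads one sequence per line on stdin and prints "count<TAB>sequence",
# most abundant first, ties broken alphabetically.
derep() {
    LC_ALL=C sort \
        | uniq -c \
        | awk '{print $1 "\t" $2}' \
        | LC_ALL=C sort -t $'\t' -k1,1nr -k2,2
}

# Made-up reads: ACGT occurs 3 times, TTGA twice, GGCC once.
printf 'ACGT\nACGT\nTTGA\nACGT\nTTGA\nGGCC\n' | derep
```

Sorting by abundance matters because clustering tools typically consider the most abundant unique sequences first as candidate centroids.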

You can find more information in the presentation here. It describes the idea behind each step and includes a complete practical exercise (TP) to try it on your own.

Installation

Docker install

The easiest way to use MASQUE is through Docker, as follows:

wget ftp://shiny01.hosting.pasteur.fr/pub/database.zip
unzip database.zip -d /path/to/databases
docker run -i -t -v /path/to/fastq-data:/mydata -v /path/to/databases:/usr/local/bin/databases/ aghozlane/masque

Replace /path/to/fastq-data with the directory containing the reads and /path/to/databases with the directory containing the databases. Inside the container, the data are available in /mydata and the masque program is directly accessible:

masque -i /mydata/ -o /mydata/result/

Git install (deprecated)

MASQUE comes with many binaries for 64-bit Linux. It always uses your installed versions when they exist and falls back to the bundled ones otherwise. The list of dependencies is given later in this document. For a correct deployment with git, first install git-lfs:

sudo ./git-lfs-1.2.1/install.sh
git lfs install

Then clone masque:

git clone https://github.com/aghozlane/masque.git

Only the biom program needs to be installed by the user:

pip install biom-format

Then install the databases as follows:

/bin/bash install_databases.sh

Command line options

masque -h
16S/18S: /bin/bash masque.sh -i </path/to/input/directory/> -o </path/to/result/directory/>
23S/28S: /bin/bash masque.sh -l -i </path/to/input/directory/> -o </path/to/result/directory/>
ITS: /bin/bash masque.sh -f -i </path/to/input/directory/> -o </path/to/result/directory/>
Amplicon: /bin/bash masque.sh -a <amplicon file> -o </path/to/result/directory/>
- All parameters:
-i      Provide </path/to/input/directory/>
-a      Provide <amplicon file>
-o      Provide </path/to/result/directory/>
-n      Indicate <project-name> (default: use the name of the input directory)
-t      Number of <thread> (default all cpu will be used)
-c      Contaminant filtering [danio,human,mouse,mosquito,phi] (Default: human,phi)
-s      Perform OTU clustering with swarm
-b      Perform taxonomical annotation with blast (Default vsearch)
-l      Perform taxonomical annotation against LSU databases: Silva/RDP
-f      Perform taxonomical annotation against ITS databases: Unite/Findley/Underhill/RDP
--minreadlength Minimum read length take in accound in the study (Default 35nt)
--minphred      Qvalue must lie between [0-40] (Default minimum qvalue 20)
--minphredperc  Minimum allowed percentage of correctly called nucleotides [0-100] (Default 80)
--NbMismatchMapping     Maximum number of mismatch when mapping end-to-end against Human genome and Phi174 genome (Default 1 mismatch is accepted)
--maxoverlap    Maximum overlap when paired reads are considered (Default 200 nt)
--minoverlap    Minimum overlap when paired reads are considered (Default 50 nt)
--minampliconlength     Minimum amplicon length (Default 64nt)
--minotusize    Indicate minimum OTU size (Default 4)
--prefixdrep    Perform prefix dereplication (Default full length dereplication)
--chimeraslayerfiltering        Use ChimeraSlayer database for chimera filtering (Default : Perform a de novo chimera filtering)
--otudiffswarm  Number of difference accepted in an OTU with swarm (Default 1)
--evalueTaxAnnot        evalue threshold for taxonomical annotation with blast (Default evalue=1E-5)
--maxTargetSeqs Number of hit per OTU with blast (Default 1)
--identityThreshold    Identity threshold for taxonomical annotation with vsearch (Default 0.75)
--conservedPosition Percentage of conserved position in the multiple alignment considered for phylogenetic tree (Default 0.8)
--accurateTree  Accurate tree calculation with IQ-TREE instead of FastTree (Default FastTree)
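
The usage lines above differ only by one flag per target type, which can be sketched as a small dry-run helper. The function masque_cmd is hypothetical (it is not shipped with MASQUE); it only prints the command line built from the -l, -f, -i and -o options documented above, without running anything.

```bash
#!/usr/bin/env bash
# Hypothetical dry-run helper (not part of MASQUE): print the masque.sh
# command line matching a target type, using only the documented flags.
masque_cmd() {
    local target="$1" input="$2" output="$3" flag
    case "$target" in
        16S|18S) flag="" ;;     # default mode, no extra flag
        23S|28S) flag="-l" ;;   # annotation against LSU databases
        ITS)     flag="-f" ;;   # annotation against ITS databases
        *) echo "unknown target: $target" >&2; return 1 ;;
    esac
    # ${flag:+$flag } expands to "<flag> " only when flag is non-empty.
    echo "/bin/bash masque.sh ${flag:+$flag }-i $input -o $output"
}

masque_cmd ITS /mydata/ /mydata/result/
# prints: /bin/bash masque.sh -f -i /mydata/ -o /mydata/result/
```

Printing the command rather than executing it makes it easy to review the chosen options before submitting a long run.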

SGE and SLURM deployments

Template scripts are provided for SGE and SLURM deployments:

masque_16S-18S.sh
masque_16S-18S_tars.sh
masque_23S-28S.sh
masque_23S-28S_tars.sh
masque_ITS.sh
masque_ITS_tars.sh
masque_amplicon.sh
masque_amplicon_tars.sh

Users from Institut Pasteur should refer to README_PASTEUR.

Results

After the calculation, the output directory contains:

| File | Description |
|---|---|
| project_stat_process.txt | Progress of every step (during the calculation: tail -f project-name_stat_process.txt; at the end: less project-name_stat_process.txt) |
| project_annotation_process.tsv | Summary of the annotation process |
| project_build_process.tsv | Summary of the OTU-building process (number of reads, contaminants and OTUs identified per sample...) |
| project_otu.fasta | OTU centroid sequences in FASTA format |
| project_otu_table.tsv | Count table with the raw count obtained for each OTU and each sample |
| project_vs_database_annotation_eval_val.tsv | OTU annotation performed by blast against the different databanks |
| project_database_eval_val.biom | Biom file including the counts and the annotation |
| project_vs_rdp.tsv | OTU annotation performed by rdp |
| project_otu_*_bmge.ali.treefile | OTU phylogeny generated for the sequences annotated by the databases |
| reads/*_fastqc.html | FASTQ quality report after trimming/clipping |

The other files correspond to intermediate results.
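
As a quick sanity check on the OTU FASTA file, the number of OTUs can be read off by counting header lines, since every FASTA record starts with ">". The sketch below is a minimal example under that assumption; the helper name count_fasta, the temporary file and the record names OTU_1 and OTU_2 are made up for the demonstration and are not MASQUE's actual naming.

```bash
#!/usr/bin/env bash
# Count the records of a FASTA file by counting header lines (lines starting with ">").
count_fasta() {
    grep -c '^>' "$1"
}

# Demonstration on a toy file; in practice, point it at the OTU FASTA file
# in your result directory.
tmp="$(mktemp)"
printf '>OTU_1\nACGT\n>OTU_2\nTTGA\n' > "$tmp"
count_fasta "$tmp"   # prints 2
rm -f "$tmp"
```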

Dependencies

  • AlienTrimmer
    Performs the trimming and clipping of the reads
    Criscuolo, A., Brisse, S., AlienTrimmer: a tool to quickly and accurately trim off multiple short contaminant sequences from high-throughput sequencing reads., Genomics, 2013, 102(5), 500-506.
  • Biom
    Combine count matrix and taxonomical annotation table
    Daniel McDonald, et al., The Biological Observation Matrix (BIOM) format or: how I learned to stop worrying and love the ome-ome. GigaScience, 2012, 1:7. doi:10.1186/2047-217X-1-7
  • Blastn
    Performs the taxonomical annotation
    Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. "Basic local alignment search tool." J. Mol. Biol., 1990, 215:403-410.
  • Bowtie2
    Finds contaminants
    Langmead B., Salzberg S., Fast gapped-read alignment with Bowtie 2. Nature Methods, 2012, 9:357-359.
  • BMGE
    Select informative regions in multiple sequence alignments
    Criscuolo, A., Gribaldo, S., BMGE (Block Mapping and Gathering with Entropy): a new software for selection of phylogenetic informative regions from multiple sequence alignments. BMC evolutionary biology, 2010, 10(1), 1.
  • Fastqc
    Checks read quality.
    Andrews S. (2010). FastQC: a quality control tool for high throughput sequence data.
  • Fasttree
    Computes the phylogeny of OTU sequences (Default selection).
    Price, M.N., Dehal, P.S., and Arkin, A.P., FastTree 2 -- Approximately Maximum-Likelihood Trees for Large Alignments. PLoS ONE, 2010, 5(3):e9490.
  • FLASH
    Merges paired reads to get amplicons.
    Magoc T. and Salzberg S., FLASH: Fast length adjustment of short reads to improve genome assemblies. Bioinformatics, 2011, 27(21):2957-63.
  • IQTREE
    Computes the phylogeny of OTU sequences.
    Nguyen, L. T., Schmidt, H. A., von Haeseler, A., & Minh, B. Q., IQ-TREE: A fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Molecular biology and evolution, 2015, 32(1), 268-274.
  • Mafft
    Performs a multiple alignment of OTU sequences.
    Katoh, Misawa, Kuma, Miyata, MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res., 2002, 30:3059-3066.
  • rdp classifier
    Performs taxonomical annotation for 16S, 18S, ITS.
    Wang, Q., Garrity G. M., Tiedje J. M., and Cole J. R., Naïve Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy. Appl Environ Microbiol., 2007, 73(16):5261-5267.
  • swarm
    Performs OTU clustering.
    Mahé F, Rognes T, Quince C, de Vargas C, Dunthorn M., Swarm v2: highly-scalable and high-resolution amplicon clustering. PeerJ, 2015.
  • vsearch
    Performs OTU clustering (Default selection).
    Rognes, T., Flouri, T., Nichols, B., Quince, C., & Mahé, F., VSEARCH: a versatile open source tool for metagenomics. PeerJ, 2016, 4, e2584.

Databases

MASQUE uses several databases for taxonomical annotation and data filtering, as follows:

Taxonomical annotation

  • FINDLEY
    Used for the taxonomical annotation of ITS sequences.
    Findley, K., et al., Topographic diversity of fungal and bacterial communities in human skin. Nature, 2013, 498(7454), 367-370.
    http://www.mothur.org/wiki/Findley_ITS_Database
  • GREENGENES
    Used for the taxonomical annotation of 16S, 18S sequences.
    DeSantis, T. Z., et al. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Applied and environmental microbiology, 2006, 72(7), 5069-5072.
    http://greengenes.secondgenome.com/downloads/database/13_5
  • SILVA LSU, SSU
    Used for the taxonomical annotation of 16S, 18S, 23S, 28S sequences.
    Pruesse, E., et al., SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic acids research, 2007, 35(21), 7188-7196.
    https://www.arb-silva.de/
  • UNDERHILL
    Used for the taxonomical annotation of ITS sequences.
    Tang J, Iliev I, Brown J, Underhill D and Funari V. Mycobiome: Approaches to Analysis of Intestinal Fungi. Journal of Immunological Methods, 2015, 421:112-21.
    https://risccweb.csmc.edu/microbiome/thf/
  • UNITE
    Used for the taxonomical annotation of ITS sequences.
    Abarenkov, K., et al., The UNITE database for molecular identification of fungi–recent updates and future perspectives. New Phytologist, 2010, 186(2), 281-285.
    https://unite.ut.ee/repository.php

Filtering databases

  • AlienTrimmer
    Adapter sequences provided by Illumina and Life Technologies.
  • GOLD
    ChimeraSlayer reference database used for chimera filtering (the default mode uses de novo filtering instead)
    http://drive5.com/uchime/uchime_download.html
  • NCBI Anopheles stephensi, Danio rerio, Homo sapiens, Mus musculus, PhiX174
    Used to search for contamination from the host or from the person handling the samples. The PhiX174 phage genome is included because this phage is used during sequencing.

Test

Samples from the mock communities are available for testing [The NIH HMP Working Group, 2009].

| Label | Community |
|---|---|
| SRR053818 | even |
| SRR072220 | even |
| SRR072221 | even |
| SRR072223 | staggered |
| SRR072237 | staggered |
| SRR072239 | staggered |

Mock communities are composed of 21 species mixed in even or staggered proportions.
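
When comparing results between the two designs, it can help to map each test sample to its community type. The lookup below simply encodes the sample labels listed above; the function name community_of is hypothetical and not part of MASQUE.

```bash
#!/usr/bin/env bash
# Hypothetical lookup (not part of MASQUE): return the mock community type
# of a test sample, using the labels listed in this section.
community_of() {
    case "$1" in
        SRR053818|SRR072220|SRR072221) echo even ;;
        SRR072223|SRR072237|SRR072239) echo staggered ;;
        *) echo "unknown sample: $1" >&2; return 1 ;;
    esac
}

community_of SRR072237   # prints staggered
```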

You can run:

gunzip test/data/*.gz
/bin/bash ./masque.sh -i test/data -o test/result

The results can be visualized with SHAMAN and compared with the results obtained with the MOCK reference genome available in test/mock/.

Bugs

All bug reports are highly appreciated. You may submit a bug report here on GitHub as an issue or send an email to amine.ghozlane@pasteur.fr.

Citation

No paper about MASQUE alone has been published yet, but you can cite the first publication that used this program:

  • A bacteriocin from epidemic Listeria strains alters the host intestinal microbiota to favor infection. Quereda JJ, Dussurget O, Nahori MA, Ghozlane A, Volant S, Dillies MA, Regnault B, Kennedy S, Mondot S, Villoing B, Cossart P, Pizarro-Cerda J.; PNAS 2016. PUBMED.

Acknowledgements

Thanks to Emna Achouri - emna.achouri@pasteur.fr for tars support.