Skip to content
/ stag Public

A hierarchical taxonomic classifier for metagenomic sequences

Notifications You must be signed in to change notification settings

zellerlab/stag

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation



This tool is design to classify metagenomic sequences (marker genes, genomes and amplicon reads) using a Hierarchical Taxonomic Classifier.

Please check also the wiki for more information.





Dependencies

The stag classifier requires:

  • Python 3.7 (or higher)
  • HMMER3 (or Infernal)
  • Easel (link)
  • seqtk
  • prodigal (to predict genes in genomes)
  • python library:
    • numpy
    • pandas
    • sklearn
    • h5py = 2.10.0

If you have conda, you can install all the dependencies in conda_env_stag.yaml. See Installation wiki for more info.

Installation

git clone https://github.com/zellerlab/stag.git
cd stag
# if environment is needed
conda env create -f conda_env_stag.yaml
python setup.py bdist_wheel
pip install --no-deps --force-reinstall dist/*.whl

Note: in the following examples we assume that the python script stag is in the system path.

Execution

# if environment was installed
conda activate stag
# test the installation
stag test

Taxonomically annotate gene sequences

Given a fasta file (let's say unknown_seq.fasta), you can find the taxonomy annotation of these sequences using:

stag classify -d test_db.stagDB -i unknown_seq.fasta

The output is:

sequence	taxonomy
geneA	d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus
geneB	d__Bacteria
geneC	d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria

You can either create a database (see Create a database), or use one that we already compiled:

Taxonomically annotate genomes

Given a fasta file (let's say unknown_genome.fasta), you can find the taxonomy annotation of this genome with:

stag classify_genome -i unknown_genome.fasta -d gtdb_30.stagDB -o res_dir

The output is saved in the directory res_dir. Inside you will find the file genome_annotation with the annotation in the same format as in the gene classification. More information on the other files can be found here.

To classify multiple genomes, you can use:

stag classify_genome -D all/genomes/dir -d gtdb_30.stagDB -o res_dir

Where all/genomes/dir is a directory, and all fasta files inside the directory will be classified.

Finally, you can find some databases to classify genomes (gtdb_30.stagDB in the examples) here.

About

A hierarchical taxonomic classifier for metagenomic sequences

Topics

Resources

Stars

Watchers

Forks

Packages

No packages published

Languages