Assessing computational predictions of antimicrobial resistance phenotypes from microbial genomes

Introduction

Software list

- Aytan-Aktug [1]
- Seq2Geno2Pheno (Seq2Geno & Geno2Pheno) [2]
- PhenotypeSeeker v0.7.3 [3]
- Kover 2.0 [4]
- ResFinder 4.0 [5], a direct-association tool based on an AMR determinant database, used as the baseline

Datasets

Dataset overview
Genome list of each single-species-antibiotic dataset in the form of Data_<species>_<antibiotic>
Genome phenotype metadata of each species-antibiotic combination in the form of Data_<species>_<antibiotic>_pheno.txt.
Evaluation folds
Tutorials for creating AMR benchmarking datasets
Mapping from PATRIC ID to NCBI and GenBank ID
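
As a small illustration of the naming scheme in the overview above, the following shell sketch composes the two file names for one species-antibiotic pair. The helper names and the antibiotic `ciprofloxacin` are illustrative assumptions, not taken from the repository, and the underscore-separated species name is assumed to match the spelling used in Config.yaml. No assumption is made about the directory that holds these files.

```shell
# Hypothetical helpers; only the Data_<species>_<antibiotic> pattern comes from the overview.

# Name of the genome list for one species-antibiotic dataset.
genome_list_name() {  # usage: genome_list_name <species> <antibiotic>
    printf 'Data_%s_%s\n' "$1" "$2"
}

# Name of the matching phenotype metadata file.
pheno_file_name() {  # usage: pheno_file_name <species> <antibiotic>
    printf 'Data_%s_%s_pheno.txt\n' "$1" "$2"
}

genome_list_name Escherichia_coli ciprofloxacin
pheno_file_name Escherichia_coli ciprofloxacin
```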

Prerequisites

Dependencies
- To reproduce the output, you need Linux and conda. We used Miniconda2 4.8.4. All software environments were activated from "base", the default conda environment.
- Installation of the conda environments:
```
git clone https://github.com/hzi-bifo/AMR_benchmarking.git
cd AMR_benchmarking
bash ./install/install.sh  # Creates 9 conda environments and installs the packages for each
```
- For Kover, please refer to the Kover project for alternative installation methods.
- Finally, install PyTorch manually in multi_torch_env. To install a PyTorch build compatible with your CUDA version, follow the instructions at https://pytorch.org/get-started/locally/. Our code was tested with PyTorch v1.7.1 and CUDA versions 10.1 and 11.0.
Memory requirement: Some procedures require very large amounts of memory. The feature-building procedure of the Aytan-Aktug multi-species model (adapted version) needs ~370 GB of memory. The other ML software needs up to 80 GB, depending on the number of CPUs and the specific species-antibiotic combination.
Disk storage requirement: Some procedures generate very large intermediate files, although our pipeline deletes them once the procedure has finished. For example, PhenotypeSeeker (adapted version) needs the most disk storage, on the order of 10 TB depending on the species.
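
Before starting a long run, it may help to compare the machine against the figures quoted above. The following is a minimal pre-flight sketch, not part of the repository: the function names are hypothetical, the 80 GB and 10 TB thresholds are simply the upper figures mentioned in this section, and the memory check reads `/proc/meminfo`, so it assumes Linux.

```shell
# Hypothetical pre-flight check; thresholds are the upper figures quoted in this section.

# Total RAM in GB, read from /proc/meminfo (Linux only).
total_mem_gb() {
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' /proc/meminfo
}

# Free space in GB on the filesystem that holds the given path.
free_disk_gb() {
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Print a warning when the measured value is below the required one.
warn_if_below() {  # usage: warn_if_below <label> <have_gb> <need_gb>
    if [ "$2" -lt "$3" ]; then
        echo "WARNING: $1: have $2 GB, need about $3 GB"
    else
        echo "OK: $1: $2 GB"
    fi
}

warn_if_below "RAM for ML software" "$(total_mem_gb)" 80
warn_if_below "free disk under the current directory" "$(free_disk_gb .)" 10000
```

Actual needs depend on the species and the software, so a warning here is a hint rather than a hard stop.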

Input file

The input file is Config.yaml, a YAML file in the root folder of the repository. All of its options are described below:

A. Basic/required parameters setting

Please replace every value after the ":" in section A with your own settings.

| option | action | values ([default]) |
|---|---|---|
| dataset_location | Where the PATRIC data will be downloaded (~246 GB). | /vol/projects/BIFO/patric_genome |
| output_path | Where the `Results` folder is generated. It holds the direct results of each software and the visualizations. | ./ |
| log_path | Where the `log` folder for intermediate files is generated (~10 TB, assuming files related to completed benchmarking species are cleaned regularly). | ./ |
| n_jobs | Number of CPU cores (>1) to use. | 10 |
| gpu_on | Whether to use a GPU for the Aytan-Aktug SSSA model. If set to False, the work is parallelized on CPUs; otherwise, it runs sequentially on one GPU core. | False |
| clean_software | Clean large intermediate files of the specified software (optional). Large temp files can also be removed manually from `<log_path>/log/software/<software_name>/software_output`. | |
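
To make the "value after the colon" convention concrete, here is a minimal shell sketch that writes a sample file with the default values listed for section A and reads two of them back. It assumes that Config.yaml stores each of these options as a flat `key: value` line; the sample file and the `yaml_get` helper are illustrative and not part of the repository.

```shell
# Assumption: Config.yaml stores each option as a flat "key: value" line.
# The sample file is illustrative and uses the default values listed for section A.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
cat > "$sample" <<'EOF'
dataset_location: /vol/projects/BIFO/patric_genome
output_path: ./
log_path: ./
n_jobs: 10
gpu_on: False
EOF

# Print the value of a top-level key (quoting, comments, and nesting are not handled).
yaml_get() {  # usage: yaml_get <key> <file>
    sed -n "s/^$1:[[:space:]]*//p" "$2" | head -n 1
}

echo "n_jobs = $(yaml_get n_jobs "$sample")"
echo "gpu_on = $(yaml_get gpu_on "$sample")"
```

A quick read-back like this can catch a mistyped key before a long run starts; for anything beyond flat keys, a real YAML parser is the safer choice.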

B. Optional parameters setting

Please change the conda environment names if environments with the same names already exist on your machine.

| option | action | values ([default]) |
|---|---|---|
| amr_env_name, amr_env_name2 | conda env for general use | amr_env, amr2 |
| PhenotypeSeeker_env_name | conda env for PhenotypeSeeker | PhenotypeSeeker_env |
| multi_env_name | conda env for | multi_env |
| multi_torch_env_name | conda env for NN model | multi_torch_env |
| kover_env_name | conda env for Kover | kover_env |
| se2ge_env_name | conda env for Seq2Geno | snakemake_env |
| kmer_env_name | conda env for Seq2Geno k-mers generation | kmer_kmc |
| phylo_name | conda env for Seq2Geno phylogenetic trees generation | phylo_env |
| phylo_name2 | conda env for visualization of misclassified genomes | phylo_env2 |
| resfinder_env | conda env for ResFinder | res_env |

C. Advanced/optional parameters setting

You can evaluate a subset of species at a time by modifying the values of the 'species_list', 'species_list_phylotree', and 'species_list_multi_antibiotics' options.
For multi-species models, we list every species for which this study provides a dataset; you can explore new combinations of the listed species. Users who want to reproduce the AMR benchmarking results are advised not to change the settings in this category.

| option | action | values ([default]) |
|---|---|---|
| species_list | Benchmarked species under random and homology-aware folds for single-species evaluation | Escherichia_coli, Staphylococcus_aureus, Salmonella_enterica, Klebsiella_pneumoniae, Pseudomonas_aeruginosa, Acinetobacter_baumannii, Streptococcus_pneumoniae, Mycobacterium_tuberculosis, Campylobacter_jejuni, Enterococcus_faecium, Neisseria_gonorrhoeae |
| species_list_phylotree | Benchmarked species under phylogeny-aware folds for single-species evaluation | Escherichia_coli, Staphylococcus_aureus, Salmonella_enterica, Klebsiella_pneumoniae, Pseudomonas_aeruginosa, Acinetobacter_baumannii, Streptococcus_pneumoniae, Campylobacter_jejuni, Enterococcus_faecium, Neisseria_gonorrhoeae |
| species_list_multi_antibiotics | Benchmarked species for single-species multi-antibiotic model. | Mycobacterium_tuberculosis, Escherichia_coli, Staphylococcus_aureus, Salmonella_enterica, Klebsiella_pneumoniae, Pseudomonas_aeruginosa, Acinetobacter_baumannii, Streptococcus_pneumoniae, Neisseria_gonorrhoeae |
| species_list_multi_species | Benchmarked species for multi-species models. | Mycobacterium_tuberculosis, Salmonella_enterica, Streptococcus_pneumoniae, Escherichia_coli, Staphylococcus_aureus, Klebsiella_pneumoniae, Acinetobacter_baumannii, Pseudomonas_aeruginosa, Campylobacter_jejuni |
| cv_number | The k value of k-fold nested cross-validation | 10 |
| QC_criteria | Sample quality control level. Can be loose or strict. | loose |
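
When picking a subset of species, note that the lists differ: for example, Mycobacterium_tuberculosis appears in species_list but not in species_list_phylotree. The shell sketch below checks membership before a value is edited. The two lists are copied from the defaults above; the `in_list` helper is hypothetical and not part of the repository.

```shell
# Default species lists copied from section C (comma-separated).
species_list="Escherichia_coli, Staphylococcus_aureus, Salmonella_enterica, Klebsiella_pneumoniae, Pseudomonas_aeruginosa, Acinetobacter_baumannii, Streptococcus_pneumoniae, Mycobacterium_tuberculosis, Campylobacter_jejuni, Enterococcus_faecium, Neisseria_gonorrhoeae"
species_list_phylotree="Escherichia_coli, Staphylococcus_aureus, Salmonella_enterica, Klebsiella_pneumoniae, Pseudomonas_aeruginosa, Acinetobacter_baumannii, Streptococcus_pneumoniae, Campylobacter_jejuni, Enterococcus_faecium, Neisseria_gonorrhoeae"

# Succeeds if <species> is an exact member of the comma-separated <list>.
in_list() {  # usage: in_list <species> <list>
    printf '%s\n' "$2" | tr ',' '\n' | sed 's/^ *//; s/ *$//' | grep -qxF -- "$1"
}

for sp in Escherichia_coli Mycobacterium_tuberculosis; do
    if in_list "$sp" "$species_list_phylotree"; then
        echo "$sp: listed in species_list_phylotree"
    else
        echo "$sp: not listed in species_list_phylotree"
    fi
done
```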

Output

└── Results
    ├── final_figures_tables
    ├── other_figures_tables
    ├── supplement_figures_tables    
    └── software
        ├── AytanAktug
        ├── kover
        ├── majority
        ├── phenotypeseeker
        ├── resfinder_b
        ├── resfinder_folds
        ├── resfinder_k
        └── seq2geno

Cross-validation results of each ML software and evaluation results of ResFinder are generated under output_path/Results/software/<name of the software>.
Visualization tables and graphs are generated under output_path/Results/final_figures_tables and output_path/Results/supplement_figures_tables.
Numbers and statistical results mentioned in our benchmarking article are generated under output_path/Results/other_figures_tables.
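
Because the benchmark is run in several stages, it can be handy to see which parts of the output layout exist so far. The following shell sketch lists the directories from the tree above that are still missing under a given output_path. The `missing_results` helper is hypothetical and not part of the repository; the directory names are copied from the tree.

```shell
# Directory names copied from the Results tree; the helper name is hypothetical.
expected_dirs="final_figures_tables other_figures_tables supplement_figures_tables
software/AytanAktug software/kover software/majority software/phenotypeseeker
software/resfinder_b software/resfinder_folds software/resfinder_k software/seq2geno"

# Print every expected directory that is missing under <output_path>/Results.
missing_results() {  # usage: missing_results <output_path>
    for d in $expected_dirs; do
        [ -d "${1%/}/Results/$d" ] || echo "$d"
    done
}

missing_results ./
```

An empty result means every directory in the tree is present; it says nothing about whether the files inside are complete.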

Usage

git clone https://github.com/hzi-bifo/AMR_benchmarking.git
cd AMR_benchmarking
bash main.sh # Usage details are explained in main.sh. The whole AMR benchmarking cannot be completed by running this command just once.
bash ./scripts/model/clean.sh # Optional. Clean intermediate files

See main.sh for the benchmarking workflow.

Use clean.sh to remove large, less important intermediate files. You can run it at any time after the specified software has finished running on a benchmarked species. Do not run it while that software is running on a new benchmarked species.
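
Before cleaning, it may be useful to see how much space the temporary files of one software occupy. The shell sketch below builds the path given in the clean_software description, `<log_path>/log/software/<software_name>/software_output`, and reports its size without deleting anything. The function names are hypothetical, and `kover` as a value for `<software_name>` is an assumption; check the actual folder names under your log path.

```shell
# Path layout taken from the clean_software description; function names are hypothetical.
software_output_dir() {  # usage: software_output_dir <log_path> <software_name>
    printf '%s/log/software/%s/software_output\n' "${1%/}" "$2"
}

# Report the space used by one software's temp files; nothing is deleted.
report_temp_size() {  # usage: report_temp_size <log_path> <software_name>
    dir=$(software_output_dir "$1" "$2")
    if [ -d "$dir" ]; then
        du -sh "$dir"
    else
        echo "nothing to clean: $dir does not exist"
    fi
}

report_temp_size ./ kover  # "kover" is an assumed folder name
```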

References

[1] D Aytan-Aktug, Philip Thomas Lanken Conradsen Clausen, Valeria Bortolaia, Frank Møller Aarestrup, and Ole Lund. Prediction of acquired antimicrobial resistance for multiple bacterial species using neural networks. mSystems, 5(1), 2020.

[2] Ariane Khaledi, Aaron Weimann, Monika Schniederjans, Ehsaneddin Asgari, Tzu-Hao Kuo, Antonio Oliver, Gabriel Cabot, Axel Kola, Petra Gastmeier, Michael Hogardt, et al. Predicting antimicrobial resistance in Pseudomonas aeruginosa with machine learning-enabled molecular diagnostics. EMBO Molecular Medicine, 12(3):e10264, 2020.

[3] Erki Aun, Age Brauer, Veljo Kisand, Tanel Tenson, and Maido Remm. A k-mer-based method for the identification of phenotype-associated genomic biomarkers and predicting phenotypes of sequenced bacteria. PLoS Computational Biology, 14(10):e1006434, 2018.

[4] Alexandre Drouin, Gaël Letarte, Frédéric Raymond, Mario Marchand, Jacques Corbeil, and François Laviolette. Interpretable genotype-to-phenotype classifiers with performance guarantees. Scientific Reports, 9(1):1–13, 2019.

[5] Valeria Bortolaia, Rolf S Kaas, Etienne Ruppe, Marilyn C Roberts, Stefan Schwarz, Vincent Cattoir, Alain Philippon, Rosa L Allesoe, Ana Rita Rebelo, Alfred Ferrer Florensa, et al. ResFinder 4.0 for predictions of phenotypes from genotypes. Journal of Antimicrobial Chemotherapy, 75(12):3491–3500, 2020.
