blab/cartography

Dimensionality reduction distills complex evolutionary relationships in seasonal influenza and SARS-CoV-2

Sravani Nanduri1, Allison Black2, Trevor Bedford2,3, John Huddleston2,4

  1. Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA
  2. Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, WA, USA
  3. Howard Hughes Medical Institute, Seattle, WA, USA
  4. Corresponding author (jhuddles@fredhutch.org)

Preprint: https://doi.org/10.1101/2024.02.07.579374

Abstract

Public health researchers and practitioners commonly infer phylogenies from viral genome sequences to understand transmission dynamics and identify clusters of genetically related samples. However, viruses that reassort or recombine violate phylogenetic assumptions and require more sophisticated methods. Even when phylogenies are appropriate, they can be unnecessary or difficult to interpret without specialty knowledge. For example, pairwise distances between sequences can be enough to identify clusters of related samples or assign new samples to existing phylogenetic clusters. In this work, we tested whether dimensionality reduction methods could capture known genetic groups within two human pathogenic viruses that cause substantial human morbidity and mortality and frequently reassort or recombine, respectively: seasonal influenza A/H3N2 and SARS-CoV-2. We applied principal component analysis (PCA), multidimensional scaling (MDS), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP) to sequences with well-defined phylogenetic clades and either reassortment (H3N2) or recombination (SARS-CoV-2). For each low-dimensional embedding of sequences, we calculated the correlation between pairwise genetic and Euclidean distances in the embedding and applied a hierarchical clustering method to identify clusters in the embedding. We measured the accuracy of clusters compared to previously defined phylogenetic clades, reassortment clusters, or recombinant lineages. We found that MDS maintained the strongest correlation between pairwise genetic and Euclidean distances between sequences and best captured the intermediate placement of recombinant lineages between parental lineages. Clusters from t-SNE most accurately recapitulated known phylogenetic clades and recombinant lineages. Both MDS and t-SNE accurately identified reassortment groups.
We show that simple statistical methods without a biological model can accurately represent known genetic relationships for relevant human pathogenic viruses. Our open source implementation of these methods for analysis of viral genome sequences can be easily applied when phylogenetic methods are either unnecessary or inappropriate.

Phylogenetic trees and embeddings

Explore the phylogenetic trees and embeddings on Nextstrain.

Interactive figures

Main figures

Supplemental figures

Supplemental tables

Full analysis

Installation

First, install Conda with the Miniconda distribution. Until Bioconda supports modern Mac CPUs, Mac users with M1/M2 CPUs (the ARM64 architecture) need to install the Mac Intel x86 Miniconda distribution and install Rosetta, so the workflow can run under Mac's emulation mode.

After installing Conda, create the environment for this project.

conda env create -f cartography.yml

Activate the environment prior to running the workflow below.

conda activate cartography

Next, install Julia, then install TreeKnit by following its instructions for the "CLI" version. By default, the TreeKnit binary installs in your home directory at ~/.julia/bin/treeknit. The project's workflow calls this path to run TreeKnit.
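
Before starting the workflow, it can help to confirm that the TreeKnit binary is where the workflow expects it. The following is a minimal sketch, not part of this project's workflow: the function name check_treeknit is hypothetical, and the only path it assumes is the default ~/.julia/bin/treeknit named above.

```shell
# Hypothetical helper (not part of this repository): report whether the
# TreeKnit CLI exists and is executable at the expected path.
check_treeknit() {
    # Default to the path named in the installation notes; pass a different
    # path as the first argument to override it.
    treeknit_bin="${1:-$HOME/.julia/bin/treeknit}"
    if [ -x "$treeknit_bin" ]; then
        echo "TreeKnit found at $treeknit_bin"
    else
        echo "TreeKnit not found at $treeknit_bin" >&2
        return 1
    fi
}

# Check the default location and print a reminder if the binary is missing.
check_treeknit || echo "Install the TreeKnit CLI before running the workflow."
```

If TreeKnit was installed somewhere else, pass that path as the argument. The workflow itself still calls the default path.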

Notes for Windows users

If you are a Windows user, you will need to install WSL to run this project's workflow. Do not put this GitHub repository inside the Users folder. Snakemake treats the \U in a path like C:\Users as a Unicode escape, fails with a unicodeescape error, and will not run. Instead, make a folder outside of the Users folder (e.g., directly in the C drive) and install this GitHub repository, Anaconda, and all other dependencies there.
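
As a quick check from inside WSL, the sketch below tests whether a directory sits under a Windows Users folder. It assumes WSL's default layout, where Windows drives are mounted at /mnt/<drive letter>. The function name in_windows_users_folder is hypothetical and not part of this repository.

```shell
# Hypothetical helper (not part of this repository): succeed when a WSL path
# sits inside a Windows "Users" folder, assuming drives mount at /mnt/<letter>.
in_windows_users_folder() {
    case "$1" in
        /mnt/?/Users|/mnt/?/Users/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Warn if the current directory is under a Windows Users folder.
if in_windows_users_folder "$(pwd)"; then
    echo "Move this repository outside the Windows Users folder." >&2
else
    echo "Current directory is outside the Windows Users folder."
fi
```

Run it from the directory where you cloned the repository. If your WSL configuration mounts drives elsewhere, adjust the patterns.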

Run the full analysis

Run the full analysis for the project, which includes simulations, analysis of natural populations, and generation of the manuscript and its figures and tables.

snakemake \
    --use-conda \
    --conda-frontend conda \
    --cores all

This is a complex workflow, so it will take several hours to run.
