SARS-CoV-2 variant calling and consensus assembly pipeline

onecodex/sars-cov-2

SARS-CoV-2 variant calling

This pipeline performs consensus assembly and variant calling for amplicon sequencing data (Illumina or Oxford Nanopore) generated using the ARTIC protocol. You can specify the primer set used to trim alignments; if none is specified, ARTIC V4.1 is assumed.

Pipeline overview

The pipeline takes in a single FASTQ file (interleaved if Illumina) and processes it as follows:

  1. Map reads to the Wuhan-Hu-1 reference and trim ARTIC primer sequences
  2. Generate a consensus sequence (bcftools for Illumina; medaka for Oxford Nanopore)
  3. Call variants with coverage capped at 200x, as recommended by the ARTIC network. Both indels and SNVs are reported for Illumina data; only SNVs are reported for Oxford Nanopore data, because benchmarking studies indicate that small-indel detection is unreliable on that platform.
  4. Assign Pangolin and Nextclade lineages
  5. Predict amino acid mutations
  6. Predict consequences of compound variants (e.g., adjacent SNVs in the same codon, or a frame-shifting indel followed by a frame-restoring indel)
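
Step 6 matters because two SNVs in the same codon can produce an amino acid that neither SNV would produce alone. The following is a minimal sketch of that idea, not the pipeline's own implementation: the function name `codon_effects` and its input format are hypothetical, and only the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, built in TCAG order (not taken from the pipeline).
BASES = "TCAG"
AMINO = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_effects(ref_codon, snvs):
    """Compare per-SNV and combined amino acid changes for one codon.

    `snvs` is a list of (offset_in_codon, alt_base) tuples; this input
    format is a hypothetical one chosen for the illustration.
    Returns (reference_aa, [aa for each SNV alone], aa with all SNVs applied).
    """
    ref_aa = CODON_TABLE[ref_codon]

    # Effect of each SNV applied on its own to the reference codon.
    independent = []
    for offset, alt in snvs:
        codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
        independent.append(CODON_TABLE[codon])

    # Effect of all SNVs applied together (the compound variant).
    combined = list(ref_codon)
    for offset, alt in snvs:
        combined[offset] = alt
    return ref_aa, independent, CODON_TABLE["".join(combined)]


# AGT (S): A->C alone gives CGT (R), G->T alone gives ATT (I),
# but both together give CTT (L).
print(codon_effects("AGT", [(0, "C"), (1, "T")]))  # ('S', ['R', 'I'], 'L')
```

Annotating each SNV independently would report S to R and S to I here, while the sample actually encodes S to L, which is why compound variants are evaluated together.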

Quality checks run throughout the pipeline: variants in low-complexity regions are flagged for error-prone Oxford Nanopore data, and lineage calls are conservative (no lineage assignment is reported if the consensus sequence contains too many Ns or is too fragmented).
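
A conservative lineage gate of this kind can be sketched as follows. This is an illustration only: the function name and both thresholds (`max_n_fraction`, `max_segments`) are hypothetical placeholders, since the pipeline's actual cutoffs are not stated here.

```python
import re


def consensus_passes_qc(seq, max_n_fraction=0.10, max_segments=10):
    """Return True if a consensus looks complete enough for a lineage call.

    Both thresholds are hypothetical values chosen for this sketch,
    not the pipeline's real settings.
    """
    seq = seq.upper()
    if not seq:
        return False

    # "Too many Ns": fraction of masked or uncalled bases.
    n_fraction = seq.count("N") / len(seq)

    # "Too fragmented": number of contiguous called stretches between N runs.
    segments = re.findall(r"[^N]+", seq)

    return n_fraction <= max_n_fraction and 0 < len(segments) <= max_segments
```

A caller would skip lineage assignment whenever this returns False, rather than report a lineage from an incomplete genome.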

In addition to a results JSON, a PDF report is generated for each sample that tells you at a glance whether primer dropout has occurred, which amino acid mutations are present, and whether the sample contains a variant of concern. An example report is shown below.

Example report

Quick start

Build the Docker image:

docker build -t covid19 .

Run the pipeline in the Docker image (the FASTQ files are stored in Git LFS, so you may need to run git lfs pull first):

docker \
  run \
  --rm \
  --workdir /data \
  --volume "$(pwd)":/data \
  --entrypoint /bin/bash \
  --env ONE_CODEX_REPORT_FILENAME=report.pdf \
  --env INSTRUMENT_VENDOR=Illumina \
  --env ARTIC_PRIMER_VERSION=4.1 \
  covid19 \
  jobscript.sh \
  data/twist-target-capture/RNA_control_spike_in_10_6_100k_reads.fastq.gz

For Oxford Nanopore:

docker \
  run \
  --rm \
  --workdir /data \
  --volume "$(pwd)":/data \
  --entrypoint /bin/bash \
  --env ONE_CODEX_REPORT_FILENAME=report.pdf \
  --env INSTRUMENT_VENDOR="Oxford Nanopore" \
  --env ARTIC_PRIMER_VERSION=4.1 \
  covid19 \
  jobscript.sh \
  data/twist-target-capture/RNA_control_spike_in_10_6_100k_reads.fastq.gz
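
Illumina input is expected as a single interleaved FASTQ file. If paired-end reads arrive as separate R1 and R2 files, they can be interleaved before running the pipeline. The helper below is a minimal sketch assuming gzip-compressed inputs with reads in the same order in both files; the function names and file paths are hypothetical and not part of this repository.

```python
import gzip
from itertools import islice, zip_longest


def read_records(handle):
    """Yield FASTQ records as lists of four newline-terminated lines."""
    while True:
        record = [line.rstrip("\n") + "\n" for line in islice(handle, 4)]
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def interleave_fastq(r1_path, r2_path, out_path):
    """Write R1/R2 records alternately to out_path; return the pair count.

    Assumes both inputs are gzip-compressed and list reads in the same order.
    """
    pairs = 0
    with gzip.open(r1_path, "rt") as r1, \
            gzip.open(r2_path, "rt") as r2, \
            gzip.open(out_path, "wt") as out:
        for rec1, rec2 in zip_longest(read_records(r1), read_records(r2)):
            if rec1 is None or rec2 is None:
                raise ValueError("R1 and R2 contain different numbers of reads")
            out.writelines(rec1)
            out.writelines(rec2)
            pairs += 1
    return pairs
```

For example, `interleave_fastq("sample_R1.fastq.gz", "sample_R2.fastq.gz", "sample.fastq.gz")` (hypothetical file names) would produce a file that can be passed to jobscript.sh in place of the example FASTQ.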

Development & Testing

To run tests, run pytest.

The requirements.txt file lists the dependencies needed to run golden output tests across a variety of datasets. This repository uses GitHub Actions to automatically build the Docker image and run these tests, ensuring that parameter and pipeline changes don't affect variant calls or consensus sequence generation.

Currently, integration tests are run on:

  • Simulated Illumina data from the SARS-CoV-2 reference including simulated variants across the genome
  • Example Twist hybrid capture data (Illumina)
  • Example ARTIC v1 amplicon sequencing data (Illumina)
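
The idea behind a golden output test can be sketched as below: variant calls from a fresh run are compared against a stored, known-good set, and any difference fails the test. This is an illustration under assumptions, not the repository's actual test code; the function names are hypothetical, and the inputs are assumed to be VCF text with the standard CHROM, POS, ID, REF, ALT leading columns.

```python
def parse_variants(vcf_text):
    """Collect (chrom, pos, ref, alt) tuples from VCF text, skipping headers."""
    variants = set()
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        variants.add((chrom, int(pos), ref, alt))
    return variants


def diff_against_golden(observed_vcf, golden_vcf):
    """Report variants missing from, or unexpected in, the observed calls."""
    observed = parse_variants(observed_vcf)
    golden = parse_variants(golden_vcf)
    return {
        "missing": sorted(golden - observed),
        "unexpected": sorted(observed - golden),
    }


def test_variant_calls_match_golden():
    # Inline strings stand in for files a real test would read from disk.
    golden = "##fileformat=VCFv4.2\nMN908947.3\t241\t.\tC\tT\nMN908947.3\t23403\t.\tA\tG\n"
    observed = "##fileformat=VCFv4.2\nMN908947.3\t23403\t.\tA\tG\nMN908947.3\t241\t.\tC\tT\n"
    assert diff_against_golden(observed, golden) == {"missing": [], "unexpected": []}
```

Comparing sets of variant keys rather than raw file contents keeps such a test insensitive to header or record-order changes while still catching any change in the calls themselves.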

The repository also uses pre-commit to keep the code clean and consistent. To get started, first install the requirements (Python 3 required): pip install -r requirements.txt. Then install the pre-commit hooks: pre-commit install --install-hooks.

Acknowledgments

Many thanks are due across the community, including but not limited to: