quiltdata/quilt-package-metadata-athena

Streamlining NGS Insights: From Raw Data to Athena Queries with Quilt Packages and Metadata

In this tutorial series, we demonstrate the benefit of packaging raw -omics data in Quilt packages with attached sample-level metadata. Annotating packages with workflow-standardized metadata enables the creation of AWS Athena tables, joining sample-level metadata & pipeline outputs (e.g. Nextflow) from your processed NGS data. Together, joining these two data sources in Athena allows users to query large datasets across multiple processing runs and cohorts efficiently using SQL.

For example, use an SQL query within a Jupyter Notebook to generate a table of EGFR expression across all colon cancer cell lines, where "colon cancer" represents a piece of sample-level metadata from the raw data Quilt packages, and "EGFR expression" is a piece of processed data from packaged Nextflow pipeline outputs.

The ultimate goal of this demo is to provide an end-to-end framework, from raw data to analysis, to maximize the utility of your NGS data, and make querying your datasets fast & easy (no more searching through directories & file systems to find specific sample or run IDs!).

Dataset & Nextflow Pipeline

For the purpose of this tutorial, we are using a subset of publicly available RNA-sequencing data generated by the Cancer Cell Line Encyclopedia (CCLE) initiative.

RNA-sequencing data is processed using the nf-core/rnaseq Nextflow pipeline with the nf-quilt plugin, which packages the pipeline outputs into a Quilt package with the pipeline parameters attached as metadata.

Although focused on bulk RNA-seq data, this tutorial is generalizable: the core principles apply across data types, and the workflow can be reproduced with your in-house datasets.

Project Outline

We have generated a series of 4 core tutorials (+ 1 optional) demonstrating a framework to go from raw NGS data to annotated Quilt data packages with metadata & Nextflow pipeline outputs, to enable quick data access & queries through AWS Athena.

1. Annotated Quilt Packages for Raw NGS Data & Metadata

00_curate_raw_ccle_rnaseq_data.ipynb (optional)
01_create_metadata_workflow_schema.ipynb
02_generate_raw_data_pkgs_with_metadata.ipynb

Raw data is either generated in-house by an instrument or, as in this demo, curated from a public source. Here, we downloaded raw RNA-sequencing data as FASTQ files from the Sequence Read Archive (SRA). The raw sequencing data was then packaged into Quilt packages, one package per sample.

Sample-level metadata describing both biological (tumor type, patient age, histology ...) and technical (sequencer used, library kit, freezing media used for storage ...) features of the sample were obtained from SRA and attached as metadata to each Quilt package housing raw data.
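As a rough illustration of the one-package-per-sample idea, the sketch below uses the `quilt3` Python client to package a directory of FASTQ files and attach a metadata dictionary. The `ccle` namespace, the sample ID, the metadata keys, and the bucket name are all hypothetical placeholders, not values taken from the notebooks.

```python
from pathlib import Path


def package_name_for(sample_id: str, namespace: str = "ccle") -> str:
    """Build a Quilt package name of the form 'namespace/name' for one sample."""
    # The 'ccle' namespace is an assumption made for this sketch.
    return f"{namespace}/{sample_id.strip()}"


def push_sample_package(sample_id: str, fastq_dir: Path, meta: dict, registry: str) -> str:
    """Package one sample's FASTQ directory with its metadata and push it."""
    import quilt3  # imported lazily so package_name_for works without quilt3

    name = package_name_for(sample_id)
    pkg = quilt3.Package()
    pkg.set_dir(".", str(fastq_dir))  # add every file under fastq_dir
    pkg.set_meta(meta)                # attach sample-level metadata to the package
    pkg.push(name, registry=registry, message=f"Raw FASTQ files for {sample_id}")
    return name
```

A call might look like `push_sample_package("SAMPLE_001", Path("fastq/SAMPLE_001"), {"tumor_type": "colon cancer"}, "s3://my-example-bucket")`, where every argument is a placeholder and the push requires valid AWS credentials.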

Quilt workflows & metadata schemas were used to ensure the integrity of the metadata across samples -- a key step to maximize the utility of sample metadata in downstream analysis! No more Tumor vs. tumor vs. tumour...!!
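To make the "Tumor vs. tumor vs. tumour" problem concrete, here is a minimal sketch of the kind of check a metadata schema performs. Quilt workflows validate against JSON Schema; the hand-written checker below only covers the `required` and `enum` keywords and is a simplification, and the field names and allowed values are hypothetical.

```python
from typing import List

# Hypothetical schema fragment: field names and allowed values are illustrative.
SAMPLE_SCHEMA = {
    "type": "object",
    "required": ["sample_id", "tissue_type"],
    "properties": {
        "sample_id": {"type": "string"},
        "tissue_type": {"type": "string", "enum": ["tumor", "normal"]},
    },
}


def find_metadata_errors(meta: dict, schema: dict = SAMPLE_SCHEMA) -> List[str]:
    """Return a list of problems; an empty list means the metadata passed."""
    errors = []
    # Every required field must be present.
    for field in schema.get("required", []):
        if field not in meta:
            errors.append(f"missing required field: {field}")
    # Fields with an 'enum' must use one of the controlled values exactly.
    for field, rules in schema.get("properties", {}).items():
        allowed = rules.get("enum")
        if allowed is not None and field in meta and meta[field] not in allowed:
            errors.append(f"{field}={meta[field]!r} is not one of {allowed}")
    return errors
```

With this schema, `{"sample_id": "S1", "tissue_type": "Tumour"}` is rejected while `"tumor"` is accepted, so spelling variants are caught before a package is pushed rather than during downstream analysis.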

2. Tractable Nextflow Pipeline Processing with nf-quilt

03_run_nfcore_rnaseq_with_nfquilt.ipynb

The Nextflow nf-core/rnaseq pipeline, in conjunction with nf-quilt, was used to process raw sequencing data (FASTQ files) and generate per-sample expression values. Samples were processed together in batches (called "runs"), mirroring common practice in NGS centers, where multiple samples on a sequencing flow cell are pre-processed at the same time. The nf-quilt plugin automatically packages the Nextflow pipeline output into a Quilt package and appends detailed pipeline run metadata to the package.
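The sketch below shows one way a notebook might assemble the Nextflow command for a run. It assumes, based on our reading of the nf-quilt documentation, that the plugin is enabled with Nextflow's `-plugins` option and that the output location is a Quilt URI of the form `quilt+s3://bucket#package=namespace/name`; please confirm both against the plugin version you install. The bucket, package name, samplesheet path, and `docker` profile are hypothetical.

```python
from typing import List


def quilt_outdir(bucket: str, package: str) -> str:
    """Build a Quilt URI to pass as the pipeline's output directory (assumed format)."""
    return f"quilt+s3://{bucket}#package={package}"


def nextflow_command(samplesheet: str, bucket: str, package: str) -> List[str]:
    """Assemble the argument list for one batch ("run") of samples."""
    return [
        "nextflow", "run", "nf-core/rnaseq",
        "-profile", "docker",        # assumption: Docker is available locally
        "-plugins", "nf-quilt",      # enable the nf-quilt plugin for this run
        "--input", samplesheet,      # nf-core samplesheet listing the FASTQ files
        "--outdir", quilt_outdir(bucket, package),
    ]
```

The resulting list can be handed to `subprocess.run(..., check=True)`; building it as a list rather than a shell string avoids quoting problems with the `#` in the URI.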

3. Metadata & Pipeline Results Data Lake

04_athena_metadata_nfcore_output.ipynb

To enable valuable data searches, we must align the sample-level metadata attached to the raw data packages with the pipeline outputs. In this demo, the primary outputs of the pipeline are expression tables. With Athena, it's possible to join the sample metadata and pipeline output tables, enabling quick queries and flexible slicing of large datasets.
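The join itself is ordinary SQL. The toy example below uses an in-memory SQLite database as a local stand-in for Athena so it runs without AWS access; the table names, column names, and values are invented for illustration and do not reflect the schema built in the notebook.

```python
import sqlite3

# In-memory SQLite stands in for Athena; all names and values are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sample_metadata (sample_id TEXT, cell_line TEXT, tumor_type TEXT);
CREATE TABLE gene_expression (sample_id TEXT, gene_name TEXT, tpm REAL);
INSERT INTO sample_metadata VALUES
  ('S1', 'LINE_A', 'colon cancer'),
  ('S2', 'LINE_B', 'colon cancer'),
  ('S3', 'LINE_C', 'lung cancer');
INSERT INTO gene_expression VALUES
  ('S1', 'EGFR', 12.5), ('S1', 'TP53', 3.0),
  ('S2', 'EGFR', 48.0), ('S3', 'EGFR', 7.25);
""")

# Metadata filters the samples; the pipeline output supplies the values.
QUERY = """
SELECT m.cell_line, e.tpm
FROM sample_metadata AS m
JOIN gene_expression AS e ON e.sample_id = m.sample_id
WHERE m.tumor_type = 'colon cancer' AND e.gene_name = 'EGFR'
ORDER BY m.cell_line
"""

rows = conn.execute(QUERY).fetchall()
print(rows)  # [('LINE_A', 12.5), ('LINE_B', 48.0)]
```

The shared `sample_id` column is what makes the join possible, which is why consistent sample identifiers across raw data packages and pipeline outputs matter.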

4. Efficiently Query & Analyze Pipeline Outputs Alongside Sample Metadata

06_query_athena_data_and_perform_analysis

Once Athena is enabled, the world (or data in this case...) is the Computational Biologist's oyster! Computational biologists can now use Athena to run SQL queries that return exactly the subsets of data they need, quickly and efficiently. Queries can be performed directly in Jupyter notebooks, enabling seamless data loading upstream of analysis.
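One possible way to run such a query from a notebook is the AWS SDK for pandas (`awswrangler`), whose `athena.read_sql_query` returns a pandas DataFrame. This is a sketch under assumptions: the notebooks may use a different client, and the table names, column names, and database name below are hypothetical.

```python
def sql_literal(value: str) -> str:
    """Quote a string as a SQL literal, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def expression_query(gene: str, tumor_type: str) -> str:
    """Build a query for one gene across all samples of one tumor type."""
    # Table and column names are placeholders for whatever your Athena tables use.
    return (
        "SELECT m.sample_id, e.tpm "
        "FROM sample_metadata AS m "
        "JOIN gene_expression AS e ON e.sample_id = m.sample_id "
        f"WHERE m.tumor_type = {sql_literal(tumor_type)} "
        f"AND e.gene_name = {sql_literal(gene)}"
    )


def run_in_athena(sql: str, database: str):
    """Run the query in Athena and return the result as a pandas DataFrame."""
    import awswrangler as wr  # imported lazily; needs AWS credentials at call time

    return wr.athena.read_sql_query(sql=sql, database=database)
```

In a notebook cell this might read `df = run_in_athena(expression_query("EGFR", "colon cancer"), database="my_example_db")`, after which `df` can go straight into plotting or statistics code.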

In contrast, without Athena capabilities, comp bio folks would have to figure out which samples they want by loading a master metadata table from somewhere, do some detective work to track down where the output tables for those samples live, and then load those files one by one.

Additionally, Athena tables are compatible with interactive dashboards (e.g. Tableau, Spotfire, QuickSight), allowing you to keep track of the number of samples, which samples, or other accounting metrics that may be helpful beyond computational teams (business development, project management) in a "no-code" manner.

Pre-Requisites

The tutorials are provided as fully executable Jupyter Notebooks. To run them, you will need:

  1. Python >=3.7
  2. Required Python packages: pip install -r requirements.txt
  3. AWS credentials
  4. Quilt Open Data Account
  5. Nextflow Tower Account (optional)

Questions?

We love to help! Please reach out to the Quilt Data team with any comments or questions. Let's get your data up to snuff together!