maxATAC: a suite of user-friendly, deep neural network models for transcription factor binding prediction from ATAC-seq

Introduction

Cellular behavior is the result of complex genomic regulation partially controlled by the activity of DNA binding proteins called transcription factors (TFs). TFs bind DNA in a sequence specific manner to regulate gene transcription. TFs, other DNA binding proteins, nucleosomes, and structural proteins are all involved in the regulation of gene expression and their physical interaction with DNA molecules results in changes in the accessibility of the underlying DNA sequence. The assay for transposase accessible chromatin uses a hyperactive Tn5 transposase to probe for genomic regions that are accessible to cleavage, and in turn, accessible to TF binding. It has been shown that distinct patterns in genomic Tn5 cleavage signal can be used to identify TF binding positions that are partially protected from Tn5 cleavage, known as TF footprints. ATAC-seq can also be used to identify regions of the genome that are generally accessible. Here we present a method to predict TF binding by learning from ATAC-seq accessibility signal and the underlying DNA sequence of TF binding locations identified by ChIP-seq.

The maxATAC package is a collection of tools used for learning to predict TF binding from ATAC-seq data. MaxATAC also provides functions for interpreting trained models and preparing the input data.

Requirements

This version requires python 3.6 and BEDTools.

Predicting TF Binding Workflow Overview

Steps in training and assessing a maxATAC model. Relevant functions are listed below each step.

Prepare Input Data
- average
- normalize
Train a model
- roi
- train
Predict in new cell type
- predict
Benchmark models against experimental data
- benchmark
Learn features import to predict TF binding with a neural network
- interpret

Functions

The maxATAC bundle has several useful functions needed for building a deep learning model of TF binding.

Data Pre-processing:

average
normalize

TF Binding Prediction Functions:

train
predict
benchmark
roi
interpret

Average

The average function will average multiple bigwig files into a single output bigwig file.

This function can take a list of input bigwig files and average their scores using pyBigWig. The only requirement for the bigwig files is that they contain the same chromosomes or there might be an error about retrieving scores.

Workflow Overview

Create directories and set up filenames
Build a dictionary of chromosome sizes and filter it based on desired chromosomes to average
Open the bigwig file for writing
Loop through each entry in the chromosome sizes dictionary and calculate the average across all inputs
Write the bigwig file

Normalize

The normalize function will take an input bigwig file and minmax normalize the values genome wide.

This function will min-max a bigwig file based on the minimum and maximum values in the chromosomes of interest. The code will loop through each chromosome and find the min and max values. It will then create a dataframe of the values per chromosome. It will then scale all other values between [0,1].

Workflow Overview

Create directories and set up filenames
Build a dictionary of the chromosomes sizes.
Find the genomic min and max values by looping through each chromosome
Loop through each chromosome and minmax normalize the values based on the genomic values.

Train

The train function takes as input ATAC-seq signal, DNA sequence, Delta ATAC-seq signal, and ChIP-seq signal to train a neural network with the architecture of choice.

The primary input to the training function is a meta file that contains all of the information for the locations of ATAC-seq signal, ChIP-seq signal, TF, and Cell type.

Example header for meta file. The meta file must be a tsv file, but the order of the columns does not matter. As long as the column names are the same:

Workflow Overview

Create directories and set up filenames
Initialize the model based on the desired architectures
Read in the meta table
Read in training and validation pool
Initialize the training generator
Initialize the validation generator
Fit the models with the specific parameters

Predict

The predict function takes as input BED formatted genomic regions to predict TF binding using a trained maxATAC model.

BED file requirements for prediction. You must have at least a 3 column file with chromosome, start, and stop coordinates. The interval distance has to be the same as the distance used to train the model. If you trained a model with a resolution 1024, you need to make sure your intervals are spaced 1024 bp apart for prediction with your model.

Example input BED file for prediction:

chr1 1000 2024

Workflow Overview

Create directories and set up filenames
Make predictions
Convert predictions to bigwig format and write results

Benchmark

The benchmark function takes as input a prediction bigwig signal track and a ChIP-seq gold standard bigwig track to calculate precision and recall.

The inputs need to be in bigwig format to use this function. You can also provide a custom blacklist to filter out regions that you do not want to include in your comparison. We use a np.mask to exclude these regions.

Currently, benchmarking is set up for one chromosome at a time. The most time-consuming step is importing and binning the input bigwig files to resolutions smaller than 100bp. We are also only benchmarking on whole chromosomes at the moment so everything not in the blacklist will be considered a potential region.

Workflow Overview

Create directories and set up filenames
Get the blacklist mask using the input blacklist and bin it at the same resolution as the predictions and GS
Calculate the AUPR

ROI

The roi function will generate regions of interest based on input ChIP-seq, ATAC-seq, or randomly generated regions.

This method will use the run meta file to merge the ATAC-seq peaks and ChIP-seq peaks into BED files where each entry is an example region. ONE MAJOR ASSUMPTION IS THAT THE TEST CELL LINE IS NOT INCLUDED IN THE META FILE!!!!!

The input meta file must have the columns in any order:

Workflow Overview

Import the ATAC-seq and ChIP-seq and filter for training chromosomes
Write the ChIP, ATAC, and combined ROI pools with stats
Import the ATAC-seq and ChIP-seq and filter for validation chromosomes
Write the ChIP, ATAC, and combined ROI pools with stats

Interpret

The interpret function will interpret a trained maxATAC model using TFmodisco, DeepLift, and DeepShap.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

maxATAC: a suite of user-friendly, deep neural network models for transcription factor binding prediction from ATAC-seq

Introduction

Requirements

Predicting TF Binding Workflow Overview

Functions

Average

Normalize

Train

Predict

Benchmark

ROI

Interpret

Files

README.md

Latest commit

History

README.md

File metadata and controls

maxATAC: a suite of user-friendly, deep neural network models for transcription factor binding prediction from ATAC-seq

Introduction

Requirements

Predicting TF Binding Workflow Overview

Functions

Average

Normalize

Train

Predict

Benchmark

ROI

Interpret