PLoS One. 2020 Apr 30;15(4):e0232332.
doi: 10.1371/journal.pone.0232332. eCollection 2020.

Combining signal and sequence to detect RNA polymerase initiation in ATAC-seq data

Ignacio J Tripodi et al. PLoS One. 2020.

Abstract

The assay for transposase-accessible chromatin followed by sequencing (ATAC-seq) is an inexpensive protocol for measuring open chromatin regions. ATAC-seq is also relatively simple and requires fewer cells than many other high-throughput sequencing protocols. It is therefore tractable in numerous settings where other high-throughput assays are challenging or impossible, which makes it important to understand the limits of what can be inferred from ATAC-seq data. In this work, we leverage ATAC-seq to predict the presence of nascent transcription. Nascent transcription assays are the current gold standard for identifying regions of active transcription, including markers for functional transcription factor (TF) binding. We combine mapped short reads from ATAC-seq with the underlying peak sequence to determine regions of active transcription genome-wide. We show that a hybrid signal/sequence representation classified using recurrent neural networks (RNNs) can identify these regions across different cell types.

Conflict of interest statement

One author (RDD) of this publication is a founder and scientific advisor for Arpeggio Biosciences. Dr. Dowell is not employed by Arpeggio but rather consults occasionally with the company. We also note that no aspect of this work was funded by or influenced in any way by the company. This work is funded entirely by NIH R01 GM125871. No aspect of our funding alters our adherence to PLOS ONE policies on sharing data and materials.

Figures

Fig 1. Hybrid encoding and RNN model architecture.
(a) A vector embedding was trained for each nucleotide (top left; other base symbols following the IUPAC convention are also included). For our signal/sequence hybrid model, we generated a 50-dimension training vector for each peak by combining nucleotide information (a vector embedding based on neighboring nucleotides) and the number of ATAC-seq reads mapped to that nucleotide, normalized by millions mapped. In this example, we show how a small portion of an OCR detected with ATAC-seq (top right, green) with the sequence ACTTCCT would be represented in two dimensions (bottom, one nucleotide per column), with the first row reflecting the normalized read coverage for each of those nucleotides and the rest of each column consisting of the nucleotide’s dense vector representation. (b) Nucleotides in the 1kbp evaluation window, extracted from the reference genome (bottom blue layer), are passed to an embedding layer (orange) to generate a dense vector representation of each. The peak signal level associated with each nucleotide (middle blue layer; i.e., the number of mapped ATAC-seq reads normalized by millions mapped) is then combined with the nucleotide embedding vector (purple layer, vector representation shown in panel a). Each vector is passed to a gated recurrent unit in each direction (green layer) to capture the long- and short-term relations between nucleotides, and the outputs from the last forward and reverse gates are concatenated to be used for the final prediction.
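The column layout described in panel (a) can be illustrated with a minimal sketch. Several things here are assumptions rather than details from the paper: the 50 values are split as 1 coverage value plus a 49-value embedding, fixed random vectors stand in for the trained embedding, the alphabet is reduced to five symbols, and the read counts, the millions-mapped value, and all function names are hypothetical.

```python
import random

EMBED_DIM = 49      # assumption: 49 embedding values + 1 coverage value = 50 rows
ALPHABET = "ACGTN"  # assumption: reduced alphabet, not the full IUPAC symbol set


def make_embedding(alphabet, dim, seed=0):
    """Stand-in for a trained embedding: one fixed random vector per symbol."""
    rng = random.Random(seed)
    return {base: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for base in alphabet}


def hybrid_encode(sequence, read_counts, millions_mapped, embedding):
    """Return one column per nucleotide: [normalized coverage] + embedding vector."""
    if len(sequence) != len(read_counts):
        raise ValueError("sequence and read_counts must have the same length")
    columns = []
    for base, count in zip(sequence.upper(), read_counts):
        coverage = count / millions_mapped  # reads normalized by millions mapped
        columns.append([coverage] + embedding[base])
    return columns


embedding = make_embedding(ALPHABET, EMBED_DIM)
# Made-up per-base read counts for the ACTTCCT example from the caption.
columns = hybrid_encode("ACTTCCT", [3, 5, 8, 13, 8, 5, 3], 20.0, embedding)
print(len(columns), len(columns[0]))
```

Each column keeps the coverage in its first position, so two identical nucleotides share the same embedding rows but can differ in signal.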
Fig 2. Accessibility vs. transcription.
Each point in this scatter plot is an ATAC-seq peak, where we compare the mean number of mapped ATAC-seq reads in its 1kbp evaluation window (y-axis) to the mean number of mapped nascent transcription reads in that same window (x-axis). There is essentially no correlation (r² = 0.084) between the two, so this average peak metric is not sufficient to predict active transcription.
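A squared correlation of this kind can be computed as follows. This sketch assumes the reported value is the square of the Pearson correlation coefficient, which the caption does not state; the per-window means are made-up numbers, not data from the paper.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-window means: nascent transcription (x) and ATAC-seq (y).
nascent_means = [0.1, 2.5, 0.0, 7.2, 0.4, 1.1]
atac_means = [3.0, 2.8, 4.1, 3.3, 1.9, 5.0]
r_squared = pearson_r(nascent_means, atac_means) ** 2
print(round(r_squared, 3))
```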
Fig 3. Classifier performance across cell types.
(a) Receiver operating characteristic (ROC) area under the curve (AUC, light blue), (b) F1-score (tan), and (c) RNN training time (green) for LOOT-based performance evaluation. OCRs from each cell type tested are displayed using the same marker (see key).
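For readers unfamiliar with the two metrics in panels (a) and (b), the sketch below computes them from scratch. The labels, scores, and the 0.5 decision threshold are illustrative assumptions, not values from the paper, and the function names are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def f1_score(labels, predictions):
    """Harmonic mean of precision and recall for binary labels."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: 1 = transcribed OCR, 0 = non-transcribed OCR.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.8, 0.2]
predictions = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold
auc = roc_auc(labels, scores)
f1 = f1_score(labels, predictions)
print(auc, f1)
```

AUC is threshold-free, while F1 depends on the chosen cutoff, which is one reason to report both.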
Fig 4. Cell type focused strategy results.
ROC curves resulting from testing on the different OCRs corresponding to each cell type, in a leave-one-out fashion.
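A leave-one-out split over cell types can be sketched as below. This assumes that each round holds out all OCRs from one cell type for testing and trains on the rest, which is one reading of the caption; the cell type names and OCR identifiers are placeholders.

```python
def leave_one_out_splits(ocrs_by_cell_type):
    """Yield (held_out, train, test), holding out one cell type per round."""
    for held_out in sorted(ocrs_by_cell_type):
        train = [
            ocr
            for cell_type, ocrs in ocrs_by_cell_type.items()
            if cell_type != held_out
            for ocr in ocrs
        ]
        test = list(ocrs_by_cell_type[held_out])
        yield held_out, train, test


# Placeholder cell types and OCR identifiers.
toy = {
    "cell_A": ["a1", "a2"],
    "cell_B": ["b1"],
    "cell_C": ["c1", "c2", "c3"],
}
splits = list(leave_one_out_splits(toy))
for held_out, train, test in splits:
    print(held_out, len(train), len(test))
```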
Fig 5. Distribution of ATAC-seq reads for classification results.
Distribution of mapped reads from ATAC-seq SRRs, for OCRs corresponding to the training set (green histograms, top) and each classification metric (blue histograms, metric noted in upper right corner of each panel). Note the difference in y-axis scales among plots, as the size of each set differs.
Fig 6. Meta-peaks from ATAC-seq signal at OCRs.
Meta-peak plot generated by combining the ATAC-seq signal at each 1kbp evaluation window centered at OCRs for the entire training set (top row, green axis) and each classification metric: true positives (mid left), true negatives (mid right), false positives (bottom left), and false negatives (bottom right). Note the difference in scales among plots, to emphasize the characteristic shape in each scenario.
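One simple way to build such a meta-peak is a position-wise average over equally sized, center-aligned windows. The caption does not say whether the signal is summed or averaged, so the mean used here is an assumption, and the short windows are toy values rather than 1kbp profiles.

```python
def meta_peak(windows):
    """Position-wise mean signal across equally sized, center-aligned windows."""
    if not windows:
        raise ValueError("need at least one window")
    width = len(windows[0])
    if any(len(window) != width for window in windows):
        raise ValueError("all windows must have the same length")
    return [sum(window[i] for window in windows) / len(windows) for i in range(width)]


# Toy per-base coverage for three windows of width 5 (real windows would be 1000 wide).
windows = [
    [0, 1, 4, 1, 0],
    [0, 2, 6, 2, 0],
    [1, 0, 2, 0, 1],
]
profile = meta_peak(windows)
print(profile)
```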
Fig 7. Distribution of nascent transcription reads for classification results.
Distribution of mapped reads from nascent transcription SRRs, for OCRs corresponding to the training set (green histograms, top) and each classification metric (blue histograms, metric noted in upper right corner of each panel). Note the difference in scales among plots, to better appreciate the distribution of coverage in each scenario. The leftmost bin in the “positive”, “TP”, and “FN” panels corresponds to very low levels of nascent transcription rather than no transcription, which are generally associated with regulatory regions.
Fig 8. Meta-peaks from nascent transcription signal at OCRs.
Meta-peak plot generated by combining the nascent transcription signal at each 1kbp evaluation window centered at OCRs for training data (top row, green axis) and each classification metric (middle and bottom). Signal is color coded by strand (blue is positive strand; red negative strand). Notice the differences in scale among plots, with TPs and FNs sharing the same scale, but distinct from TN and FP.
Fig 9. Commonly observed OCRs dominate performance.
Proportion of OCRs common to every cell type (overlapping in genomic coordinates) categorized in the different performance metrics.
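Finding OCRs that overlap in genomic coordinates across cell types can be sketched with a plain interval test. The definition used here is an assumption: an OCR from a reference cell type counts as common if it overlaps at least one OCR in every other cell type. Intervals are half-open `(chrom, start, end)` tuples with made-up coordinates.

```python
def overlaps(a, b):
    """True if two half-open intervals (chrom, start, end) share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def common_ocrs(reference, others):
    """Reference OCRs that overlap at least one OCR in every other cell type."""
    return [
        ocr
        for ocr in reference
        if all(any(overlaps(ocr, candidate) for candidate in other) for other in others)
    ]


# Made-up OCR coordinates for one reference cell type and two other cell types.
reference = [("chr1", 100, 600), ("chr1", 2000, 2500), ("chr2", 50, 400)]
others = [
    [("chr1", 550, 900), ("chr2", 300, 700)],
    [("chr1", 0, 150), ("chr2", 399, 800), ("chr1", 2400, 2600)],
]
shared = common_ocrs(reference, others)
proportion = len(shared) / len(reference)
print(shared, proportion)
```

The nested scan is quadratic in the number of OCRs; genome-scale data would call for sorted intervals or an interval tree instead.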
Fig 10. TSS regions are generally harder to classify than non-TSS ones (regulatory sites).
ROC curves for OCRs overlapping TSSs (green) and non-TSS OCRs (red), for each test set. The orange curves correspond to all OCRs for that test set.
