PLoS One. 2019 Jun 17;14(6):e0218073.
doi: 10.1371/journal.pone.0218073. eCollection 2019.

Deciphering regulatory DNA sequences and noncoding genetic variants using neural network models of massively parallel reporter assays

Rajiv Movva et al. PLoS One. 2019.

Abstract

The relationship between noncoding DNA sequence and gene expression is not well understood. Massively parallel reporter assays (MPRAs), which quantify the regulatory activity of large libraries of DNA sequences in parallel, are a powerful approach to characterize this relationship. We present MPRA-DragoNN, a convolutional neural network (CNN)-based framework to predict and interpret the regulatory activity of DNA sequences as measured by MPRAs. While our method is generally applicable to a variety of MPRA designs, here we trained our model on the Sharpr-MPRA dataset, which measures the activity of ∼500,000 constructs tiling 15,720 regulatory regions in human K562 and HepG2 cell lines. MPRA-DragoNN predictions were moderately correlated (Spearman ρ = 0.28) with measured activity and were within range of replicate concordance of the assay. State-of-the-art model interpretation methods revealed high-resolution predictive regulatory sequence features that overlapped transcription factor (TF) binding motifs. We used the model to investigate the cell type and chromatin state preferences of predictive TF motifs. We explored the ability of our model to predict the allelic effects of regulatory variants in an independent MPRA experiment and to fine-map putative functional SNPs in loci associated with lipid traits. Our results suggest that interpretable deep learning models trained on MPRA data have the potential to reveal meaningful patterns in regulatory DNA sequences and prioritize regulatory genetic variants, especially as larger, higher-quality datasets are produced.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Predicting regulatory activity in MPRAs using convolutional neural networks.
(A) Outline of the design of Sharpr-MPRA experiments used in this study. A collection of DNA constructs is cloned into a plasmid library upstream of a promoter (magenta) and transfected into a population of cells. Each construct is linked to a unique barcode (BC) located in the transcribed region; measuring the abundance of these barcodes using high-throughput sequencing allows for evaluation of the regulatory activity of each construct. (B) In the Sharpr-MPRA design, each of ∼15,000 candidate 295-bp cis-regulatory elements is tiled with 145-bp constructs at 5-bp offsets, and each construct is cloned upstream of either a minimal promoter (minP) or a strong promoter (SV40P). (C) Reproducibility between individual replicate Sharpr-MPRA measurements of regulatory activity (data shown are for K562 cells with the minP promoter). (D) Overview of the MPRA-DragoNN convolutional multi-task neural network architecture. The genomic DNA sequence for each tested MPRA construct is transformed from nucleotides (in ACGT alphabet) to a 145 × 4 one-hot encoded array. Three convolution layers and a fully-connected (FC) layer are then applied to predict four tasks (regulatory activity for the two cell lines with each of the two promoters). Each convolutional layer consists of 120 filters of length 5 (rectangles) that move along the sequence, searching for specific patterns of length 5 at every possible position. The first convolutional layer can be interpreted as identifying individual DNA sequence recognition motifs, such as those recognized by transcription factors. The second convolutional layer combines nearby potentially interacting motifs, while the third layer abstracts higher-order grammars (positioning, spacing, and other meta-features). Finally, the FC layer synthesizes these patterns with cell type– and promoter–specific information to make activity predictions.
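The tiling in panel (B) and the one-hot encoding in panel (D) can be illustrated with a minimal sketch. It assumes that tiles start every 5 bp and must fit entirely within the 295 bp region. The helper names `tile_starts` and `one_hot` are hypothetical and are not taken from the paper or from the MPRA-DragoNN code.

```python
# Illustrative sketch only; helper names are hypothetical, not from MPRA-DragoNN.

REGION_LEN = 295   # candidate regulatory element length (bp), from the caption
TILE_LEN = 145     # construct length (bp), from the caption
STEP = 5           # tiling offset (bp), from the caption
ALPHABET = "ACGT"  # column order assumed for the one-hot array


def tile_starts(region_len=REGION_LEN, tile_len=TILE_LEN, step=STEP):
    """Start offsets of every full-length tile within one region."""
    return list(range(0, region_len - tile_len + 1, step))


def one_hot(seq):
    """Encode a DNA string as a len(seq) x 4 list of 0/1 rows (A, C, G, T).

    Bases outside ACGT (for example N) become an all-zero row; this is an
    assumption of the sketch, not a documented behaviour of the model.
    """
    return [[1 if base == letter else 0 for letter in ALPHABET]
            for base in seq.upper()]


starts = tile_starts()
tiles_per_region = len(starts)   # number of constructs per 295 bp region
encoded = one_hot("ACGTN")       # 5 x 4 array; the last row is all zeros
```

Under these assumptions each region yields 31 tiles, so 15,720 regions would give 487,320 constructs, roughly in line with the ∼500,000 quoted in the abstract. A full 145 bp construct encodes to the 145 × 4 array that the caption describes as the network input.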
Fig 2. MPRA-DragoNN distinguishes active regulatory sequences at high resolution.
(A) Predicted regulatory activity z-scores vs. experimental activity z-scores for the K562 minP task. (B) Distributions of experimental and predicted regulatory activities for different ChromHMM-inferred chromatin states. (C) K562 DeepLIFT nucleotide score track for a strongly activating regulatory sequence (top 0.1%) containing three TF binding sites (red) as identified by the CENTIPEDE algorithm. All three TFBSs are detected with statistical significance (Mann-Whitney U test). (D) Nucleotides with strong (in absolute value) DeepLIFT scores are more likely to overlap with TF binding sites than control sequences (blue: all nucleotides, green: DNase peak centers). This trend holds for both positive (R = 0.99) and negative scores (R = −0.94).
Fig 3. MPRA-DragoNN DeepLIFT feature importance scores robustly predict functional nucleotides.
(A) Overlap between significant motif instances (Benjamini-Hochberg FDR < 0.1) identified by DeepLIFT and Sharpr. (B) Scatter plot of average DeepLIFT scores for 1934 motifs in HepG2 (x-axis) and K562 (y-axis) [39]. Orange points are discussed in the text. (C) Distributions of average DeepLIFT motif scores for ETS, HNF4, REST, and their respective control motifs (shuffled versions) in both K562 and HepG2. (D) Sharpr-MPRA nucleotide score distributions for (i) motifs that are also DeepLIFT hits, (ii) all motifs, and (iii) negative control shuffled motifs. ***p < 10^−200; n.s., not significant. (E) Positional distribution of DeepLIFT scores (left) and Sharpr scores (right) with respect to the center of ETS motif occurrences. Note that the DeepLIFT plot x-axis ranges from -50 bp to 50 bp while the Sharpr plot ranges from -100 bp to 100 bp. All p-values are computed with the Mann-Whitney U test.
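Panel (A) selects motif instances at a Benjamini-Hochberg FDR below 0.1. As a reminder of how that step-up procedure works, here is a minimal sketch in plain Python. The function name and the example p-values are hypothetical, and nothing here is claimed to match the authors' implementation.

```python
def benjamini_hochberg(p_values, fdr=0.1):
    """Return one boolean per hypothesis: True if rejected at the given FDR."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest 1-based rank k with p_(k) <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Made-up p-values for five motif instances (illustration only).
example_p = [0.001, 0.2, 0.03, 0.04, 0.5]
hits = benjamini_hochberg(example_p, fdr=0.1)
```

Because the rule keeps the largest qualifying rank, a p-value can be rejected even when it exceeds its own rank threshold, provided a larger p-value further down the sorted list passes.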
Fig 4. MPRA-DragoNN reveals patterns of transcription factor activity.
(A) For each TF, we computed the ratio of average “usage” in promoter sequences relative to enhancer sequences. The plot contains z-scores of this ratio for 27 selected transcription factors, colored by their motif family (left). (B) Clustered correlation matrix of TF usage for the 27 factors from (A). Each cell is colored according to the motif usage Spearman correlation for a given pair of TFs across all ∼974,000 sequences. Rows are colored by their motif family.
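The pairwise values in panel (B) are Spearman correlations, which are Pearson correlations computed on ranks. The sketch below shows that calculation in plain Python with average ranks for ties. The function names and the per-sequence usage values are hypothetical and serve only as an illustration.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of values tied with order[i].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rho: Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up usage scores of two TF motifs across four sequences.
usage_a = [0.1, 0.4, 0.35, 0.8]
usage_b = [1.0, 3.0, 2.0, 10.0]
rho = spearman(usage_a, usage_b)
```

Since only the ordering matters, the two made-up vectors above are perfectly rank-correlated even though their scales differ.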
Fig 5. Variant in-silico mutagenesis scores agree with experimental data.
(A) Regulatory activity changes between reference and mutated sequences predicted by MPRA-DragoNN agree with experimentally measured changes [36]. Red points indicate variants that were significant in the wild-type K562 condition (see description in Methods). (B) Detailed examination of a particular variant, rs2269907 at chromosome 17 position 44,294,214. The distribution of epigenetic marks and JunD ChIP-seq signal [2] around the variant reveals that it lies in an active region [51]; the variant appears to correct a mismatch in one of the base pairs of a JunD motif, allowing JunD to bind and regulate expression.
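An in-silico mutagenesis score of this kind is the predicted activity of the mutated sequence minus that of the reference sequence. The sketch below illustrates the idea with a toy scoring function in place of the trained network. Everything in it is hypothetical: `toy_predict` is not MPRA-DragoNN, the sequence is invented rather than taken from the rs2269907 locus, and the motif string is only an AP-1-like consensus chosen for illustration.

```python
BASES = "ACGT"


def toy_predict(seq, motif="TGACTCA"):
    """Hypothetical stand-in for a trained model.

    Returns the best fraction of positions matching `motif` over all windows.
    """
    best = 0
    for i in range(len(seq) - len(motif) + 1):
        matches = sum(a == b for a, b in zip(seq[i:i + len(motif)], motif))
        best = max(best, matches)
    return best / len(motif)


def ism_score(predict, ref_seq, pos, alt_base):
    """Predicted activity of the alternate allele minus the reference allele."""
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    return predict(alt_seq) - predict(ref_seq)


def saturation_mutagenesis(predict, ref_seq):
    """Score every possible single-base substitution, keyed by (pos, base)."""
    return {(pos, base): ism_score(predict, ref_seq, pos, base)
            for pos in range(len(ref_seq))
            for base in BASES
            if base != ref_seq[pos]}


# Invented reference sequence with a one-mismatch motif at positions 4 to 10.
ref = "GGCATGACTCGTTAC"
score = ism_score(toy_predict, ref, pos=10, alt_base="A")  # G>A completes the motif
all_scores = saturation_mutagenesis(toy_predict, ref)
```

In this toy case the G-to-A substitution at position 10 completes the motif and gets a positive score, loosely mirroring the motif-repairing variant described in panel (B). With a real model, `predict` would be replaced by the network's forward pass on the one-hot encoded sequence.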
Fig 6. Dissecting rs174593, a putative causal variant for reduced LDL cholesterol levels.
(A) Volcano plot with in-silico mutagenesis scores on the x-axis and negative log GWAS p-values on the y-axis. Putatively causal variants in the upper right and upper left regions localize to cardiovascular disease-related genes. (B) DeepLIFT track and saturation mutagenesis scores of the locus surrounding rs174593, a potential FADS2 cis-regulatory element. As highlighted by DeepLIFT, the C allele at that position creates an ELK1 motif match, increasing predicted FADS2 expression compared to the T allele.

References

    1. Lee TI, Young RA. Transcriptional Regulation and Its Misregulation in Disease. Cell. 2013;152(6):1237–1251. doi: 10.1016/j.cell.2013.02.014
    2. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489(7414):57–74. doi: 10.1038/nature11247
    3. Roadmap Epigenomics Consortium, Kundaje A, Meuleman W, Ernst J, Bilenky M, Yen A, et al. Integrative analysis of 111 reference human epigenomes. Nature. 2015;518(7539):317–330. doi: 10.1038/nature14248
    4. Melnikov A, Murugan A, Zhang X, Tesileanu T, Wang L, Rogov P, et al. Systematic dissection and optimization of inducible enhancers in human cells using a massively parallel reporter assay. Nat Biotech. 2012;30(3):271–277. doi: 10.1038/nbt.2137
    5. Patwardhan RP, Hiatt JB, Witten DM, Kim MJ, Smith RP, May D, et al. Massively parallel functional dissection of mammalian enhancers in vivo. Nat Biotech. 2012;30(3):265–270. doi: 10.1038/nbt.2136