PLoS One. 2019 Jun 17;14(6):e0218073. doi: 10.1371/journal.pone.0218073. eCollection 2019.

Deciphering regulatory DNA sequences and noncoding genetic variants using neural network models of massively parallel reporter assays

Rajiv Movva¹ ², Peyton Greenside³, Georgi K Marinov², Surag Nair⁴, Avanti Shrikumar⁴, Anshul Kundaje² ⁴

Affiliations

¹ The Harker School, San Jose, CA, United States of America.
² Department of Genetics, Stanford University, Stanford, CA, United States of America.
³ Biomedical Informatics Training Program, Stanford University, Stanford, CA, United States of America.
⁴ Department of Computer Science, Stanford University, Stanford, CA, United States of America.

PMID: 31206543
PMCID: PMC6576758
DOI: 10.1371/journal.pone.0218073

Rajiv Movva et al. PLoS One. 2019.

Abstract

The relationship between noncoding DNA sequence and gene expression is not well understood. Massively parallel reporter assays (MPRAs), which quantify the regulatory activity of large libraries of DNA sequences in parallel, are a powerful approach to characterize this relationship. We present MPRA-DragoNN, a convolutional neural network (CNN)-based framework to predict and interpret the regulatory activity of DNA sequences as measured by MPRAs. While our method is generally applicable to a variety of MPRA designs, here we trained our model on the Sharpr-MPRA dataset that measures the activity of ∼500,000 constructs tiling 15,720 regulatory regions in human K562 and HepG2 cell lines. MPRA-DragoNN predictions were moderately correlated (Spearman ρ = 0.28) with measured activity and were within range of replicate concordance of the assay. State-of-the-art model interpretation methods revealed high-resolution predictive regulatory sequence features that overlapped transcription factor (TF) binding motifs. We used the model to investigate the cell type and chromatin state preferences of predictive TF motifs. We explored the ability of our model to predict the allelic effects of regulatory variants in an independent MPRA experiment and to fine-map putative functional SNPs in loci associated with lipid traits. Our results suggest that interpretable deep learning models trained on MPRA data have the potential to reveal meaningful patterns in regulatory DNA sequences and prioritize regulatory genetic variants, especially as larger, higher-quality datasets are produced.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Fig 1. Predicting regulatory activity in MPRAs using convolutional neural networks.**
**(A)** Outline of the design of Sharpr-MPRA experiments used in this study. A collection of DNA constructs is cloned into a plasmid library upstream of a promoter (magenta) and transfected into a population of cells. Each construct is linked to a unique barcode (BC) located in the transcribed region; measuring the abundance of these barcodes using high-throughput sequencing allows for evaluation of the regulatory activity of each construct. **(B)** In the Sharpr-MPRA design, 145 bp-long 5-bp tilings of each of ∼15,000 candidate 295 bp *cis*-regulatory elements are cloned upstream of either a minimal promoter (minP) or a strong promoter (SV40P). **(C)** Reproducibility between individual replicate Sharpr-MPRA measurements of regulatory activity (shown is data for K562 cells using the minP promoter). **(D)** Overview of the MPRA-DragoNN convolutional multi-task neural network architecture. The genomic DNA sequence for each tested MPRA construct is transformed from nucleotides (in ACGT alphabet) to a 145 × 4 one-hot encoded array. Three convolution layers and a fully-connected (FC) layer are then applied to predict four tasks (regulatory activity for the two cell lines with each of the two promoters). Each convolutional layer consists of 120 filters of length 5 (rectangles) that move along the sequence, searching for specific patterns of length 5 at every possible position. The first convolutional layer can be interpreted as identifying individual DNA sequence recognition motifs, such as those recognized by transcription factors. The second convolutional layer combines nearby potentially interacting motifs, while the third layer abstracts higher-order grammars (positioning, spacing, and other meta-features). Finally, the FC layer synthesizes these patterns with cell type– and promoter–specific information to make activity predictions.
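
The tiling arithmetic in (B) and the input encoding in (D) can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not code from MPRA-DragoNN: the function names are hypothetical, the A, C, G, T column order is assumed from the caption's "ACGT alphabet", and the handling of ambiguous bases (all-zero rows) is an assumption.

```python
# Illustrative sketch only; helper names are hypothetical and not from the paper.
ALPHABET = "ACGT"  # column order assumed from the caption's "ACGT alphabet"


def one_hot(seq):
    """Encode a DNA string as a len(seq) x 4 nested list of 0/1 values.

    Bases outside ACGT (e.g. N) become all-zero rows; that choice is an
    assumption made for this sketch.
    """
    return [[1 if base == letter else 0 for letter in ALPHABET]
            for base in seq.upper()]


def tile_starts(region_len=295, tile_len=145, step=5):
    """Start offsets of tiles of length tile_len spaced step bp apart."""
    return list(range(0, region_len - tile_len + 1, step))


if __name__ == "__main__":
    starts = tile_starts()
    # 145 bp tiles at 5 bp steps across a 295 bp region give 31 tiles.
    print(len(starts), "tiles per region")
    # 15,720 regions x 31 tiles = 487,320 constructs, consistent with the
    # "~500,000 constructs" quoted in the abstract if every region is fully tiled.
    print(15720 * len(starts), "constructs in total")
    encoded = one_hot("A" * 145)
    print(len(encoded), "x", len(encoded[0]), "input array")
```

Each 145 × 4 array produced this way corresponds to one tested construct, which is the input shape the caption describes for the first convolutional layer.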

**Fig 2. MPRA-DragoNN distinguishes active regulatory sequences at high resolution.**
**(A)** Predicted regulatory activity z-scores vs. experimental activity z-scores for the K562 minP task. **(B)** Distributions of experimental and predicted regulatory activities for different ChromHMM-inferred chromatin states. **(C)** K562 DeepLIFT nucleotide score track for a strongly activating regulatory sequence (top 0.1%) containing three TF binding sites (red) as identified by the CENTIPEDE algorithm. All three TFBSs are detected with statistical significance (Mann-Whitney U test). **(D)** Nucleotides with strong (in absolute value) DeepLIFT scores are more likely to overlap with TF binding sites than control sequences (blue: all nucleotides, green: DNase peak centers). This trend holds for both positive (R = 0.99) and negative scores (R = −0.94).

**Fig 3. MPRA-DragoNN DeepLIFT feature importance scores robustly predict functional nucleotides.**
**(A)** Overlap between significant motif instances (Benjamini-Hochberg FDR < 0.1) identified by DeepLIFT and Sharpr. **(B)** Scatter plot of average DeepLIFT scores for 1934 motifs in HepG2 (x-axis) and K562 (y-axis) [39]. Orange points are discussed in the text. **(D)** Sharpr-MPRA nucleotide score distributions for (i) motifs that are also DeepLIFT hits, (ii) all motifs, and (iii) negative control shuffled motifs. **(C)** Distributions of average DeepLIFT motif scores for ETS, HNF4, REST, and their respective control motifs (shuffled versions) in both K562 and HepG2. ***p < 10⁻²⁰⁰; n.s., not significant. **(E)** Positional distribution of DeepLIFT scores (left) and Sharpr scores (right) with respect to the center of ETS motif occurrences. Note that the DeepLIFT plot x-axis ranges from -50 bp to 50 bp while the Sharpr plot ranges from -100 bp to 100 bp. All p-values are computed with the Mann-Whitney U test.
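
For readers unfamiliar with the Benjamini-Hochberg threshold used in (A), the step-up procedure can be sketched as follows. This is a generic textbook version in plain Python, not the authors' implementation; the function name and the example p-values are made up for illustration.

```python
# Generic Benjamini-Hochberg step-up procedure; not taken from the paper's code.
def benjamini_hochberg(pvalues, fdr=0.1):
    """Return a list of booleans marking which hypotheses are rejected.

    Finds the largest rank k (1-based, p-values sorted ascending) such that
    p_(k) <= k / m * fdr, then rejects every hypothesis with rank <= k.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * fdr:
            cutoff_rank = rank  # keep the largest rank that passes
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    # Hypothetical motif p-values, for illustration only.
    print(benjamini_hochberg([0.001, 0.04, 0.03, 0.5], fdr=0.1))
```

Because the procedure is step-up, a p-value that misses its own rank threshold is still rejected whenever a larger p-value further down the sorted list passes.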

**Fig 4. MPRA-DragoNN reveals patterns of transcription factor activity.**
**(A)** For each TF, we computed the ratio of average “usage” in promoter sequences relative to enhancer sequences. The plot contains z-scores of this ratio for 27 selected transcription factors, colored by their motif family (left). **(B)** Clustered correlation matrix of TF usage for the 27 factors from (A). Each cell is colored according to the motif usage Spearman correlation for a given pair of TFs across all ∼974,000 sequences. Rows are colored by their motif family.

**Fig 5. Variant in-silico mutagenesis scores agree with experimental data.**
**(A)** Regulatory activity changes between reference and mutated sequences predicted by MPRA-DragoNN agree with experimentally measured changes [36]. Red points indicate variants that were significant in the wild-type K562 condition (see description in Methods). **(B)** Detailed examination of a particular variant, rs2269907 at chromosome 17 position 44,294,214. The distribution of epigenetic marks and JunD ChIP-seq signal [2] around the variant reveals that it lies in an active region [51]; the variant appears to correct a mismatch in one of the base pairs of a JunD motif, allowing JunD to bind and regulate expression.
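
An in-silico mutagenesis score of the kind compared in (A) is commonly computed as the difference between a model's prediction for the mutated sequence and for the reference sequence. The sketch below assumes that definition; the paper's Methods may differ in detail. All names are hypothetical, and `toy_predict` is a stand-in for a trained model that simply counts a short motif chosen for illustration.

```python
# Illustrative sketch; `toy_predict` stands in for a trained model and is not
# MPRA-DragoNN. The score definition (alt minus ref) is an assumption.
def substitute(seq, pos, allele):
    """Return seq with the base at 0-based position pos replaced by allele."""
    return seq[:pos] + allele + seq[pos + 1:]


def ism_score(predict, ref_seq, pos, alt):
    """Predicted activity of the alternate sequence minus the reference."""
    return predict(substitute(ref_seq, pos, alt)) - predict(ref_seq)


def saturation_mutagenesis(predict, seq):
    """Score every possible single-base substitution in seq."""
    baseline = predict(seq)
    return {
        (pos, alt): predict(substitute(seq, pos, alt)) - baseline
        for pos, ref in enumerate(seq)
        for alt in "ACGT"
        if alt != ref
    }


def toy_predict(seq):
    """Toy scoring function: number of GGAA occurrences in the sequence."""
    return float(seq.count("GGAA"))


if __name__ == "__main__":
    ref = "TTGGTATT"
    # Changing T to A at position 4 creates a GGAA match in the toy model.
    print(ism_score(toy_predict, ref, 4, "A"))
    print(len(saturation_mutagenesis(toy_predict, ref)), "substitutions scored")
```

With a real model, `predict` would take the one-hot encoded construct and return the predicted activity for one task, so the same two functions apply unchanged.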

**Fig 6. Dissecting rs174593, a putative causal variant for reduced LDL cholesterol levels.**
**(A)** Volcano plot with in-silico mutagenesis scores on the x-axis and negative log GWAS p-values on the y-axis. Putatively causal variants in the upper right and upper left regions localize to cardiovascular disease-related genes. **(B)** DeepLIFT track and saturation mutagenesis scores of the locus surrounding rs174593, a potential *FADS2* *cis*-regulatory element. As highlighted by DeepLIFT, the C allele at that position creates an ELK1 motif match, increasing predicted *FADS2* expression compared to the T allele.

Cited by

layerUMAP: A tool for visualizing and understanding deep learning models in biological sequence classification using UMAP.
Jing R, Xue L, Li M, Yu L, Luo J. iScience. 2022 Nov 7;25(12):105530. doi: 10.1016/j.isci.2022.105530. eCollection 2022 Dec 22. PMID: 36425757. Free PMC article.
A review of deep learning applications in human genomics using next-generation sequencing data.
Alharbi WS, Rashid M. Hum Genomics. 2022 Jul 25;16(1):26. doi: 10.1186/s40246-022-00396-x. PMID: 35879805. Free PMC article. Review.
Massively parallel characterization of psychiatric disorder-associated and cell-type-specific regulatory elements in the developing human cortex.
Deng C, Whalen S, Steyert M, Ziffra R, Przytycki PF, Inoue F, Pereira DA, Capauto D, Norton S, Vaccarino FM, Pollen A, Nowakowski TJ, Ahituv N, Pollard KS. bioRxiv [Preprint]. 2023 Feb 16:2023.02.15.528663. doi: 10.1101/2023.02.15.528663. Update in: Science. 2024 May 24;384(6698):eadh0559. doi: 10.1126/science.adh0559. PMID: 36824845. Free PMC article. Preprint.
Machine-guided design of cell-type-targeting cis-regulatory elements.
Gosai SJ, Castro RI, Fuentes N, Butts JC, Mouri K, Alasoadura M, Kales S, Nguyen TTL, Noche RR, Rao AS, Joy MT, Sabeti PC, Reilly SK, Tewhey R. Nature. 2024 Oct;634(8036):1211-1220. doi: 10.1038/s41586-024-08070-z. Epub 2024 Oct 23. PMID: 39443793. Free PMC article.
Defining the fine structure of promoter activity on a genome-wide scale with CISSECTOR.
FitzPatrick VD, Leemans C, van Arensbergen J, van Steensel B, Bussemaker HJ. Nucleic Acids Res. 2023 Jun 23;51(11):5499-5511. doi: 10.1093/nar/gkad232. PMID: 37013986. Free PMC article.

References

1. Lee TI, Young RA. Transcriptional Regulation and Its Misregulation in Disease. Cell. 2013;152(6):1237–1251. doi: 10.1016/j.cell.2013.02.014
2. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489(7414):57–74. doi: 10.1038/nature11247
3. Roadmap Epigenomics Consortium, Kundaje A, Meuleman W, Ernst J, Bilenky M, Yen A, et al. Integrative analysis of 111 reference human epigenomes. Nature. 2015;518(7539):317–330. doi: 10.1038/nature14248
4. Melnikov A, Murugan A, Zhang X, Tesileanu T, Wang L, Rogov P, et al. Systematic dissection and optimization of inducible enhancers in human cells using a massively parallel reporter assay. Nat Biotech. 2012;30(3):271–277. doi: 10.1038/nbt.2137
5. Patwardhan RP, Hiatt JB, Witten DM, Kim MJ, Smith RP, May D, et al. Massively parallel functional dissection of mammalian enhancers in vivo. Nat Biotech. 2012;30(3):265–270. doi: 10.1038/nbt.2136
