Sci Rep. 2016 Jun 22:6:28517.

doi: 10.1038/srep28517.

PEDLA: predicting enhancers with a deep learning-based algorithmic framework

Feng Liu¹, Hao Li¹, Chao Ren¹, Xiaochen Bo¹, Wenjie Shu¹

PMID: 27329130
PMCID: PMC4916453
DOI: 10.1038/srep28517

Feng Liu et al. Sci Rep. 2016.

Affiliation

¹ Department of Biotechnology, Beijing Institute of Radiation Medicine, Beijing 100850, China.

Abstract

Transcriptional enhancers are non-coding segments of DNA that play a central role in the spatiotemporal regulation of gene expression programs. However, systematically and precisely predicting enhancers remains a major challenge. Although existing methods have achieved some success in enhancer prediction, they still suffer from many issues. We developed a deep learning-based algorithmic framework named PEDLA (https://github.com/wenjiegroup/PEDLA), which can directly learn an enhancer predictor from massively heterogeneous data and generalize in ways that are mostly consistent across various cell types/tissues. We first trained PEDLA with 1,114-dimensional heterogeneous features in H1 cells and demonstrated that the PEDLA framework integrates diverse heterogeneous features and gives state-of-the-art performance relative to five existing methods for enhancer prediction. We then extended PEDLA to learn iteratively from 22 training cell types/tissues. Our results showed that PEDLA performed with superior consistency in both the training and independent test sets. On average, PEDLA achieved 95.0% accuracy and a 96.8% geometric mean (GM) of sensitivity and specificity across 22 training cell types/tissues, as well as 95.7% accuracy and a 96.8% GM across 20 independent test cell types/tissues. Together, our work illustrates the power of harnessing state-of-the-art deep learning techniques to consistently identify regulatory elements at a genome-wide scale from massively heterogeneous data across diverse cell types/tissues.
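The performance indicators named in the abstract can be computed from a binary confusion matrix. The following is a minimal sketch, not code from the PEDLA repository: the function name `binary_metrics` and the example counts are illustrative assumptions, and GM is taken as the square root of sensitivity times specificity, following the abstract's description of it as their geometric mean.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Compute accuracy, sensitivity, specificity, GM and F1 from confusion counts.

    Assumes every denominator is non-zero (at least one positive sample,
    one negative sample and one positive prediction).
    """
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    gm = math.sqrt(sensitivity * specificity)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "gm": gm,
        "f1": f1,
    }

# Hypothetical counts, chosen only to illustrate the arithmetic.
example = binary_metrics(tp=90, fp=20, tn=180, fn=10)
```

Because GM multiplies sensitivity by specificity, it falls to zero whenever either class is missed entirely, whereas accuracy alone can stay high on imbalanced data.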

PubMed Disclaimer

Figures

**Figure 1. Enhancer predictions using PEDLA with heterogeneous signatures and class-imbalanced data in H1 cells.**
(A) Selecting the optimal structure of PEDLA for the purpose of enhancer prediction using 1,114 heterogeneous signatures. Three performance indicators (accuracy, GM and F1-score) were measured using 5-fold cross-validation in both the training set and the test set. The error bar indicates the mean and standard deviation of the performance indicator. (B) Validation of enhancer predictions by distal DHSs and by binding sites of p300 and TFs (NANOG, OCT4 and SOX2), using the trained PEDLA with the optimal structure. The bar shows the actual validation rate, whereas the error bar shows the mean and standard deviation of the validation rates for 10,000 randomly shuffled predictions. (**C,D**) Capability of handling class-imbalanced data without bias. Three performance indicators (sensitivity, specificity and GM) were measured for the training set (C) and the test set (D) using 5-fold cross-validation, based on the optimal structure of PEDLA with all 1,114-dimensional features. The numbers of enhancers, promoters and random regions not annotated as promoters or enhancers were maintained at a ratio of 1:1:x (x = 1, 2, …, 9), such that the ratio between positive and negative samples was 1:(1 + x).
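The class ratios described for panels C and D can be worked through numerically. This is a small sketch under stated assumptions: `class_counts` is a hypothetical helper, the 1,000 enhancers are an arbitrary illustrative number, and the "all-negative" baseline is added here only to show why accuracy alone is uninformative as the imbalance grows.

```python
def class_counts(n_enhancers, x):
    """Sample counts for enhancers : promoters : random regions = 1 : 1 : x."""
    if n_enhancers < 1 or x < 1:
        raise ValueError("n_enhancers and x must be positive")
    n_promoters = n_enhancers          # promoters match enhancers 1:1
    n_random = x * n_enhancers         # random regions scale with x
    n_negative = n_promoters + n_random
    total = n_enhancers + n_negative
    return {
        "positive": n_enhancers,
        "negative": n_negative,
        "positive_fraction": n_enhancers / total,
        # Accuracy of a trivial classifier that labels everything negative;
        # its sensitivity, and therefore its GM, would be 0.
        "all_negative_accuracy": n_negative / total,
    }

# x = 1, 2, ..., 9 as in the caption; 1,000 enhancers is an assumed size.
ratios = {x: class_counts(1000, x) for x in range(1, 10)}
```

At x = 9 the positives make up 1 in 11 samples, so a classifier that never predicts an enhancer would still reach about 91% accuracy, which is one reason a GM-style indicator is useful here.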

**Figure 2. Training procedure of PEDLA in multiple human cell types.**
(A) Schematic diagram showing the framework of training PEDLA for the purpose of enhancer prediction in multiple human cells/tissues. The whole training procedure comprises initial training and iterative training. (B) The pseudocode shows the detailed steps of training PEDLA in multiple human cells/tissues.
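The pseudocode of panel B is not reproduced on this page. As a rough illustration of the general pattern only (train on one cell type, then continue training on the next while keeping a snapshot after each), a sketch might look like the following. The names `train_iteratively`, `train_on` and `toy_train_on`, and the placeholder cell-type labels other than H1, are hypothetical and do not come from the PEDLA code.

```python
import copy

def train_iteratively(cell_types, initial_model, train_on):
    """Train on each cell type in order, saving a snapshot after each one.

    `train_on(model, cell_type)` is a caller-supplied function that returns
    the model after further training on that cell type.
    """
    model = initial_model
    snapshots = []
    for cell_type in cell_types:
        model = train_on(model, cell_type)
        # Copy so that later training cannot alter the saved snapshot.
        snapshots.append(copy.deepcopy(model))
    return snapshots

def toy_train_on(model, cell_type):
    """Toy stand-in for real training: the 'model' just records what it saw."""
    return model + [cell_type]

snapshots = train_iteratively(["H1", "cell_B", "cell_C"], [], toy_train_on)
```

Here the first iteration plays the role of the initial training and the remaining iterations the iterative training; a real implementation would replace `toy_train_on` with network training.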

**Figure 3. Evaluation of enhancer predictions using PEDLA in multiple human cells/tissues.**
(A) Evaluation of enhancer prediction using PEDLA in 22 training cell types/tissues. The whole training procedure was repeated with 50 random orders of the 22 training cell types/tissues, and each random order was repeated four times with random permutations of the training samples for each cell type/tissue. In total, the training of PEDLA was thus repeated independently 200 times on the 22 training cell types/tissues. In each repeat, the optimally trained model was saved for later evaluation after training on each of the 22 training cell types/tissues, giving 22 × 200 = 4,400 optimal models. Each model is indexed by (j, i), with 1 ≤ j ≤ 22 and 1 ≤ i ≤ 200: it is the optimal model that finished training on the j-th training cell type/tissue in the i-th of the 200 independent runs. (**B,C**) Performance evaluations of PEDLA in the training cell set and the independent test cell set along the training route. Three performance indicators (accuracy, GM and F1-score) were assessed independently for PEDLA with the trained optimal models in the 22 training cell types/tissues and the 20 test cell types/tissues. (B) For a fixed j on the X-axis, all 200 optimal models (1 ≤ i ≤ 200) were used to assess the performance indicators on the 22 training cell types/tissues. The red line represents the mean of the 200 × 22 = 4,400 values of each performance indicator, and the light blue band indicates the 10th and 90th percentiles. (C) For a fixed j on the X-axis, all 200 optimal models (1 ≤ i ≤ 200) were used to assess the performance indicators on the 20 test cell types/tissues. The red line represents the mean of the 200 × 20 = 4,000 values of each performance indicator, and the light blue band indicates the 10th and 90th percentiles.
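The bookkeeping in this caption (50 orders × 4 permutations = 200 runs, 22 saved models per run, and a mean with a 10th to 90th percentile band) can be checked with a short sketch. Everything here is illustrative: the seed, the variable names and the `summarize` helper are assumptions, and the scores are random placeholder numbers, not results from the paper.

```python
import random
import statistics

N_ORDERS, N_PERMUTATIONS, N_TRAIN = 50, 4, 22

rng = random.Random(0)  # fixed seed so the sketch is repeatable
runs = []
for order_id in range(N_ORDERS):
    order = list(range(1, N_TRAIN + 1))
    rng.shuffle(order)                      # one random order of the 22 cell types
    for perm_id in range(N_PERMUTATIONS):   # four sample permutations per order
        runs.append((order_id, perm_id, tuple(order)))

n_runs = len(runs)            # 50 * 4 = 200 independent runs
n_models = n_runs * N_TRAIN   # one saved model per cell type per run

def summarize(values):
    """Return the mean and the 10th and 90th percentiles of a list of scores."""
    deciles = statistics.quantiles(values, n=10)  # nine cut points
    return statistics.mean(values), deciles[0], deciles[-1]

# Placeholder scores only: for a fixed j, 200 models x 22 cell types = 4,400 values.
placeholder_scores = [rng.random() for _ in range(n_runs * N_TRAIN)]
mean_score, p10, p90 = summarize(placeholder_scores)
```

`statistics.quantiles` uses the "exclusive" method by default; other percentile conventions would shift the band slightly.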

**Figure 4. Performance assessment of PEDLA with the best-trained model in multiple human cells/tissues.**
(**A,B**) Performance assessment of PEDLA with the best-trained model in the training cell set and the test cell set. Three performance indicators (accuracy, GM and F1-score) were assessed independently for PEDLA with the best-trained model in the 22 training cell types/tissues (A) and the 20 test cell types/tissues (B). The best-trained model was the single best-performing model selected from the 200 optimal models (1 ≤ i ≤ 200) that had finished training on all 22 training cell types/tissues. All enhancers were classified as “specific” or “common” based on the number of cell types/tissues in which they occurred. An enhancer that occurred in no more than 4 cell types/tissues was termed specific; otherwise, it was considered common.
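The specific/common rule in this caption reduces to a single threshold on the number of cell types/tissues in which an enhancer occurs. A minimal sketch, assuming a hypothetical helper `label_enhancer` and made-up enhancer identifiers and counts:

```python
def label_enhancer(n_cell_types, threshold=4):
    """Label an enhancer by how many cell types/tissues it occurs in.

    Occurrence in no more than `threshold` cell types/tissues gives
    "specific"; anything above that gives "common".
    """
    if n_cell_types < 1:
        raise ValueError("an enhancer must occur in at least one cell type/tissue")
    return "specific" if n_cell_types <= threshold else "common"

# Hypothetical occurrence counts, for illustration only.
occurrences = {"enh_001": 1, "enh_002": 4, "enh_003": 5, "enh_004": 22}
labels = {name: label_enhancer(count) for name, count in occurrences.items()}
```

The boundary case matters: an enhancer found in exactly 4 cell types/tissues is still specific, and one found in 5 is common.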

Cited by

Pig-eRNAdb: a comprehensive enhancer and eRNA dataset of pigs.
Wang Y, Jin W, Pan X, Liao W, Shen Q, Cai J, Gong W, Tian Y, Xu D, Li Y, Li J, Gong J, Zhang Z, Yuan X. Sci Data. 2024 Feb 1;11(1):157. doi: 10.1038/s41597-024-02960-7. PMID: 38302497.
Deep learning approaches for noncoding variant prioritization in neurodegenerative diseases.
Lan AY, Corces MR. Front Aging Neurosci. 2022 Nov 18;14:1027224. doi: 10.3389/fnagi.2022.1027224. PMID: 36466610. Review.
Exploiting regulatory heterogeneity to systematically identify enhancers with high accuracy.
Arbel H, Basu S, Fisher WW, Hammonds AS, Wan KH, Park S, Weiszmann R, Booth BW, Keranen SV, Henriquez C, Shams Solari O, Bickel PJ, Biggin MD, Celniker SE, Brown JB. Proc Natl Acad Sci U S A. 2019 Jan 15;116(3):900-908. doi: 10.1073/pnas.1808833115. PMID: 30598455.
Genome-wide prediction of cis-regulatory regions using supervised deep learning methods.
Li Y, Shi W, Wasserman WW. BMC Bioinformatics. 2018 May 31;19(1):202. doi: 10.1186/s12859-018-2187-1. PMID: 29855387.
Regulatory elements in molecular networks.
Doane AS, Elemento O. Wiley Interdiscip Rev Syst Biol Med. 2017 May;9(3). doi: 10.1002/wsbm.1374. PMID: 28093886. Review.
