Sci Rep. 2016 Jun 22;6:28517.
doi: 10.1038/srep28517.

PEDLA: predicting enhancers with a deep learning-based algorithmic framework

Feng Liu et al. Sci Rep. 2016.

Abstract

Transcriptional enhancers are non-coding segments of DNA that play a central role in the spatiotemporal regulation of gene expression programs. However, systematically and precisely predicting enhancers remains a major challenge. Although existing methods have achieved some success in enhancer prediction, they still suffer from many issues. We developed a deep learning-based algorithmic framework named PEDLA (https://github.com/wenjiegroup/PEDLA), which can directly learn an enhancer predictor from massively heterogeneous data and generalize in ways that are mostly consistent across various cell types/tissues. We first trained PEDLA with 1,114-dimensional heterogeneous features in H1 cells, and demonstrated that the PEDLA framework integrates diverse heterogeneous features and gives state-of-the-art performance relative to five existing methods for enhancer prediction. We further extended PEDLA to iteratively learn from 22 training cell types/tissues. Our results showed that PEDLA performed with superior consistency in both the training and independent test sets. On average, PEDLA achieved 95.0% accuracy and a 96.8% geometric mean (GM) of sensitivity and specificity across 22 training cell types/tissues, as well as 95.7% accuracy and a 96.8% GM across 20 independent test cell types/tissues. Together, our work illustrates the power of harnessing state-of-the-art deep learning techniques to consistently identify regulatory elements at a genome-wide scale from massively heterogeneous data across diverse cell types/tissues.
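The performance indicators named in the abstract (accuracy, sensitivity, specificity, their geometric mean, and the F1-score used in the figures) can be illustrated with a minimal Python sketch. The function name binary_metrics and the confusion-matrix counts are assumptions made for illustration; they are not taken from the PEDLA code base or from the paper's results.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard binary-classification indicators from confusion-matrix counts.

    Hypothetical helper: assumes both classes are non-empty and that at least
    one sample is predicted positive, so no denominator is zero.
    """
    sensitivity = tp / (tp + fn)          # fraction of true enhancers recovered
    specificity = tn / (tn + fp)          # fraction of non-enhancers rejected
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    gm = math.sqrt(sensitivity * specificity)  # geometric mean (GM)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "gm": gm,
        "f1": f1,
    }


# Made-up counts, purely illustrative.
metrics = binary_metrics(tp=90, fp=20, tn=180, fn=10)
print(metrics)
```

Unlike accuracy, the GM is the square root of a product, so it stays low whenever either sensitivity or specificity is low, which makes it informative on class-imbalanced data.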

Figures

Figure 1. Enhancer predictions using PEDLA with heterogeneous signatures and class-imbalanced data in H1 cells.
(A) Selecting the optimal structure of PEDLA for enhancer prediction using 1,114 heterogeneous signatures. Three performance indicators, accuracy, GM and F1-score, were measured using 5-fold cross-validation on both the training set and the test set. Error bars indicate the mean and standard deviation of each performance indicator. (B) Validation of enhancer predictions by distal DHSs and by binding sites of p300 and TFs (NANOG, OCT4 and SOX2), using the trained PEDLA with the optimal structure. The bar shows the actual validation rate, whereas the error bar shows the mean and standard deviation of validation rates for 10,000 randomly shuffled predictions. (C,D) Ability to handle class-imbalanced data without bias. Three performance indicators, sensitivity, specificity and GM, were measured for the training set (C) and test set (D) using 5-fold cross-validation based on the optimal structure of PEDLA with all 1,114-dimensional features. The numbers of enhancers, promoters and random regions not annotated as promoters or enhancers were kept at a ratio of 1:1:x (x = 1, 2, …, 9), such that the ratio between positive and negative samples was 1:(1 + x).
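The 1:1:x sampling ratio described for panels (C,D) can be sketched as follows. This is an illustrative assumption about how such a set might be assembled, not the authors' procedure; the function sample_imbalanced, its arguments and the toy region names are hypothetical.

```python
import random


def sample_imbalanced(enhancers, promoters, background, x, seed=0):
    """Build a labelled set with enhancers:promoters:background = 1:1:x.

    Enhancers are positives (label 1); promoters and background regions are
    negatives (label 0), so positives:negatives = 1:(1 + x).
    """
    n = len(enhancers)
    if len(promoters) < n or len(background) < n * x:
        raise ValueError("not enough negative regions for the requested ratio")
    rng = random.Random(seed)
    negatives = rng.sample(promoters, n) + rng.sample(background, n * x)
    samples = [(region, 1) for region in enhancers]
    samples += [(region, 0) for region in negatives]
    rng.shuffle(samples)
    return samples


# Toy region identifiers, purely illustrative.
enh = [f"enh_{k}" for k in range(10)]
pro = [f"pro_{k}" for k in range(20)]
bg = [f"bg_{k}" for k in range(100)]
data = sample_imbalanced(enh, pro, bg, x=3)
positives = sum(label for _, label in data)
print(positives, len(data) - positives)
```

With x = 3, the ten toy enhancers are paired with ten promoters and thirty background regions, giving a positive-to-negative ratio of 1:4.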
Figure 2. Training procedure of PEDLA in multiple human cell types.
(A) Schematic diagram showing the framework for training PEDLA for enhancer prediction in multiple human cell types/tissues. The whole training procedure comprises initial training and iterative training. (B) Pseudocode showing the detailed steps of training PEDLA in multiple human cell types/tissues.
Figure 3. Evaluation of enhancer predictions using PEDLA in multiple human cells/tissues.
(A) Evaluation of enhancer prediction using PEDLA in 22 training cell types/tissues. The whole training procedure was repeated with 50 random orders of the 22 training cell types/tissues, and each random order was repeated four times with random permutations of the training samples for each cell type/tissue. In total, the training of PEDLA was repeated 200 times, independently, on the 22 training cell types/tissues. For each repeat, the optimal trained model of PEDLA was saved for later evaluation for each of the 22 training cell types/tissues. Thus, 22 × 200 = 4,400 optimal models were generated. Each optimal model is identified by a pair of indices (1 ≤ j ≤ 22, 1 ≤ i ≤ 200) and is the model that finished training on the j-th training cell type/tissue in the i-th run of the 200 independent runs. (B,C) Performance evaluations of PEDLA in the training cell set and the independent test cell set along the training route. Three performance indicators, accuracy, GM, and F1-score, were assessed for PEDLA with the trained optimal model in the 22 training cell types/tissues and 20 test cell types/tissues, independently. (B) For a fixed j on the X-axis, all 200 optimal models for that j (1 ≤ i ≤ 200) were used to assess the performance indicators on the 22 training cell types/tissues. The red line represents the mean of the total 200 × 22 = 4,400 values of each performance indicator, and the light blue colour band indicates the 10th and 90th percentiles. (C) For a fixed j on the X-axis, all 200 optimal models for that j (1 ≤ i ≤ 200) were used to assess the performance indicators on the 20 test cell types/tissues. The red line represents the mean of the total 200 × 20 = 4,000 values of each performance indicator, and the light blue colour band indicates the 10th and 90th percentiles.
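The bookkeeping in this caption (50 random orders × 4 sample permutations = 200 runs, one saved model per cell type per run, and a mean with a 10th to 90th percentile band) can be checked with a short sketch. The names training_schedule and summarize, the seeds and the cell-type labels are hypothetical, and the percentile convention used by statistics.quantiles may differ from the one used for the published figure.

```python
import random
import statistics


def training_schedule(cell_types, n_orders=50, n_permutations=4, seed=0):
    """Enumerate independent runs: each random cell-type order is reused
    n_permutations times, each time with a different sample-shuffling seed."""
    rng = random.Random(seed)
    runs = []
    for order_id in range(n_orders):
        order = list(cell_types)
        rng.shuffle(order)
        for perm_id in range(n_permutations):
            runs.append({
                "order_id": order_id,
                "perm_id": perm_id,
                "order": tuple(order),
                "sample_seed": rng.randrange(2**32),
            })
    return runs


def summarize(values):
    """Return the mean and the 10th and 90th percentiles of an indicator."""
    deciles = statistics.quantiles(values, n=10)  # nine cut points
    return statistics.mean(values), deciles[0], deciles[-1]


cells = [f"cell_{k}" for k in range(1, 23)]  # 22 placeholder cell types
runs = training_schedule(cells)
print(len(runs), len(runs) * len(cells))     # number of runs, saved models
```

Applying summarize to the indicator values collected for a fixed j would give the red line and the edges of the light blue band for that position on the X-axis.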
Figure 4. Performance assessment of PEDLA with the best-trained model in multiple human cells/tissues.
(A,B) Performance assessment of PEDLA with the best-trained model in the training cell set and test cell set. Three performance indicators, accuracy, GM, and F1-score, were assessed for PEDLA with the best-trained model in the 22 training cell types/tissues (A) and 20 test cell types/tissues (B), independently. The best-trained model was the single best-performing model selected from the 200 optimal models (1 ≤ i ≤ 200) that finished training on the 22 training cell types/tissues. All enhancers were classified as “specific” or “common” based on the number of cell types/tissues in which they occurred. An enhancer that occurred in no more than 4 cell types/tissues was termed specific; otherwise, it was considered common.
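The specific/common rule stated in this caption amounts to a simple threshold on occurrence counts. A minimal sketch, assuming occurrence counts per enhancer are already available; the function name classify_enhancers and the enhancer identifiers are hypothetical.

```python
def classify_enhancers(occurrence_counts, max_specific=4):
    """Label an enhancer "specific" if it occurs in no more than max_specific
    cell types/tissues, and "common" otherwise."""
    return {
        enhancer: "specific" if count <= max_specific else "common"
        for enhancer, count in occurrence_counts.items()
    }


# Made-up occurrence counts, purely illustrative.
labels = classify_enhancers({"enh_a": 1, "enh_b": 4, "enh_c": 5})
print(labels)
```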
