Proc Natl Acad Sci U S A. 2019 Jan 15;116(3):900-908.
doi: 10.1073/pnas.1808833115. Epub 2018 Dec 31.

Exploiting regulatory heterogeneity to systematically identify enhancers with high accuracy

Hamutal Arbel et al. Proc Natl Acad Sci U S A. 2019.

Abstract

Identifying functional enhancer elements in metazoan systems is a major challenge. Large-scale validation of enhancers predicted by ENCODE reveals false-positive rates of at least 70%. We used the pregastrula-patterning network of Drosophila melanogaster to demonstrate that loss in accuracy in held-out data results from heterogeneity of functional signatures in enhancer elements. We show that at least two classes of enhancers are active during early Drosophila embryogenesis and that by focusing on a single, relatively homogeneous class of elements, greater than 98% prediction accuracy can be achieved in a balanced, completely held-out test set. The class of well-predicted elements is composed predominantly of enhancers driving multistage segmentation patterns, which we designate segmentation driving enhancers (SDEs). Prediction is driven by the DNA occupancy of early developmental transcription factors, with almost no additional power derived from histone modifications. We further show that improved accuracy is not a property of a particular prediction method: after conditioning on the SDE set, naïve Bayes and logistic regression perform as well as more sophisticated tools. Applying this method to a genome-wide scan, we predict 1,640 SDEs that cover 1.6% of the genome. An analysis of 32 SDEs, chosen throughout our prediction rank-list, using whole-mount embryonic imaging of stably integrated reporter constructs showed that >90% drove expression patterns. We achieved 86.7% precision on a genome-wide scan, with an estimated recall of at least 98%, indicating high accuracy and completeness in annotating this class of functional elements.

Keywords: Drosophila; embryo development; enhancers; machine learning; random forests.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Fig. 1.
(A) RF ROC curves for the complete dataset of 7,987 previously validated genomic regions (blue) show mediocre performance, with an AUC of 0.83. When only class I enhancers and nonenhancers are used for training, the predictive power rises sharply, to an AUC of 0.99 (yellow). When only class II enhancers and nonenhancers are used, the result is close to a random guess (gray). When predicting the class I enhancer set, the ROC curves for RFs, logistic regression, and a naïve Bayes classifier nearly overlap. (B) This can be explained by the colocalization of class II enhancers and nonenhancers in a PCA projection. (C) The separation is mainly driven by TFs, as exemplified by the normalized ChIP strength across features of 200 randomly selected class I and class II enhancers.
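
For readers less familiar with the AUC values quoted in this caption, the following is a minimal sketch of how an area under the ROC curve can be computed from classifier scores. It is not the authors' code: the function name `roc_auc` and the toy scores are hypothetical, and the calculation uses the standard rank-based identity that the AUC equals the probability that a randomly chosen positive is scored above a randomly chosen negative (ties counted as one half).

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: predicted enhancer probabilities (hypothetical toy values here).
    labels: 1 for enhancer, 0 for nonenhancer.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Perfect separation, in the spirit of the class I result.
perfect = roc_auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
# Uninformative scores, in the spirit of the class II result.
chance = roc_auc([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
print(perfect, chance)
```

Under this definition a value near 0.5 corresponds to the "random guess" behavior described for class II enhancers, and a value near 1 to the class I result.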
Fig. 2.
False-positive rate is a function of method accuracy and imbalance in the test data. (A) A 3D surface plot shows a sharp increase in the test-set false-positive rate as either the training-set false-positive rate or the fraction of nonenhancer regions in the test set increases. This shows that in genomic settings, where the imbalance cannot be controlled, a very high degree of accuracy is required. (B and C) Two-dimensional plots of the marginals of the 3D image in A, demonstrating the sharp rise in test inaccuracy with either the false-positive rate in the training set or dilution of the enhancer class in the test set.
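
The dependence described in this caption can be illustrated with a short calculation. This is a sketch under stated assumptions, not the authors' model: it assumes the test-set false-positive rate means the fraction of predicted enhancers that are not enhancers, and that the classifier's per-class error rates carry over unchanged from training to test. The function name and the example rates are hypothetical.

```python
def false_fraction_of_predictions(fpr, tpr, neg_fraction):
    """Fraction of predicted enhancers that are actually nonenhancers.

    fpr: probability that a nonenhancer is called an enhancer.
    tpr: probability that a true enhancer is called an enhancer.
    neg_fraction: fraction of test regions that are nonenhancers.
    All three inputs are illustrative assumptions.
    """
    false_calls = fpr * neg_fraction
    true_calls = tpr * (1.0 - neg_fraction)
    if false_calls + true_calls == 0:
        raise ValueError("no regions are predicted as enhancers")
    return false_calls / (false_calls + true_calls)


# Same classifier (1% per-class false-positive rate, 98% recall),
# evaluated on a balanced set and on a heavily imbalanced one.
balanced = false_fraction_of_predictions(0.01, 0.98, 0.5)
imbalanced = false_fraction_of_predictions(0.01, 0.98, 0.984)
print(round(balanced, 4), round(imbalanced, 4))
```

With these illustrative inputs, the same classifier goes from roughly 1% false calls on a balanced set to well over a third when nonenhancers make up 98.4% of the test regions, which is the qualitative effect the surface plot describes.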
Fig. 3.
Examples of reporter gene-expression patterns driven by (A) class I enhancers, (B) class II enhancers, and (C) genome regions misclassified by Kvon et al. (34) as nonenhancers in stages 4–6. Magnification is 20×, and the embryos are 0.5 mm in length on average.
Fig. 4.
(A) Histogram of RF predicted enhancer probabilities for the entire genome. While >82% of the genome has P < 0.01, a secondary peak can be seen at P ∼ 0.95 (Inset). (B–F) As validation, predicted enhancers were inserted into the Drosophila genome and were found to drive spatial expression. (G and H) Two enhancers, CEP01219 and CEP01220, are predicted proximal to the comm2 gene. Each of their patterns is a component of the comm2 expression pattern (I). (J) The genomic region of the two predicted enhancers is shown, along with the raw prediction track showing the predicted probability of enhancer activity at 100-bp resolution and the sum of TF binding ChIP scores at the same resolution. Magnification is 20×, and the embryos are 0.5 mm in length on average.
Fig. 5.
The significance (measured as the negative log of the P value) of GO-term enrichment in genes proximal to class I enhancers is very high in terms associated with development and segmentation (SDEs, yellow). For class II enhancers, no significant GO-term enrichment (P value below 10⁻⁵) is found (non-SDEs, blue).
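
As a rough illustration of the kind of score plotted in this figure, the sketch below computes an enrichment P value and its negative logarithm. The caption does not say which test or which logarithm base was used, so both choices here are assumptions: a one-sided hypergeometric test, which is a common choice for GO-term enrichment, and base 10. The function names and gene counts are hypothetical.

```python
from math import comb, log10


def enrichment_p_value(overlap, sample_size, annotated, total):
    """One-sided hypergeometric P value, P(X >= overlap).

    total: number of genes considered (hypothetical background).
    annotated: genes in that background carrying the GO term.
    sample_size: genes proximal to the enhancer class of interest.
    overlap: genes in the sample that carry the GO term.
    """
    upper = min(annotated, sample_size)
    favourable = sum(
        comb(annotated, i) * comb(total - annotated, sample_size - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / comb(total, sample_size)


def significance(p_value):
    """Negative base-10 logarithm of a P value (the base is an assumption)."""
    return -log10(p_value)


# Toy example: every one of 5 sampled genes carries a term held by 5 of 10 genes.
p = enrichment_p_value(5, 5, 5, 10)
print(p, significance(p))
```

On this scale, the caption's threshold of a P value below 10⁻⁵ corresponds to a significance score above 5.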
Fig. 6.
(A) Feature importance is dominated by transcription factors, with H3K4me1 the only histone mark in the top 25. (B–F) “Local importance” measurements of randomly selected segments, indicating how important each feature was in the segment classification when the forest was trained on (B) SDEs vs. nonenhancers, (C) SDEs vs. non-SDEs, (D) non-SDEs vs. nonenhancers, (E) SDEs and non-SDEs vs. nonenhancers, and (F) SDEs vs. non-SDEs and nonenhancers. Feature order (x axis) can be found in SI Appendix.

