PLoS Comput Biol. 2014 Jun 26;10(6):e1003677.
doi: 10.1371/journal.pcbi.1003677. eCollection 2014 Jun.

Integrating diverse datasets improves developmental enhancer prediction

Genevieve D Erwin et al. PLoS Comput Biol. 2014.

Abstract

Gene-regulatory enhancers have been identified using various approaches, including evolutionary conservation, regulatory protein binding, chromatin modifications, and DNA sequence motifs. To integrate these different approaches, we developed EnhancerFinder, a two-step method for distinguishing developmental enhancers from the genomic background and then predicting their tissue specificity. EnhancerFinder uses a multiple kernel learning approach to integrate DNA sequence motifs, evolutionary patterns, and diverse functional genomics datasets from a variety of cell types. In contrast with prediction approaches that define enhancers based on histone marks or p300 sites from a single cell line, we trained EnhancerFinder on hundreds of experimentally verified human developmental enhancers from the VISTA Enhancer Browser. We comprehensively evaluated EnhancerFinder using cross validation and found that our integrative method improves the identification of enhancers over approaches that consider a single type of data, such as sequence motifs, evolutionary conservation, or the binding of enhancer-associated proteins. We find that VISTA enhancers active in embryonic heart are easier to identify than enhancers active in several other embryonic tissues, likely due to their uniquely high GC content. We applied EnhancerFinder to the entire human genome and predicted 84,301 developmental enhancers and their tissue specificity. These predictions provide specific functional annotations for large amounts of human non-coding DNA, and are significantly enriched near genes with annotated roles in their predicted tissues and lead SNPs from genome-wide association studies. We demonstrate the utility of EnhancerFinder predictions through in vivo validation of novel embryonic gene regulatory enhancers from three developmental transcription factor loci. 
Our genome-wide developmental enhancer predictions are freely available as a UCSC Genome Browser track, which we hope will enable researchers to further investigate questions in developmental biology.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1. Overview of the EnhancerFinder enhancer prediction pipeline.
In our two-step approach, regions of the genome are characterized by diverse features, such as their evolutionary conservation, regulatory protein binding, chromatin modifications, and DNA sequence patterns. For each step, appropriate positive (green) and negative (purple) training examples are provided as input to a multiple kernel learning (MKL) algorithm that produces a trained classifier. We used 10-fold cross validation to evaluate the performance of all classifiers. In Step 1, we trained a classifier to distinguish between known developmental enhancers from VISTA and the genomic background. In Step 2, we trained several classifiers to distinguish enhancers active in tissues of interest from those without activity in the tissue according to VISTA. We applied the trained enhancer classifier from Step 1 to the entire human genome to produce more than 80,000 developmental enhancer predictions. We then applied the tissue-specific enhancer classifiers from Step 2 to further refine our predictions.
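The construction of training sets for the two steps can be sketched as follows. This is a minimal illustration, not the authors' code: the `Region` record, its field names, and the toy regions are hypothetical, and Step 2 negatives are assumed here to be validated enhancers without activity in the tissue of interest.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A tested or background region (hypothetical record layout)."""
    name: str
    validated_enhancer: bool          # True for a VISTA-style positive
    tissues: frozenset = frozenset()  # tissues with observed activity


def step1_sets(tested, background):
    """Step 1: validated enhancers (positives) vs. genomic background (negatives)."""
    positives = [r for r in tested if r.validated_enhancer]
    return positives, list(background)


def step2_sets(tested, tissue):
    """Step 2: enhancers active in `tissue` vs. enhancers without activity there.

    Assumption: negatives are restricted to validated enhancers.
    """
    enhancers = [r for r in tested if r.validated_enhancer]
    positives = [r for r in enhancers if tissue in r.tissues]
    negatives = [r for r in enhancers if tissue not in r.tissues]
    return positives, negatives


# Toy data, invented for illustration only.
tested = [
    Region("r1", True, frozenset({"heart"})),
    Region("r2", True, frozenset({"brain", "limb"})),
    Region("r3", True, frozenset({"heart", "brain"})),
    Region("r4", False),
]
background = [Region("bg1", False), Region("bg2", False)]

pos1, neg1 = step1_sets(tested, background)
heart_pos, heart_neg = step2_sets(tested, "heart")
print(len(pos1), len(neg1), [r.name for r in heart_pos], [r.name for r in heart_neg])
```

Each pair of positive and negative sets would then be passed to a classifier trained separately for that step or tissue.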
Figure 2. Combining diverse data using EnhancerFinder improves the identification of developmental enhancers.
(A) Enhancer prediction strategies based on functional genomics data, evolutionary conservation, and DNA sequence motif patterns all perform well, but EnhancerFinder, which combines these data, provides significant improvement over each of them alone (p<2.0E-7 for all). (B) Each of the approaches from (A) predicts that somewhat different sets of the VISTA regions are enhancers. This suggests that complementary information is contained in each data source. EnhancerFinder (not shown), which combines them, captures many of the enhancers that are unique to each source; it predicts 25 of the 44 enhancers unique to Functional Genomics, 30 of the 76 unique to DNA Sequence Motifs, and 34 of the 111 unique to Evolutionary Conservation. (C) EnhancerFinder outperforms CLARE, a successful enhancer prediction method based on known regulatory motifs. We also evaluated the enhancer states predicted by ChromHMM and Segway, two unsupervised clustering methods that have been used to segment the genome into different functional states based on patterns in functional genomics data, though these methods were not applied to developmental contexts. The different X's represent state predictions based on data from different ENCODE cell types: GM12878 (blue), H1-hESC (violet), HepG2 (brown), HMEC (tan), HSMM (gray), HUVEC (light green), K562 (green), NHEK (orange), NHLF (light blue), and all contexts combined (red).
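The idea of combining data sources at the kernel level can be illustrated with a small sketch. Multiple kernel learning learns the kernel weights during training; the example below instead fixes the weights by hand, which is a simplifying assumption. The feature values, kernel choices, and function names are hypothetical and are not taken from EnhancerFinder.

```python
import math


def linear_kernel(X):
    """Gram matrix of dot products between feature vectors."""
    return [[sum(a * b for a, b in zip(x, y)) for y in X] for x in X]


def rbf_kernel(X, gamma=1.0):
    """Gram matrix of Gaussian (RBF) similarities."""
    return [
        [math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y))) for y in X]
        for x in X
    ]


def combine_kernels(kernels, weights):
    """Weighted sum of Gram matrices.

    Non-negative weights keep the combination a valid (positive
    semidefinite) kernel. Real MKL would learn these weights.
    """
    if len(kernels) != len(weights):
        raise ValueError("need one weight per kernel")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    n = len(kernels[0])
    return [
        [sum(w * K[i][j] for K, w in zip(kernels, weights)) for j in range(n)]
        for i in range(n)
    ]


# Three toy regions described by two hypothetical feature types.
motif_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
conservation_features = [[0.9], [0.1], [0.8]]

K_motif = linear_kernel(motif_features)
K_cons = rbf_kernel(conservation_features, gamma=2.0)
combined = combine_kernels([K_motif, K_cons], [0.5, 0.5])
print(combined[0])
```

The combined Gram matrix can be handed to any kernel classifier that accepts a precomputed kernel, so each data type contributes through its own similarity measure.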
Figure 3. Integrating diverse functional genomics data improves enhancer prediction.
(A) Considering functional genomics features from contexts and assays not directly associated with developmental enhancer activity (All Functional Genomics and Relevant Functional Genomics) improves the identification of developmental enhancers (p = 9.2E-9 and p = 2.7E-6, respectively, compared to Embryonic Functional Genomics only). (B) Combining available H3K4me1, p300, and H3K27ac data, which are commonly used in isolation to identify enhancers, in a linear SVM (Basic Functional Genomics) is better able to distinguish known developmental enhancers from the genomic background than considering each type of data alone (p<2E-7, for each). However, combining these marks still performs significantly worse than EnhancerFinder (Figure 2A; AUC = 0.96) and considering additional data as in (A).
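For readers unfamiliar with the AUC values quoted here, the following sketch computes the area under the ROC curve as the probability that a randomly chosen positive outscores a randomly chosen negative. The scores are invented for illustration, and the function name is an assumption rather than part of any published pipeline.

```python
def roc_auc(pos_scores, neg_scores):
    """AUC via pairwise comparison; ties between a positive and a negative count 0.5."""
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical classifier scores for held-out regions.
enhancer_scores = [0.9, 0.8, 0.4]
background_scores = [0.7, 0.3, 0.2]
auc = roc_auc(enhancer_scores, background_scores)
print(round(auc, 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of enhancers from background.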
Figure 4. Enhancers of heart expression are easier to identify than enhancers active in other tissues at E11.5.
(A) In Step 2 of our prediction pipeline, we trained EnhancerFinder using the same features as in Step 1 (Figure 1), but using VISTA enhancers active in a given tissue as positives and tested regions that did not show activity in the tissue as negatives. Heart enhancers were dramatically easier to distinguish from other enhancers than enhancers of expression in other tissues. The heart enhancers have significantly higher GC content than other enhancers and the genomic background. This and several other unique attributes may explain the ease of identifying them (Figures S7 and S8). In general, functional genomics data are the most informative data type for predicting enhancer tissue specificity (Table 1).
Figure 5. EnhancerFinder's two-step approach captures tissue-specific attributes of enhancers.
(A) The true overlap of human enhancers of brain, heart, and limb in the VISTA database. The vast majority of characterized enhancers are unique to one of these tissues at this stage. For example, of the 84 validated heart enhancers, 71 are unique to heart, five are shared with brain, four with limb, and four with both. (B) The predicted overlap of VISTA enhancers based on predictions made with a single training step using MKL with only enhancers of that tissue considered positives and the genomic background as negatives. This approach overestimates the number of enhancers active in multiple tissues. Each classifier mainly learns general attributes of enhancers, rather than tissue-specific attributes. (C) The predicted overlap based on EnhancerFinder's two-step approach. These predictions are much more tissue-specific and exhibit overlaps between tissues similar to the true values (A). Predicted tissue distributions are similar when the methods are applied to other genomic regions, as illustrated in our genome-wide predictions, but only predictions on VISTA enhancers are shown here to enable comparisons to the distribution for validated enhancers (A).
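The heart counts in (A) can be checked with simple set arithmetic. The sketch below builds placeholder identifiers that match the stated counts (71 + 5 + 4 + 4 = 84); the identifiers are synthetic, and brain-only and limb-only enhancers are omitted because their counts are not given in this caption.

```python
def heart_venn_regions(heart, brain, limb):
    """Split the heart set into its four Venn regions relative to brain and limb."""
    return {
        "heart_only": heart - brain - limb,
        "heart_brain_only": (heart & brain) - limb,
        "heart_limb_only": (heart & limb) - brain,
        "all_three": heart & brain & limb,
    }


# Synthetic identifiers sized to match the counts in the caption.
heart_only = {f"h{i}" for i in range(71)}
heart_brain = {f"hb{i}" for i in range(5)}
heart_limb = {f"hl{i}" for i in range(4)}
all_three = {f"hbl{i}" for i in range(4)}

heart = heart_only | heart_brain | heart_limb | all_three
brain = heart_brain | all_three  # brain-only enhancers omitted
limb = heart_limb | all_three    # limb-only enhancers omitted

regions = heart_venn_regions(heart, brain, limb)
counts = {name: len(members) for name, members in regions.items()}
print(len(heart), counts)
```

Applying the same function to predicted tissue sets gives the overlaps that panels (B) and (C) compare against the validated distribution.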
Figure 6. Predicted tissue-specific enhancers exhibit tissue-specific characteristics.
EnhancerFinder identifies thousands of novel high-confidence (FPR<0.05) heart, brain, and limb enhancers. These enhancers are enriched for tissue-specific GO Biological Processes. The five most enriched GO Biological Processes among genes near each enhancer set (as calculated using GREAT) are listed in the colored boxes. Nearly 90% of EnhancerFinder predicted heart, brain, and limb enhancers are unique to a single tissue. The larger number of high-confidence heart enhancers relative to brain and limb enhancers is the result of the superior performance of the heart classifier.
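One way to obtain a high-confidence set at FPR<0.05 is to pick the score cutoff from the scores of known negatives. The sketch below is an assumption about how such a cutoff could be chosen, not a description of the authors' procedure; the scores and function names are hypothetical.

```python
def false_positive_rate(neg_scores, threshold):
    """Fraction of negatives scoring strictly above the threshold."""
    return sum(s > threshold for s in neg_scores) / len(neg_scores)


def threshold_for_fpr(neg_scores, max_fpr=0.05):
    """Lowest observed negative score whose FPR is strictly below max_fpr."""
    if not neg_scores:
        raise ValueError("need at least one negative score")
    if max_fpr <= 0:
        raise ValueError("max_fpr must be positive")
    for t in sorted(set(neg_scores)):
        if false_positive_rate(neg_scores, t) < max_fpr:
            return t
    # Not reached: the highest negative score always gives an FPR of 0.


# Hypothetical classifier scores for 100 negative training regions.
negative_scores = [i / 100 for i in range(100)]
threshold = threshold_for_fpr(negative_scores, max_fpr=0.05)

# Hypothetical scores for genome-wide candidate regions.
candidate_scores = [0.99, 0.96, 0.95, 0.5]
high_confidence = [s for s in candidate_scores if s > threshold]
print(threshold, high_confidence)
```

A classifier that separates positives from negatives more cleanly leaves more candidates above this cutoff, which is consistent with the larger high-confidence heart set noted above.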
Figure 7. Four novel developmental enhancers near FOXC2.
This UCSC Genome Browser (http://genome.ucsc.edu) snapshot shows the genomic context of four candidate human enhancers tested in transgenic zebrafish. For each enhancer, we show a zebrafish image that is representative of the reproducible expression patterns. FOXC2 Enhancer Candidate 1 (F2EC-1) drives expression at 48 hpf in the eye and epidermis (arrows). F2EC-2 shows expression at 24 hpf in the forebrain, midbrain, and nerve. F2EC-3 drives expression at 48 hpf in the epidermis and heart. F2EC-4 shows expression at 48 hpf in the notochord, spinal cord, and heart. See Table S6 for the full list of tissues showing expression for each candidate enhancer, and Figure S10 for results on candidate enhancers near FOXC1.
Figure 8. A novel cranial nerve enhancer in the ZEB2 locus.
This UCSC Genome Browser snapshot shows a dense region of predicted enhancers in a 1.5 Mb window containing ZEB2 and part of the adjacent gene desert. Tracks give the locations of four human accelerated regions (HARs), two validated VISTA enhancers (hs407 and hs1802), and the E1 region recently shown to have postnatal enhancer activity. The inset shows a zoomed-in view of ZEB2 (hg19.chr2:145,100,000–145,425,000) along with summaries of several ENCODE functional genomics datasets and evolutionary conservation across placental mammals. We tested the predicted enhancer overlapping 2xHAR.240 for enhancer activity at E11.5 in transgenic mice. Both the human and chimp versions of this sequence drive consistent expression in the cranial nerve (Figure S11).
