2014 Dec 2;15:1047.
doi: 10.1186/1471-2164-15-1047.

De novo prediction of cis-regulatory elements and modules through integrative analysis of a large number of ChIP datasets

Meng Niu et al. BMC Genomics.

Abstract

Background: In eukaryotes, transcriptional regulation is usually mediated by interactions of multiple transcription factors (TFs) with their respective specific cis-regulatory elements (CREs) in the so-called cis-regulatory modules (CRMs) in DNA. Although knowledge of the CREs and CRMs in a genome is crucial for elucidating gene regulatory networks and understanding many important biological phenomena, little is known about the CREs and CRMs in most eukaryotic genomes, owing to the difficulty of characterizing them by either computational or traditional experimental methods. However, the exponentially increasing volume of TF binding location data produced by the recent wide adoption of chromatin immunoprecipitation coupled with microarray hybridization (ChIP-chip) or high-throughput sequencing (ChIP-seq) has provided an unprecedented opportunity to identify CRMs and CREs in genomes. Nonetheless, effectively mining these large volumes of ChIP data to identify CREs and CRMs at nucleotide resolution remains a highly challenging task.

Results: We have developed a novel graph-theoretic algorithm, DePCRM, for genome-wide de novo prediction of CREs and CRMs using a large number of ChIP datasets. DePCRM predicts CREs and CRMs by identifying overrepresented combinatorial CRE motif patterns in multiple ChIP datasets. When applied to 168 ChIP datasets of 56 TFs from D. melanogaster, DePCRM identified 184 overrepresented CRE motifs and 746 overrepresented combinatorial patterns of these motifs, and predicted a total of 115,932 CRMs in the genome. The predictions recover 77.9% of known CRMs in the datasets and 89.3% of known CRMs containing at least one predicted CRE. We found that the putative CRMs, as well as the CREs in a CRM taken as a whole, are more conserved than randomly selected sequences.

Conclusion: Our results suggest that the CRMs predicted by DePCRM are highly likely to be functional. Our algorithm is the first of its kind for de novo genome-wide prediction of CREs and CRMs using a large number of transcription factor ChIP datasets. The algorithm and predictions will hopefully facilitate the elucidation of gene regulatory networks in eukaryotes. All the predicted CREs, CRMs, and their target genes are available at http://bioinfo.uncc.edu/mniu/pcrms/www/.

Figures

Figure 1
A schematic view of our hypothesis. If a binding peak is shorter than 3,000 bp, we extend it equally at both ends to a length of up to 3,000 bp. We assume that, in addition to the CREs of the ChIP-ed TF (red circle), CREs of different cooperative TFs (the other shapes) are also enriched in the neighborhoods of at least some subsets of the binding peak dataset. Each line represents an extended binding peak sequence.
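The peak extension described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `extend_peak`, the 0-based half-open coordinates, and the clamping at chromosome boundaries are all assumptions, since the caption states only that short peaks are extended equally from both ends up to 3,000 bp.

```python
def extend_peak(start, end, target_len=3000, chrom_len=None):
    """Extend a peak [start, end) equally on both sides up to target_len bp.

    Assumptions (not stated in the source): coordinates are 0-based and
    half-open, and the extension is clamped at chromosome boundaries
    rather than shifted.
    """
    length = end - start
    if length >= target_len:
        # Peaks that are already long enough are left unchanged.
        return start, end
    pad = target_len - length
    left = pad // 2          # half of the padding goes to the left
    right = pad - left       # the remainder goes to the right
    new_start = max(0, start - left)
    new_end = end + right
    if chrom_len is not None:
        new_end = min(chrom_len, new_end)
    return new_start, new_end


# A 500 bp peak grows to 3,000 bp, centered on the original peak.
print(extend_peak(5000, 5500))
```

Near a chromosome end the clamped result is shorter than 3,000 bp, which is consistent with the caption's wording "up to 3,000 bp".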
Figure 2
A schematic of the major steps of the DePCRM algorithm. A. Illustration of extended binding peaks from datasets d1, d2, and d3, respectively. B. Illustration of CREs found within each dataset; CREs of the same motif are shown in the same shape and color. C. Construction of the CP similarity graph. {P1, P2, P3, P4}, {P5, P6, P7}, and {P8, P9, P10} are sets of CPs found in datasets d1, d2, and d3, respectively. For clarity, the CPs formed between motifs in P1 and motifs in P2, and so on, in the datasets are not shown. Each CP (represented as a rectangle) is a node of the multipartite similarity graph, and two nodes are linked by an edge if and only if their S_s ≥ β, with S_s being the edge weight, which is not shown for clarity. D. By removing the dotted edges in panel C, MCL cuts the graph into five CP clusters (CPCs): C1 = {P1, P5, P8}, C2 = {P2, P6}, C3 = {P3, P9}, C4 = {P4, P7}, and C5 = {P10}. CPs in a cluster are connected by edges of the same color. The singleton cluster C5 = {P10} is discarded for its low density. E. For each pair Ci and Cj from the four CPCs, we find sets of CPs from the same dataset d_k and compute a co-occurring score S_CPC(Ci, Cj) for the two CPCs. F. Construction of the CPC co-occurring graph using the four CPCs. Cutting the graph using MCL results in two CRMCs, {C1, C2, C3} and {C4}. G. After merging motifs into unique motifs (Umotifs), we project the CREs of the CRMCs onto the genome and predict the CRMs.
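The edge rule in panel C (link two CPs if and only if S_s ≥ β) can be illustrated with a small sketch. The function `build_cp_graph`, the input layout, and the toy scores below are hypothetical; skipping pairs from the same dataset reflects one reading of "multipartite", and the clustering step itself (MCL in the paper) is not reproduced here.

```python
def build_cp_graph(cp_dataset, scores, beta):
    """Build a weighted, undirected CP similarity graph as an adjacency dict.

    cp_dataset: maps each CP name to the dataset it was found in.
    scores:     maps a (CP, CP) pair to its similarity score S_s.
    beta:       cutoff; an edge is kept if and only if S_s >= beta.

    Assumption: edges are only allowed between CPs from different
    datasets, which is how "multipartite" is interpreted here.
    """
    graph = {cp: {} for cp in cp_dataset}
    for (u, v), s in scores.items():
        if cp_dataset[u] == cp_dataset[v]:
            continue  # same partition: no edge in a multipartite graph
        if s >= beta:
            graph[u][v] = s
            graph[v][u] = s
    return graph


# Toy example; the scores are made up for illustration only.
cp_dataset = {"P1": "d1", "P2": "d1", "P5": "d2", "P8": "d3"}
scores = {
    ("P1", "P5"): 1.8,
    ("P1", "P8"): 1.5,
    ("P5", "P8"): 1.4,
    ("P2", "P5"): 0.9,   # below the cutoff, so no edge
    ("P1", "P2"): 2.0,   # same dataset, so no edge
}
graph = build_cp_graph(cp_dataset, scores, beta=1.36)
print(graph["P1"])
```

The resulting adjacency dict could then be handed to any graph clustering routine; the paper uses MCL for that step.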
Figure 3
Analysis of the original datasets and motif finding results. A. Distribution of the binding peak lengths in the 168 original datasets. The vast majority (99.38%) of the binding peaks are shorter than 5,000 bp. B. Number of motifs found in each of the 168 datasets as a function of the number of binding peaks in the datasets. C. Distribution of the information content of the predicted motifs in the datasets. D. The rank of the ChIP-ed TF’s motif among the predicted motifs in the 99 datasets in which the motifs of the ChIP-ed TFs can be identified. The diamond on the bar indicates the rank of the ChIP-ed TF’s motif among the predicted motifs in the dataset. The higher the position of the diamond, the higher the rank of the target TF’s motif.
Figure 4
Coverage of the datasets, predicted CREs and CRMs on the genome and its CDRs and NCRs. *, the numbers above a line (sequence category) are the percentages of the CDRs and NCRs in the category. **, the numbers below a line are the percentages of CDRs and NCRs of the category with respect to the entire CDRs and NCRs in the genome. ***, the number on the right of a line is the percentage of the category with respect to the genome.
Figure 5
Overlapping analysis of the datasets. A. Hierarchical clustering of the 168 datasets for 56 TFs based on their pair-wise binding peak overlapping scores S_o. The blow-ups show two clusters for cooperative TFs (see Results). B. The motifs of TFs KR and HB are both found in the overlapping datasets GSM511084_Dmel-KR1 ChIP-ed by KR and GSM511081_Dmel-HB1 ChIP-ed by HB.
Figure 6
Setting the S_c cutoff α. A. Distribution of co-occurring scores S_c of the motif pairs found in the 16 datasets. The curve is a fit of the left portion of the distribution to a Gaussian distribution N(μ = 0.19, σ = 0.067). B. The remaining proportions of predicted motifs and known CRMs as functions of the S_c cutoff α. The vertical line indicates the position of the chosen cutoff α = 0.7 for selecting co-occurring motif pairs (CPs).
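To give a feel for how far the cutoff α = 0.7 lies from the fitted Gaussian N(μ = 0.19, σ = 0.067), the short calculation below computes its z-score and the corresponding upper-tail probability. Treating the fitted Gaussian as a model of the low-scoring bulk of motif pairs is an assumption made here for illustration; the caption does not say how the cutoff was derived from the fit.

```python
import math


def gaussian_upper_tail(x, mu, sigma):
    """Return P(X >= x) for X ~ N(mu, sigma^2), via the complementary error function."""
    z = (x - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))


mu, sigma = 0.19, 0.067   # Gaussian fit reported in the caption
alpha = 0.7               # chosen S_c cutoff

z = (alpha - mu) / sigma
tail = gaussian_upper_tail(alpha, mu, sigma)
print(f"z-score of alpha: {z:.2f}")
print(f"upper-tail probability under the fit: {tail:.3g}")
```

The cutoff sits roughly 7.6 standard deviations above the fitted mean, so under this reading a pair scoring at or above α is very unlikely to belong to the Gaussian component.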
Figure 7
Setting the S_s cutoff β and the S_CPC cutoff γ. A. The density of the CP similarity graph drops rapidly as the S_s cutoff β increases, but the decrease slows down around β = 1.36. B. The number of CRMs in the graph also starts to drop rapidly around β = 1.36. Thus we set β = 1.36 for constructing the final CP similarity graph. C. The distribution of CPC co-occurring scores S_CPC is well separated into a low-scoring component and a high-scoring component. The vertical line indicates the S_CPC cutoff γ = 0.69 at the deepest valley between the two peaks, used for constructing the CPC co-occurring graph.
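Choosing a cutoff at the deepest valley between two peaks of a bimodal score distribution, as described for γ in panel C, can be sketched with a simple histogram. This is only an illustration under stated assumptions: the function `valley_cutoff`, the bin count, the score range, and the toy scores are all hypothetical, and the source does not describe how the valley was actually located.

```python
def valley_cutoff(scores, n_bins=10, lo=0.0, hi=1.0):
    """Return the center of the emptiest histogram bin between the two tallest peaks.

    Assumes all scores lie in [lo, hi] and that the histogram is bimodal.
    """
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for s in scores:
        idx = min(int((s - lo) / width), n_bins - 1)
        counts[idx] += 1

    # Local maxima: non-empty bins at least as tall as both neighbors.
    peaks = [
        i for i in range(n_bins)
        if counts[i] > 0
        and (i == 0 or counts[i] >= counts[i - 1])
        and (i == n_bins - 1 or counts[i] >= counts[i + 1])
    ]
    if len(peaks) < 2:
        raise ValueError("histogram does not have two peaks")

    # Keep the two tallest peaks and order them left to right.
    left, right = sorted(sorted(peaks, key=lambda i: counts[i], reverse=True)[:2])

    # The valley is the lowest bin between the two peaks.
    valley = min(range(left, right + 1), key=lambda i: counts[i])
    return lo + (valley + 0.5) * width


# Toy bimodal scores (made up): a low component near 0.25, a high one near 0.85.
toy_scores = (
    [0.15] * 2 + [0.25] * 5 + [0.35] * 3 + [0.45] * 2 + [0.55] * 1
    + [0.75] * 2 + [0.85] * 4 + [0.95] * 1
)
cutoff = valley_cutoff(toy_scores)
print(f"valley cutoff: {cutoff:.2f}")
```

With real data, a smoothed density estimate would be less sensitive to the bin width than a raw histogram; the fixed-bin version is used here only to keep the example short.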
Figure 8
Summary of the predicted CRMs. A. Distribution of the lengths of the known and predicted CRMs. B. Distribution of the distances (bp) between two adjacent CREs in the known and predicted CRMs. C. Recovery rates of the known CRMs in the datasets (1330) and of the known CRMs containing a predicted CRE (1061), by the predicted CRMs and by the same number of length-matched sequences randomly selected from NCRs.
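A recovery rate of the kind reported in panel C can be computed by checking each known CRM for overlap with the predicted set. The sketch below assumes that a known CRM counts as recovered if it overlaps any predicted CRM by at least 1 bp on the same chromosome; the source does not state the overlap criterion, and the function names and toy intervals are hypothetical.

```python
def overlaps(a, b):
    """True if intervals a and b (chrom, start, end), half-open, share at least 1 bp."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def recovery_rate(known, predicted):
    """Fraction of known intervals overlapped by at least one predicted interval."""
    recovered = sum(
        1 for k in known if any(overlaps(k, p) for p in predicted)
    )
    return recovered / len(known)


# Toy intervals (made up) as (chromosome, start, end), 0-based half-open.
known = [
    ("chr2L", 100, 500),
    ("chr2L", 1000, 1500),
    ("chr3R", 200, 400),
    ("chrX", 50, 80),
]
predicted = [
    ("chr2L", 450, 700),    # overlaps the first known CRM
    ("chr3R", 100, 250),    # overlaps the third known CRM
    ("chr2L", 1500, 1600),  # only touches the second known CRM, no overlap
]
rate = recovery_rate(known, predicted)
print(f"recovery rate: {rate:.1%}")
```

Running the same function on length-matched random sequences gives the background rate against which the predictions are compared. The all-pairs scan is quadratic; for genome-scale sets, sorting the intervals or using an interval tree would be the usual choice.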
Figure 9
Conservation analysis of the CRMs. A. Distribution of average phastCons scores of the predicted CRMs in NCRs and of the same number of length-matched sequences randomly selected from NCRs. The vertical dashed lines indicate the phastCons score cutoffs for highly conserved (≥0.98) and non-conserved (≤0.02) CRMs. B. Distribution of average phastCons scores of all putative CREs in a predicted CRM in NCRs and of the same number of length-matched sequences randomly selected from NCRs. C. Distribution of average phastCons scores of single predicted CREs in NCRs and of the same number of length-matched sequences randomly selected from NCRs. D. Distribution of average phastCons scores of single predicted CREs in CDRs and of the same number of length-matched sequences randomly selected from CDRs. E. Distribution of average phastCons scores of the non-redundant original binding peaks in NCRs and of the same number of length-matched sequences randomly selected from NCRs. F. Distribution of average phastCons scores of single predicted CREs in the original binding peaks in NCRs and of the same number of length-matched sequences randomly selected from the original binding peaks in NCRs.
