De novo prediction of cis-regulatory elements and modules through integrative analysis of a large number of ChIP datasets

BMC Genomics. 2014 Dec 2:15:1047. doi: 10.1186/1471-2164-15-1047.

Meng Niu, Ehsan S Tabari, Zhengchang Su¹

Affiliation

¹ Department of Bioinformatics and Genomics, College of Computing and Informatics, The University of North Carolina at Charlotte, 9201 University City Blvd, Charlotte, NC 28223, USA. zcsu@uncc.edu.

PMID: 25442502
PMCID: PMC4265420
DOI: 10.1186/1471-2164-15-1047

Abstract

Background: In eukaryotes, transcriptional regulation is usually mediated by interactions of multiple transcription factors (TFs) with their specific cis-regulatory elements (CREs), which are organized into cis-regulatory modules (CRMs) in DNA. Although knowledge of the CREs and CRMs in a genome is crucial for elucidating gene regulatory networks and understanding many important biological phenomena, little is known about them in most eukaryotic genomes because they are difficult to characterize by either computational or traditional experimental methods. However, the exponentially increasing volume of TF binding location data produced by the recent wide adoption of chromatin immunoprecipitation coupled with microarray hybridization (ChIP-chip) or high-throughput sequencing (ChIP-seq) has provided an unprecedented opportunity to identify CRMs and CREs in genomes. Nonetheless, effectively mining these large volumes of ChIP data to identify CREs and CRMs at nucleotide resolution remains highly challenging.

Results: We have developed a novel graph-theoretic algorithm, DePCRM, for genome-wide de novo prediction of CREs and CRMs using a large number of ChIP datasets. DePCRM predicts CREs and CRMs by identifying overrepresented combinatorial CRE motif patterns across multiple ChIP datasets. When applied to 168 ChIP datasets for 56 TFs from D. melanogaster, DePCRM identified 184 overrepresented CRE motifs and 746 combinatorial patterns of these motifs, and predicted a total of 115,932 CRMs in the genome. The predictions recover 77.9% of the known CRMs in the datasets and 89.3% of the known CRMs containing at least one predicted CRE. We found that the putative CRMs, as well as the CREs within a CRM taken as a whole, are more conserved than randomly selected sequences.

Conclusion: Our results suggest that the CRMs predicted by DePCRM are highly likely to be functional. Our algorithm is the first of its kind for de novo genome-wide prediction of CREs and CRMs using a large number of transcription factor ChIP datasets. We hope that the algorithm and predictions will facilitate the elucidation of gene regulatory networks in eukaryotes. All the predicted CREs, CRMs, and their target genes are available at http://bioinfo.uncc.edu/mniu/pcrms/www/.

Figures

**Figure 1**
**A schematic view of our hypothesis.** If a binding peak is shorter than 3,000 bp, we extend it equally from both ends to a length of 3,000 bp. We assume that, in addition to the CREs of the ChIP-ed TF (red circle), CREs of different cooperative TFs (the other shapes) are also enriched in the neighborhoods of at least some subsets of the binding peaks in a dataset. Each line represents an extended binding peak sequence.
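The peak extension described in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `extend_peak`, the half-open coordinate convention, and the handling of chromosome ends (shifting the window back inside the chromosome) are all assumptions, since the caption states only that short peaks are extended equally from both ends to 3,000 bp.

```python
def extend_peak(start, end, target_len=3000, chrom_len=None):
    """Pad a half-open peak [start, end) equally on both sides up to target_len bp.

    Hypothetical helper: peaks already at least target_len long are returned
    unchanged. Boundary handling is an assumption, not taken from the paper.
    """
    length = end - start
    if length >= target_len:
        return start, end

    pad = target_len - length
    left = pad // 2          # an odd remainder goes to the right side
    right = pad - left
    new_start = start - left
    new_end = end + right

    # Assumed behavior: slide the window back inside the chromosome.
    if new_start < 0:
        new_end -= new_start
        new_start = 0
    if chrom_len is not None and new_end > chrom_len:
        shift = new_end - chrom_len
        new_start = max(0, new_start - shift)
        new_end = chrom_len
    return new_start, new_end


mid_peak = extend_peak(10000, 10500)                 # 500 bp peak, padded by 1,250 bp per side
edge_peak = extend_peak(100, 600)                    # would run past position 0
long_peak = extend_peak(0, 4000)                     # already longer than 3,000 bp
end_peak = extend_peak(9000, 9500, chrom_len=10000)  # would run past the chromosome end
```

In this sketch a peak near a chromosome end still yields a 3,000 bp window when the chromosome is long enough; whether the original pipeline truncates or shifts such windows is not stated in the caption.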

**Figure 2**
**A schematic of the major steps of the DePCRM algorithm.** A. Illustration of extended binding peaks from datasets d₁, d₂ and d₃, respectively. B. Illustration of CREs found within each dataset; CREs of the same motif are shown in the same shape and color. C. Construction of the CP similarity graph. {P₁, P₂, P₃, P₄}, {P₅, P₆, P₇} and {P₈, P₉, P₁₀} are the sets of CPs found in datasets d₁, d₂ and d₃, respectively. For clarity, the CPs formed between motifs in P₁ and motifs in P₂, and so on, in the datasets are not shown. Each CP (represented as a rectangle) is a node of the multipartite similarity graph, and two nodes are linked by an edge if and only if their S_s ≥ β, with S_s being the edge weight (not shown for clarity). D. By removing the dotted edges in panel C, MCL cuts the graph into five CP clusters (CPCs): C₁ = {P₁, P₅, P₈}, C₂ = {P₂, P₆}, C₃ = {P₃, P₉}, C₄ = {P₄, P₇} and C₅ = {P₁₀}. CPs in a cluster are connected by edges of the same color. The singleton cluster C₅ = {P₁₀} is discarded for its low density. E. For each pair C_i and C_j of the four remaining CPCs, we find the sets of CPs from the same dataset d_k and compute a co-occurring score S_CPC(C_i, C_j) for the two CPCs. F. Construction of the CPC co-occurring graph using the four CPCs. Cutting the graph using MCL results in two CRMCs, {C₁, C₂, C₃} and {C₄}. G. After merging motifs into unique motifs (Umotifs), we project the CREs of the CRMCs onto the genome and predict the CRMs.
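Panels C and D can be illustrated with a small sketch. Two things here are assumptions rather than the paper's method: edges are allowed only between CPs from different datasets (one reading of the graph being multipartite), and plain connected components stand in for MCL. MCL can additionally cut weak edges inside a connected component, which this stand-in cannot do. The scores and the helper names `build_cp_graph` and `connected_components` are made up for illustration.

```python
from collections import defaultdict


def build_cp_graph(cp_dataset, scored_pairs, beta):
    """Link two CPs when they come from different datasets and S_s >= beta.

    cp_dataset: dict mapping CP name -> dataset id.
    scored_pairs: iterable of (cp_u, cp_v, S_s) triples (toy values here).
    """
    graph = defaultdict(dict)
    for u, v, score in scored_pairs:
        if cp_dataset[u] != cp_dataset[v] and score >= beta:
            graph[u][v] = score
            graph[v][u] = score
    return graph


def connected_components(nodes, graph):
    """Simplified stand-in for MCL: group nodes reachable from one another."""
    seen, clusters = set(), []
    for node in nodes:
        if node in seen:
            continue
        stack, component = [node], set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(n for n in graph.get(current, {}) if n not in component)
        seen |= component
        clusters.append(component)
    return clusters


# Toy input loosely modeled on the caption; all scores are invented.
cp_dataset = {"P1": "d1", "P2": "d1", "P5": "d2", "P6": "d2", "P8": "d3", "P10": "d3"}
scored_pairs = [
    ("P1", "P5", 1.8),
    ("P5", "P8", 1.5),
    ("P2", "P6", 1.4),
    ("P1", "P2", 2.0),   # same dataset, so no edge under the multipartite assumption
    ("P6", "P10", 0.9),  # below beta, so no edge
]
graph = build_cp_graph(cp_dataset, scored_pairs, beta=1.36)
all_clusters = connected_components(cp_dataset, graph)
# Drop singletons, mirroring the caption's removal of the singleton cluster.
clusters = [c for c in all_clusters if len(c) > 1]
```

With these toy scores the surviving clusters are {P1, P5, P8} and {P2, P6}, and P10 is discarded as a singleton.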

**Figure 3**
**Analysis of the original datasets and motif finding results.** A. Distribution of the binding peak lengths in the 168 original datasets. The vast majority (99.38%) of the binding peaks are shorter than 5,000 bp. B. Number of motifs found in each of the 168 datasets as a function of the number of binding peaks in the dataset. C. Distribution of the information content of the predicted motifs in the datasets. D. The rank of the ChIP-ed TF’s motif among the predicted motifs in the 99 datasets in which the motif of the ChIP-ed TF can be identified. The diamond on each bar indicates the rank of the ChIP-ed TF’s motif among the predicted motifs in that dataset. The higher the position of the diamond, the higher the rank of the target TF’s motif.

**Figure 4**
**Coverage of the datasets, predicted CREs and CRMs on the genome and its CDRs and NCRs.** *, the numbers above a line (sequence category) are the percentages of the CDRs and NCRs in the category. **, the numbers below a line are the percentages of CDRs and NCRs of the category with respect to the entire CDRs and NCRs in the genome. ***, the number on the right of a line is the percentage of the category with respect to the genome.

**Figure 5**
**Overlapping analysis of the datasets.** A. Hierarchical clustering of the 168 datasets for 56 TFs based on their pairwise binding peak overlapping scores S_o. The blow-ups show two clusters for cooperative TFs (see Results). B. The motifs of TFs KR and HB are both found in the overlapping datasets GSM511084_Dmel-KR1 ChIP-ed by KR and GSM511081_Dmel-HB1 ChIP-ed by HB.

**Figure 6**
**Setting the S_c cutoff α.** A. Distribution of co-occurring scores S_c of the motif pairs found in the 16 datasets. The curve is a fit of the left portion of the distribution to a Gaussian distribution N(μ = 0.19, σ = 0.067). B. The remaining proportions of predicted motifs and known CRMs as functions of the S_c cutoff α. The vertical line indicates the position of the chosen cutoff α = 0.7 for selecting co-occurring motif pairs (CPs).
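To see how far the chosen cutoff sits from the fitted curve, one can compute its distance from the Gaussian mean in standard deviations. The sketch below uses only the values in the caption (μ = 0.19, σ = 0.067, α = 0.7). Treating the fitted Gaussian as a background model for non-cooperating motif pairs is an assumption made here for illustration; the caption says only that the left portion of the distribution was fitted.

```python
import math


def gaussian_upper_tail(x, mu, sigma):
    """Return (z, P[X >= x]) for X ~ N(mu, sigma^2)."""
    z = (x - mu) / sigma
    tail = 0.5 * math.erfc(z / math.sqrt(2))
    return z, tail


z, tail = gaussian_upper_tail(0.7, mu=0.19, sigma=0.067)
print(f"alpha is {z:.2f} standard deviations above the fitted mean")
print(f"upper-tail probability under the fitted Gaussian: {tail:.2e}")
```

The cutoff lies about 7.6 standard deviations above the fitted mean, so under this reading very few pairs drawn from the fitted component would pass it.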

**Figure 7**
**Setting the S_s cutoff β and the S_CPC cutoff γ.** A. The density of the CP similarity graph drops rapidly as the S_s cutoff β increases, but the decrease slows down around β = 1.36. B. The number of CRMs in the graph also starts to drop rapidly around β = 1.36. We therefore set β = 1.36 for constructing the final CP similarity graph. C. The distribution of CPC co-occurring scores S_CPC is well separated into a low-scoring component and a high-scoring component. The vertical line indicates the S_CPC cutoff γ = 0.69, at the deepest valley between the two peaks, used for constructing the CPC co-occurring graph.
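Picking a cutoff at the deepest valley between two peaks of a bimodal histogram, as in panel C, can be sketched like this. The histogram below is invented, the function name `deepest_valley` is hypothetical, and the approach assumes the histogram is already smooth enough that its two tallest local maxima are the two modes; a noisy real histogram would likely need smoothing first.

```python
def deepest_valley(bin_centers, counts):
    """Return the center of the lowest bin between the two tallest local peaks."""
    last = len(counts) - 1
    peaks = [
        i for i in range(len(counts))
        if (i == 0 or counts[i] > counts[i - 1])
        and (i == last or counts[i] >= counts[i + 1])
    ]
    if len(peaks) < 2:
        raise ValueError("histogram does not have two separate peaks")

    # Keep the two tallest peaks, then restore left-to-right order.
    tallest = sorted(peaks, key=lambda i: counts[i], reverse=True)[:2]
    left, right = sorted(tallest)

    valley = min(range(left + 1, right), key=lambda i: counts[i])
    return bin_centers[valley]


# Invented bimodal histogram of scores in [0, 1].
bin_centers = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
counts = [5, 40, 90, 60, 20, 8, 3, 12, 35, 15]
gamma = deepest_valley(bin_centers, counts)
```

For this toy histogram the peaks are at 0.25 and 0.85, and the lowest bin between them is centered at 0.65.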

**Figure 8**
**Summary of the predicted CRMs.** A. Distribution of the lengths of the known and predicted CRMs. B. Distribution of the distances (bp) between two adjacent CREs in the known and predicted CRMs. C. Recovery rates of the known CRMs in the datasets (1330) and of the known CRMs containing a predicted CRE (1061), by the predicted CRMs and by the same number of sequences of matching lengths randomly selected from NCRs.
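A recovery rate of the kind reported in panel C can be computed by checking each known CRM for overlap with any predicted CRM. The sketch below assumes that a known CRM counts as recovered when it overlaps a prediction by at least 1 bp on the same chromosome; the caption does not state the overlap criterion, so this threshold, the half-open coordinates, and the toy intervals are all assumptions.

```python
def overlaps(a, b):
    """True if half-open intervals (chrom, start, end) share at least 1 bp."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def recovery_rate(known, predicted):
    """Fraction of known intervals overlapped by at least one predicted interval."""
    by_chrom = {}
    for interval in predicted:
        by_chrom.setdefault(interval[0], []).append(interval)
    recovered = sum(
        1 for k in known
        if any(overlaps(k, p) for p in by_chrom.get(k[0], []))
    )
    return recovered / len(known)


# Toy coordinates, not taken from the paper.
known_crms = [("2L", 100, 500), ("2L", 1000, 1500), ("3R", 200, 700), ("X", 50, 90)]
predicted_crms = [("2L", 450, 800), ("3R", 0, 250), ("X", 90, 200)]
rate = recovery_rate(known_crms, predicted_crms)
```

The same function applied to randomly placed intervals of matching number and length would give the control rate that the caption compares against.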

**Figure 9**
**Conservation analysis of the CRMs.** A. Distribution of average phastCons scores of the predicted CRMs in NCRs and of the same number of sequences of matching lengths randomly selected from NCRs. The vertical dashed lines indicate the phastCons score cutoffs for highly conserved (≥0.98) and non-conserved (≤0.02) CRMs. B. Distribution of average phastCons scores of all putative CREs in a predicted CRM in NCRs and of the same number of sequences of matching lengths randomly selected from NCRs. C. Distribution of average phastCons scores of single predicted CREs in NCRs and of the same number of sequences of matching lengths randomly selected from NCRs. D. Distribution of average phastCons scores of single predicted CREs in CDRs and of the same number of sequences of matching lengths randomly selected from CDRs. E. Distribution of average phastCons scores of the non-redundant original binding peaks in NCRs and of the same number of sequences of matching lengths randomly selected from NCRs. F. Distribution of average phastCons scores of single predicted CREs in the original binding peaks in NCRs and of the same number of sequences of matching lengths randomly selected from the original binding peaks in NCRs.
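The averaging and thresholding behind these panels can be sketched as below. The cutoffs (≥0.98 highly conserved, ≤0.02 non-conserved) come from the caption; representing per-base phastCons scores as a plain Python list, the function names, and the label "intermediate" for everything in between are assumptions made for illustration.

```python
def mean_score(per_base_scores, start, end):
    """Average per-base score over the half-open interval [start, end)."""
    window = per_base_scores[start:end]
    if not window:
        raise ValueError("empty interval")
    return sum(window) / len(window)


def conservation_class(avg, high=0.98, low=0.02):
    """Label an average score using the cutoffs given in the caption."""
    if avg >= high:
        return "highly conserved"
    if avg <= low:
        return "non-conserved"
    return "intermediate"


# Toy per-base scores for a 30 bp stretch; real scores would come from a phastCons track.
toy_scores = [1.0] * 10 + [0.0] * 10 + [0.5] * 10
labels = [
    conservation_class(mean_score(toy_scores, start, start + 10))
    for start in (0, 10, 20)
]
```

Running the same two functions over predicted elements and over matched random sequences would give the paired distributions that each panel compares.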

Cited by

Maps of context-dependent putative regulatory regions and genomic signal interactions.
Diamanti K, Umer HM, Kruczyk M, Dąbrowski MJ, Cavalli M, Wadelius C, Komorowski J. Nucleic Acids Res. 2016 Nov 2;44(19):9110-9120. doi: 10.1093/nar/gkw800. Epub 2016 Sep 12. PMID: 27625394
Accurate prediction of cis-regulatory modules reveals a prevalent regulatory genome of humans.
Ni P, Su Z. NAR Genom Bioinform. 2021 Jun 17;3(2):lqab052. doi: 10.1093/nargab/lqab052. eCollection 2021 Jun. PMID: 34159315
Towards a map of cis-regulatory sequences in the human genome.
Niu M, Tabari E, Ni P, Su Z. Nucleic Acids Res. 2018 Jun 20;46(11):5395-5409. doi: 10.1093/nar/gky338. PMID: 29733395
Modeling the cis-regulatory modules of genes expressed in developmental stages of Drosophila melanogaster.
López Y, Vandenbon A, Nose A, Nakai K. PeerJ. 2017 May 30;5:e3389. doi: 10.7717/peerj.3389. eCollection 2017. PMID: 28584716
REDfly: the transcriptional regulatory element database for Drosophila.
Rivera J, Keränen SVE, Gallo SM, Halfon MS. Nucleic Acids Res. 2019 Jan 8;47(D1):D828-D834. doi: 10.1093/nar/gky957. PMID: 30329093
