Comparative Study

PLoS Comput Biol. 2010 Dec 2;6(12):e1001020.

doi: 10.1371/journal.pcbi.1001020.

Assessing computational methods of cis-regulatory module prediction

Jing Su¹, Sarah A Teichmann, Thomas A Down

PMID: 21152003
PMCID: PMC2996316
DOI: 10.1371/journal.pcbi.1001020

Jing Su et al. PLoS Comput Biol. 2010.

Affiliation

¹ MRC Laboratory of Molecular Biology, Cambridge, United Kingdom.

Abstract

Computational methods attempting to identify instances of cis-regulatory modules (CRMs) in the genome face a challenging problem of searching for potentially interacting transcription factor binding sites while knowledge of the specific interactions involved remains limited. Without a comprehensive comparison of their performance, the reliability and accuracy of these tools remain unclear. Faced with a large number of different tools that address this problem, we summarized and categorized them based on search strategy and input data requirements. Twelve representative methods were chosen and applied to predict CRMs from the Drosophila CRM database REDfly, and across the human ENCODE regions. Our results show that the optimal choice of method varies depending on species and composition of the sequences in question. When discriminating CRMs from non-coding regions, those methods considering evolutionary conservation have a stronger predictive power than methods designed to be run on a single genome. Different CRM representations and search strategies rely on different CRM properties, and different methods can complement one another. For example, some favour homotypical clusters of binding sites, while others perform best on short CRMs. Furthermore, most methods appear to be sensitive to the composition and structure of the genome to which they are applied. We analyze the principal features that distinguish the methods that performed well, identify weaknesses leading to poor performance, and provide a guide for users. We also propose key considerations for the development and evaluation of future CRM-prediction methods.
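The abstract describes discriminating CRMs from background sequences, and the figure captions rank methods by Area Under Curve (AUC) scores. The following is a minimal sketch of how such a ranking could be computed, not the authors' evaluation pipeline. The function names `roc_auc` and `rank_methods`, the method labels, and the toy scores are all hypothetical and exist only for illustration. The AUC is computed as the probability that a randomly chosen CRM receives a higher score than a randomly chosen background sequence.

```python
from typing import Dict, List, Sequence, Tuple


def roc_auc(positive_scores: Sequence[float],
            negative_scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison (Mann-Whitney U).

    Returns the fraction of (positive, negative) pairs in which the
    positive scores higher; ties contribute one half.
    """
    if len(positive_scores) == 0 or len(negative_scores) == 0:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


def rank_methods(
    scores: Dict[str, Tuple[Sequence[float], Sequence[float]]]
) -> List[Tuple[str, float]]:
    """Rank methods by AUC, best first.

    `scores` maps a method label to (scores on CRMs, scores on background).
    """
    aucs = [(name, roc_auc(pos, neg)) for name, (pos, neg) in scores.items()]
    return sorted(aucs, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Invented toy scores for two hypothetical methods; not data from the study.
    toy = {
        "method_a": ([0.9, 0.4], [0.5, 0.1]),
        "method_b": ([0.9, 0.8], [0.2, 0.1]),
    }
    for name, auc in rank_methods(toy):
        print(f"{name}: AUC = {auc:.2f}")
```

The pairwise formulation is quadratic in the number of sequences, which is acceptable for a sketch; a rank-based implementation would be preferable for large test sets.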

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Figure 1. Schematic representation of cis-regulatory modules.**
A cis-regulatory module contains multiple binding sites for multiple transcription factors within a compact sequence interval. The binding affinity and orientation of each binding site, the spacing and cooperative relationships between binding sites, and the distance from the cis-regulatory module to the transcription start site of the gene it regulates may all be important properties of a given cis-regulatory module.

**Figure 2. Classification of search strategies.**
Search strategies for the CRM prediction methods can be broadly subdivided into four families: window clustering, probabilistic modelling, phylogenetic footprinting, and discriminative modelling.

**Figure 3. Properties of CRM prediction methods.**

**Figure 4. Ranking of methods (short introns and exons).**
A. Predictions of CRMs against short introns. There are two ROC curves for each method, one for 500 bp and one for 200 bp window size. B. Predictions of CRMs against exons. There are two ROC curves for each method, one for 500 bp and one for 200 bp window size. C. Ranking of methods by Area Under Curve scores.

**Figure 5. Ranking of methods (medium length introns and intergenic regions).**
A. Predictions of CRMs against medium length introns. There are two ROC curves for each method, one for 500 bp and one for 200 bp window size. B. Predictions of CRMs against intergenic regions. There are two ROC curves for each method, one for 500 bp and one for 200 bp window size. C. The Area Under Curve scores of the assessed methods.

**Figure 6. Correlations and complementarity of methods.**
A. Correlation coefficients of predictions on CRMs. B. Performance of pairs of methods. C. Improvement made by combining pairs of methods: StubbMS_w200 and StubbMS_w500, CisPlusFinder_w200 and MSCAN_w200 to StubbMS_w200 and StubbMS_w500.

**Figure 7. Correlation coefficients between predictions and sequence features.**
A. Correlation coefficients between predictions and sequence conservation, with 95% bootstrap confidence interval. B. Correlation coefficients between predictions and sequence lengths, with 95% bootstrap confidence interval.

**Figure 8. Correlation coefficients between predictions and CRM properties.**
A. Correlation between predictions and the total number of TFBSs, with 95% bootstrap confidence interval. B. Correlation between predictions and the total number of TFs, with 95% bootstrap confidence interval. C. Correlation between predictions and the number of TFBSs/number of TFs, with 95% bootstrap confidence interval. D. Correlation between predictions and the number of TFBSs/sequence length, with 95% bootstrap confidence interval.

**Figure 9. Comparison of the degree of conservation of transcription factor binding sites, CRMs and noncoding regions in the Drosophila genome and human ENCODE regions.**
The probability density shows that, for *Drosophila*, the REDfly CRMs are more conserved than the transcription factor binding sites. For human ENCODE regions, the transcription factor binding sites are more conserved than the DNaseI hypersensitive sites.

**Figure 10. Predictions on ENCODE regions.**
Ranked by performance, the methods appear from top to bottom in this order: Regulatory Potential, MorphMS, ClusterBuster, phastCons score, and EEL.

**Figure 11. The majority of existing methods target regions rich in cis-regulatory elements.**
Existing methods predict CRMs based on distance and conservation features. As a result, their targets are limited to regions rich in closely located and highly conserved *cis-regulatory elements* (green region) rather than real functional modular CRMs (pink regions). Consequently, they will miss CRMs composed of elements that are not conserved in the same order (e.g. CRM 1), CRMs that are not conserved (e.g. CRM 2), and CRMs composed of elements that lie further apart (e.g. CRM 3). At the same time, incomplete regions within a CRM, or overlapping regions shared by more than one CRM, could be predicted as a false positive, e.g. a false positive prediction composed of CRM 2 and part of CRM 3.
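Figures 7 and 8 report correlation coefficients with 95% bootstrap confidence intervals. The captions do not say which correlation coefficient or bootstrap variant was used, so the sketch below makes two assumptions: a Pearson coefficient and a simple percentile bootstrap. The function names `pearson` and `bootstrap_ci`, the default resample count, and the fixed seed are illustrative choices, not details taken from the paper.

```python
import math
import random
from typing import Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    if len(set(x)) < 2 or len(set(y)) < 2:
        raise ValueError("correlation is undefined for constant input")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def bootstrap_ci(x: Sequence[float], y: Sequence[float],
                 n_resamples: int = 2000, level: float = 0.95,
                 seed: int = 0) -> Tuple[float, float, float]:
    """Return (estimate, lower, upper) using a percentile bootstrap.

    Pairs (x[i], y[i]) are resampled with replacement; resamples in
    which either variable is constant are skipped because the
    correlation is undefined there.
    """
    estimate = pearson(x, y)
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(x)
    resampled = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            resampled.append(pearson([x[i] for i in idx],
                                     [y[i] for i in idx]))
        except ValueError:
            continue
    if not resampled:
        raise ValueError("no usable bootstrap resamples")
    resampled.sort()
    alpha = (1.0 - level) / 2.0
    last = len(resampled) - 1
    lower = resampled[math.floor(alpha * last)]
    upper = resampled[math.ceil((1.0 - alpha) * last)]
    return estimate, lower, upper
```

In this setting `x` would hold a method's prediction scores and `y` a sequence feature such as conservation, length, or the number of binding sites, one pair per CRM.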

Cited by

A statistical thin-tail test of predicting regulatory regions in the Drosophila genome.
Shu JJ, Li Y. Theor Biol Med Model. 2013 Feb 14;10:11. doi: 10.1186/1742-4682-10-11. PMID: 23409927. Free PMC article.
Identifying transcriptional cis-regulatory modules in animal genomes.
Suryamohan K, Halfon MS. Wiley Interdiscip Rev Dev Biol. 2015 Mar-Apr;4(2):59-84. doi: 10.1002/wdev.168. Epub 2014 Dec 29. PMID: 25704908. Free PMC article. Review.
Annotating the Insect Regulatory Genome.
Asma H, Halfon MS. Insects. 2021 Jun 29;12(7):591. doi: 10.3390/insects12070591. PMID: 34209769. Free PMC article. Review.
LOESS correction for length variation in gene set-based genomic sequence analysis.
Aboukhalil A, Bulyk ML. Bioinformatics. 2012 Jun 1;28(11):1446-54. doi: 10.1093/bioinformatics/bts155. Epub 2012 Apr 5. PMID: 22492312. Free PMC article.
Integrating motif, DNA accessibility and gene expression data to build regulatory maps in an organism.
Blatti C, Kazemian M, Wolfe S, Brodsky M, Sinha S. Nucleic Acids Res. 2015 Apr 30;43(8):3998-4012. doi: 10.1093/nar/gkv195. Epub 2015 Mar 19. PMID: 25791631. Free PMC article.
