Comparative Study
PLoS Comput Biol. 2010 Dec 2;6(12):e1001020.
doi: 10.1371/journal.pcbi.1001020.

Assessing computational methods of cis-regulatory module prediction

Jing Su et al. PLoS Comput Biol.

Abstract

Computational methods attempting to identify instances of cis-regulatory modules (CRMs) in the genome face a challenging problem: searching for potentially interacting transcription factor binding sites while knowledge of the specific interactions involved remains limited. Without a comprehensive comparison of their performance, the reliability and accuracy of these tools remain unclear. Faced with a large number of different tools that address this problem, we summarized and categorized them based on search strategy and input data requirements. Twelve representative methods were chosen and applied to predict CRMs from the Drosophila CRM database REDfly, and across the human ENCODE regions. Our results show that the optimal choice of method varies depending on the species and the composition of the sequences in question. When discriminating CRMs from non-coding regions, methods that consider evolutionary conservation have stronger predictive power than methods designed to be run on a single genome. Different CRM representations and search strategies rely on different CRM properties, and different methods can complement one another. For example, some favour homotypical clusters of binding sites, while others perform best on short CRMs. Furthermore, most methods appear to be sensitive to the composition and structure of the genome to which they are applied. We analyze the principal features that distinguish the methods that performed well, identify weaknesses leading to poor performance, and provide a guide for users. We also propose key considerations for the development and evaluation of future CRM-prediction methods.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1. Schematic representation of cis-regulatory modules.
A cis-regulatory module contains multiple binding sites for multiple transcription factors within a compact sequence interval. The binding affinity and orientation of each binding site, the spacing and cooperative relationships between binding sites, and the distance of the cis-regulatory module from the transcription start site of the gene it regulates may all be important properties of a given cis-regulatory module.
Figure 2. Classification of search strategies.
Search strategies for the CRM prediction methods can be broadly subdivided into four families: window clustering, probabilistic modelling, phylogenetic footprinting, and discriminative modelling.
Figure 3. Properties of CRM prediction methods.
Figure 4. Ranking of methods (short introns and exons).
A. Predictions of CRMs against short introns. There are two ROC curves for each method, one for 500 bp and one for 200 bp window size. B. Predictions of CRMs against exons. There are two ROC curves for each method, one for 500 bp and one for 200 bp window size. C. Ranking of methods by Area Under Curve scores.
Figure 5. Ranking of methods (medium length introns and intergenic regions).
A. Predictions of CRMs against medium length introns. There are two ROC curves for each method, one for 500 bp and one for 200 bp window size. B. Predictions of CRMs against intergenic regions. There are two ROC curves for each method, one for 500 bp and one for 200 bp window size. C. The Area Under Curve scores of the assessed methods.
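The Area Under Curve ranking used in Figures 4 and 5 can be illustrated with a minimal sketch. It assumes each method assigns one score per sequence and that higher scores mean "more CRM-like"; the function names, the method names, and the toy scores are hypothetical and are not taken from the study. The AUC is computed here as the probability that a randomly chosen CRM outscores a randomly chosen background sequence, which equals the area under the ROC curve.

```python
def auc_from_scores(pos_scores, neg_scores):
    """Return the ROC AUC for scores of positives (CRMs) and negatives (background).

    Computed as the fraction of positive/negative pairs in which the positive
    scores higher; ties count as half a win.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def rank_methods(predictions):
    """Rank methods by AUC, best first.

    predictions maps a method name to a (pos_scores, neg_scores) pair.
    """
    aucs = {name: auc_from_scores(pos, neg) for name, (pos, neg) in predictions.items()}
    return sorted(aucs.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy scores, for illustration only.
toy_predictions = {
    "method_a": ([0.9, 0.8, 0.4], [0.3, 0.5, 0.1]),
    "method_b": ([0.6, 0.2, 0.7], [0.5, 0.65, 0.3]),
}
ranking = rank_methods(toy_predictions)
```

The pairwise loop is quadratic in the number of sequences, which is acceptable for a sketch; a rank-based computation would scale better on large sets.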
Figure 6. Correlations and complementarity of methods.
A. Correlation coefficients of predictions on CRMs. B. Performance of pairs of methods. C. Improvement made by combining pairs of methods: StubbMS_w200 and StubbMS_w500, CisPlusFinder_w200 and MSCAN_w200 to StubbMS_w200 and StubbMS_w500.
Figure 7. Correlation coefficients between predictions and sequence features.
A. Correlation coefficients between predictions and sequence conservation, with 95% bootstrap confidence intervals. B. Correlation coefficients between predictions and sequence length, with 95% bootstrap confidence intervals.
Figure 8. Correlation coefficients between predictions and CRM properties.
A. Correlation between predictions and the total number of TFBSs, with 95% bootstrap confidence interval. B. Correlation between predictions and the total number of TFs, with 95% bootstrap confidence interval. C. Correlation between predictions and the number of TFBSs/number of TFs, with 95% bootstrap confidence interval. D. Correlation between predictions and the number of TFBSs/sequence length, with 95% bootstrap confidence interval.
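The 95% bootstrap confidence intervals mentioned in Figures 7 and 8 can be sketched as follows. The captions do not say which correlation coefficient or which bootstrap variant was used, so this example assumes Pearson correlation and a simple percentile bootstrap; the function names and parameters are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient; returns NaN if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def bootstrap_ci(xs, ys, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the Pearson correlation.

    Resamples (x, y) pairs with replacement, recomputes the correlation for
    each resample, and returns the lower and upper percentiles.
    """
    rng = random.Random(seed)
    n = len(xs)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        r = pearson([xs[i] for i in idx], [ys[i] for i in idx])
        if not math.isnan(r):  # skip degenerate resamples with no variance
            stats.append(r)
    stats.sort()
    alpha = (1.0 - level) / 2.0
    lower = stats[int(alpha * len(stats))]
    upper = stats[min(len(stats) - 1, int((1.0 - alpha) * len(stats)))]
    return lower, upper
```

In this setting, `xs` would hold one method's prediction scores and `ys` a sequence feature such as conservation or the number of TFBSs per CRM.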
Figure 9. Comparison of the degree of conservation of transcription factor binding sites, CRMs and noncoding regions in the Drosophila genome and human ENCODE regions.
The probability density shows that, for Drosophila, the REDfly CRMs are more conserved than the transcription factor binding sites. For human ENCODE regions, the transcription factor binding sites are more conserved than the DNaseI hypersensitive sites.
Figure 10. Predictions on ENCODE regions.
The methods are ranked by performance, from top to bottom, in this order: Regulatory Potential, MorphMS, ClusterBuster, phastCons score, and EEL.
Figure 11. The majority of existing methods target regions rich in cis-regulatory elements.
Existing methods predict CRMs based on distance and conservation features. This limits their targets to regions rich in closely spaced, highly conserved cis-regulatory elements (green region) rather than true functional modular CRMs (pink regions). Consequently, they will miss CRMs composed of elements not conserved in the same order (e.g. CRM 1), CRMs that are not conserved (e.g. CRM 2), and CRMs composed of elements that lie further apart (e.g. CRM 3). At the same time, an incomplete region within a CRM, or an overlapping region shared by more than one CRM, could be predicted as a false positive, e.g. a false positive prediction composed of CRM 2 and part of CRM 3.