PLoS Comput Biol. 2008 Aug 29;4(8):e1000156.
doi: 10.1371/journal.pcbi.1000156.

PhyloGibbs-MP: module prediction and discriminative motif-finding by Gibbs sampling

Rahul Siddharthan. PLoS Comput Biol.

Abstract

PhyloGibbs, our recent Gibbs-sampling motif-finder, takes phylogeny into account in detecting binding sites for transcription factors in DNA and assigns posterior probabilities to its predictions obtained by sampling the entire configuration space. Here, in an extension called PhyloGibbs-MP, we widen the scope of the program, addressing two major problems in computational regulatory genomics. First, PhyloGibbs-MP can localise predictions to small, undetermined regions of a large input sequence, thus effectively predicting cis-regulatory modules (CRMs) ab initio while simultaneously predicting binding sites in those modules, tasks that are usually done by two separate programs. PhyloGibbs-MP's performance at such ab initio CRM prediction is comparable with or superior to that of dedicated module-prediction software that uses prior knowledge of previously characterised transcription factors. Second, PhyloGibbs-MP can predict motifs that differentiate between two (or more) different groups of regulatory regions, that is, motifs that occur preferentially in one group over the others. While other "discriminative motif-finders" have been published in the literature, PhyloGibbs-MP's implementation has some unique features and flexibility. Benchmarks on synthetic and actual genomic data show that this algorithm is successful at enhancing predictions of differentiating sites and suppressing predictions of common sites, and that it matches or outperforms other discriminative motif-finders on actual genomic data. Additional enhancements include significant performance and speed improvements, the ability to use "informative priors" on known transcription factors, and the ability to output annotations in a format that can be visualised with the Generic Genome Browser. In stand-alone motif-finding, PhyloGibbs-MP remains competitive, outperforming PhyloGibbs-1.0 and other programs on benchmark data.

Conflict of interest statement

The author has declared that no competing interests exist.

Figures

Figure 1. The performance of various motif-finders on predicting yeast binding-site data taken from SCPD.
Specificity (the fraction of predicted sites that are present in SCPD) is plotted as a function of sensitivity (the fraction of SCPD sites that are found by the motif-finder); sensitivity is varied by cutting off predictions below a varying significance threshold as reported by the individual program. Three runs of PhyloGibbs-MP are reported: phylogibbs-mp-n8 is a run that specifies a maximum of 8 colours (types of motif); phylogibbs-mp-n8-I is the same, but with “importance sampling” turned off; and phylogibbs-mp-n3 is a run that specifies a maximum of 3 colours.
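As a rough illustration of how one point on such a curve can be computed, the sketch below applies the definitions in this caption at a single significance cutoff. The function name, the dictionary of scored predictions, and the exact-match rule for deciding that a predicted site is an annotated site are all assumptions made for illustration; they are not taken from PhyloGibbs-MP or from the benchmark scripts.

```python
from typing import Dict, Optional, Set, Tuple


def sensitivity_specificity(
    predictions: Dict[str, float],
    known_sites: Set[str],
    threshold: float,
) -> Tuple[float, Optional[float]]:
    """Return (sensitivity, specificity) at one significance cutoff.

    predictions: hypothetical mapping from a site identifier to the
        significance score reported by a motif-finder.
    known_sites: hypothetical set of annotated site identifiers.
    Matching by identical identifier is an assumption of this sketch.
    """
    # Keep only predictions at or above the cutoff.
    kept = {site for site, score in predictions.items() if score >= threshold}
    true_positives = len(kept & known_sites)

    # Sensitivity: fraction of annotated sites that were found.
    sensitivity = true_positives / len(known_sites) if known_sites else 0.0
    # Specificity (as defined in the caption): fraction of predicted sites
    # that are annotated. Undefined (None) when nothing is predicted.
    specificity = true_positives / len(kept) if kept else None
    return sensitivity, specificity


# Toy data, invented purely for illustration.
example_predictions = {"s1": 0.9, "s2": 0.8, "s3": 0.6, "s4": 0.3}
example_known = {"s1", "s3", "s5"}

# Lowering the cutoff admits more predictions and traces out the curve.
for cutoff in (0.7, 0.5, 0.2):
    print(cutoff, sensitivity_specificity(example_predictions, example_known, cutoff))
```

Sweeping the cutoff from high to low yields one (sensitivity, specificity) pair per value, which is the kind of curve plotted in the figure.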
Figure 2. The performance of various motif-finders on predicting binding sites in D. melanogaster taken from REDfly 2.0.
The interpretation is similar to that in Figure 1.
Figure 3. The performance of PhyloGibbs-MP in discriminative and non-discriminative mode, on synthetic data, compared with other programs.
Each data set consists of two sets of sequences, with one “common” motif embedded in both sets and two “discriminative” motifs embedded one in each set, with five copies per sequence per motif per set. Specificity as a function of sensitivity is shown. For PhyloGibbs-MP, “nodiff” indicates non-discriminative mode, while the other labels indicate the value of the discriminative parameter (-d): 0.1, 0.4 or 0.99. This figure shows performance in detecting common motifs on these data; Figure 4 shows performance in detecting discriminative motifs.
Figure 4. Performance of various programs in detecting discriminative motifs, on the same data as in Figure 3.
Figure 5. Performance of discriminative motif-finders on pairs of regulatory regions from yeast.
Figure 6. Performance of discriminative motif-finders on pairs of regulatory sequences from fly.
Figure 7. For fifteen transcription factors that bind between 4 and 9 sequences with p<0.001 in ChIP-chip experiments reported by Harbison et al., the weight matrices reported by those authors, in both orientations, are compared with the predictions of four discriminative motif-finders on binding sequences discriminated against randomly chosen non-binding sequences.
No other prior information was used. PhyloGibbs-MP does not internally characterise discriminative sets as “positive” or “negative” but only predictions from the positive set (including, in some cases, multiple predictions) are reported. Other programs make at most one prediction per set. All programs report position weight matrices, which were used directly to generate sequence logos (using WebLogo and some helper scripts). The predictions are discussed, qualitatively, in the text.
Figure 8. Results of running PhyloGibbs-MP, in module-prediction mode, on the 8 kb sequence upstream of the eve gene in Drosophila.
When run without priors, predictions lie on or close to all four annotated modules in this region from the REDfly database. When weight matrices for the gap transcription factors are used as priors, PhyloGibbs-MP fails to find the proximal promoter, but the stripe 2 and stripes 3+7 enhancers are detected with increased confidence. Predicted sites for individual motifs, as well as cumulative predictions over all motifs, are shown.
Figure 9. Performance of PhyloGibbs-MP with various parameter settings (with flyreg priors or without priors, and with 1, 2 or 4 orthologous aligned sequences), on detecting known cis-regulatory modules in regulatory regions of fly.
Figure 10. Performance of PhyloGibbs-MP (with flyreg priors, and 2 species) in detecting known CRMs in fly, compared with four other module finders.
Dotted lines indicate the performance expected if programs made predictions at random (that is, if, for each input sequence, the same number of site predictions were made but at random locations). Note that, in these data, 816457 bp out of 2448515 bp lie in annotated CRMs, so a completely random program would exhibit a specificity of roughly 0.33, in agreement with the dotted lines at high sensitivity.
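The baseline quoted above can be checked with a short calculation. The sketch below assumes that predictions placed uniformly at random land inside an annotated CRM with probability equal to the annotated fraction of the sequence; the function name is hypothetical and only the two base-pair counts come from the caption.

```python
def random_baseline_specificity(annotated_bp: int, total_bp: int) -> float:
    """Expected specificity of uniformly random predictions.

    Assumes each random prediction falls inside an annotated CRM with
    probability annotated_bp / total_bp (an assumption of this sketch).
    """
    if total_bp <= 0:
        raise ValueError("total_bp must be positive")
    return annotated_bp / total_bp


# Base-pair counts taken from the Figure 10 caption.
baseline = random_baseline_specificity(816457, 2448515)
print(f"{baseline:.2f}")  # roughly 0.33, as stated in the caption
```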
