Comparative Study
PLoS Comput Biol. 2005 Dec;1(7):e67.
doi: 10.1371/journal.pcbi.0010067. Epub 2005 Dec 9.

PhyloGibbs: a Gibbs sampling motif finder that incorporates phylogeny

Rahul Siddharthan et al. PLoS Comput Biol. 2005 Dec.

Abstract

A central problem in the bioinformatics of gene regulation is to find the binding sites for regulatory proteins. One of the most promising approaches toward identifying these short and fuzzy sequence patterns is the comparative analysis of orthologous intergenic regions of related species. This analysis is complicated by various factors. First, one needs to take the phylogenetic relationship between the species into account in order to distinguish conservation that is due to the occurrence of functional sites from spurious conservation that is due to evolutionary proximity. Second, one has to deal with the complexities of multiple alignments of orthologous intergenic regions, and one has to consider the possibility that functional sites may occur outside of conserved segments. Here we present a new motif sampling algorithm, PhyloGibbs, that runs on arbitrary collections of multiple local sequence alignments of orthologous sequences. The algorithm searches over all ways in which an arbitrary number of binding sites for an arbitrary number of transcription factors (TFs) can be assigned to the multiple sequence alignments. These binding site configurations are scored by a Bayesian probabilistic model that treats aligned sequences by a model for the evolution of binding sites and "background" intergenic DNA. This model takes the phylogenetic relationship between the species in the alignment explicitly into account. The algorithm uses simulated annealing and Monte Carlo Markov-chain sampling to rigorously assign posterior probabilities to all the binding sites that it reports. In tests on synthetic data and real data from five Saccharomyces species our algorithm performs significantly better than four other motif-finding algorithms, including algorithms that also take phylogeny into account. Our results also show that, in contrast to the other algorithms, PhyloGibbs can make realistic estimates of the reliability of its predictions. 
Our tests suggest that, running on the five-species multiple alignment of a single gene's upstream region, PhyloGibbs on average recovers over 50% of all binding sites in S. cerevisiae at a specificity of about 50%, and 33% of all binding sites at a specificity of about 85%. We also tested PhyloGibbs on collections of multiple alignments of intergenic regions that were recently annotated, based on ChIP-on-chip data, to contain binding sites for the same TF. We compared PhyloGibbs's results with the previous analysis of these data using six other motif-finding algorithms. For 16 of 21 TFs for which all other motif-finding methods failed to find a significant motif, PhyloGibbs did recover a motif that matches the literature consensus. In 11 cases where there was disagreement in the results we compiled lists of known target genes from the literature, and found that running PhyloGibbs on their regulatory regions yielded a binding motif matching the literature consensus in all but one of the cases. Interestingly, these literature gene lists had little overlap with the targets annotated based on the ChIP-on-chip data. The PhyloGibbs code can be downloaded from http://www.biozentrum.unibas.ch/~nimwegen/cgi-bin/phylogibbs.cgi or http://www.imsc.res.in/~rsidd/phylogibbs. The full set of predicted sites from our tests on yeast is available at http://www.swissregulon.unibas.ch.

Conflict of interest statement

Competing interests. The authors have declared that no competing interests exist.

Figures

Figure 1. Binding Site Configuration
A window, in our terminology, is a possible binding site for a TF; in the case of phylogenetically unrelated sequences it is simply a set of m contiguous bases in a sequence, with m the binding site width. This figure shows a configuration C containing a total of eight windows (rectangles) for three different WMs (red, blue, and green). Note that a single sequence of length L has L - m + 1 windows in it.
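As a small illustration of the window count for a single unaligned sequence, the sketch below enumerates every window of m contiguous bases and confirms that there are L - m + 1 of them. The function name `enumerate_windows` and the example sequence are assumptions made for illustration; they are not taken from the PhyloGibbs code.

```python
def enumerate_windows(sequence: str, m: int) -> list[tuple[int, str]]:
    """Return (start, bases) for every window of m contiguous bases."""
    if m <= 0:
        raise ValueError("window width m must be positive")
    # A window starting at index i covers sequence[i:i + m];
    # the last valid start is L - m, so there are L - m + 1 starts in total.
    return [(i, sequence[i:i + m]) for i in range(len(sequence) - m + 1)]


seq = "ACGTTGCAAC"  # hypothetical sequence with L = 10
windows = enumerate_windows(seq, m=4)
print(len(windows))  # L - m + 1 = 10 - 4 + 1 = 7
```

A sequence shorter than m yields an empty list, which matches the formula once negative counts are read as zero.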
Figure 2. An Alignment of Four Sequences Showing Three Legitimate Windows and One Illegitimate Window
Vertically aligned capital letters are phylogenetically related bases, assumed to have evolved from a common ancestor. Thus, any window placed on these bases is extended to cover all related bases. Three legitimate windows are surrounded by solid boxes. The window surrounded by the dotted box is illegitimate because the gap in the top sequence makes the alignment of bases inconsistent. Note that lower case letters are not aligned and that, in order to complete a window with aligned sequences, one may slide lowercase bases “through” adjoining gaps. For example, if the window on the bottom two sequences were to move two steps to the left, the “c” and “a” on the left side of the preceding gaps would slide through the gaps to the right to complete the window.
Figure 3. Performance of PhyloGibbs and Non-Phylo Motif-Finding Algorithms on Alignments of Orthologous Intergenic Regions as a Function of the Evolutionary Proximity of the Orthologs and the Quality of the WM
PhyloGibbs with phylogeny (red), PhyloGibbs in non-phylo mode (light blue), WGibbs (dark blue), and MEME (pink) were run on alignments of S = 5 intergenic regions of length L = 500, each at a proximity q to the common ancestor and each containing s = 4 binding sites from a single WM of width w = 10. In the upper left panel, WMs had polarization p = 0.6, in the upper right p = 0.75, in the lower left p = 0.9, and in the lower right random WMs (drawn uniformly from the simplex) were used. The solid lines show the average overlaps between the predicted sites and the real sites, and the dotted lines show two standard errors (estimated from 50 different datasets generated with equal parameters for each data point).
Figure 4. Performance of PhyloGibbs in Recovering a Single Site of a Randomly Chosen WM of Width w = 10 from the Alignment of S Orthologous Intergenic Regions of Proximity q = 0.5 and Length L = 500 as a Function of S
The solid line shows the average overlap between the true site and the predicted site and the dotted lines show two standard errors.
Figure 5. Performance of Several Motif-Finding Algorithms on Synthetic Data Prepared as for Figure 3
A total of 250 alignments of S = 5 orthologous intergenic regions of length L = 750 and proximity q = 0.5 were created with three binding sites sampled from each of three different random WMs. The left panel shows how the fraction of predicted sites that match true sites (specificity) depends on the fraction of true sites that are among the predictions (sensitivity) for PhyloGibbs (red), EMnEM (yellow), PhyME (green), PhyloGibbs without phylogeny (light blue), WGibbs (dark blue), and MEME (pink). Dashed lines correspond to two standard errors. The right panel shows the ability of the different algorithms to assess their own reliability. The true specificity is shown as a function of the specificity that the algorithm predicts for the sites that it reports. The black line y = x corresponds to a perfect assessment of the algorithm's reliability.
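The two quantities plotted in the left panel can be computed as in the minimal sketch below, which uses the caption's definitions: sensitivity is the fraction of true sites among the predictions, and specificity is the fraction of predicted sites that match true sites. Representing a site as an (alignment, position) pair and requiring an exact match are simplifying assumptions for illustration; the paper's actual matching criterion may differ.

```python
Site = tuple[str, int]  # hypothetical site key: (alignment id, start position)


def sensitivity_specificity(predicted: set[Site], true: set[Site]) -> tuple[float, float]:
    """Return (sensitivity, specificity) as defined in the caption."""
    matched = predicted & true  # predictions that coincide with true sites
    sensitivity = len(matched) / len(true) if true else 0.0
    specificity = len(matched) / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Hypothetical example: 4 true sites, 3 predictions, 2 of which are correct.
true_sites = {("aln1", 12), ("aln1", 40), ("aln2", 7), ("aln2", 90)}
predicted_sites = {("aln1", 12), ("aln2", 7), ("aln2", 55)}
sens, spec = sensitivity_specificity(predicted_sites, true_sites)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Sweeping a cutoff on each prediction's reported posterior probability and recomputing both values at every cutoff would trace out a curve of the kind shown in the panel.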
Figure 6. Performance of Several Motif-Finding Algorithms on 200 Alignments of Orthologous Intergenic Regions from Five Saccharomyces Species Containing Documented Binding Sites
The left panel shows how the fraction of predicted sites that match true sites (specificity) depends on the fraction of true sites that are among the predictions (sensitivity) for PhyloGibbs (red), EMnEM (yellow), PhyME (green), PhyloGibbs without phylogeny (light blue), WGibbs (dark blue), and MEME (pink). Dashed lines correspond to one standard error. For the specificities predicted by the various algorithms to match the true specificities, we have to assume that the known sites are only a fraction of all true sites. The right panel shows what the fraction of known sites among all true sites would have to be for each algorithm's predicted specificities to match the true specificities. The black line shows an independent estimate of the fraction of real sites in these upstream regions that is documented (see text).
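One simple way to read the right panel is sketched below. It assumes, purely for illustration, that the documented sites are a uniformly random subset of all true sites, so that the specificity measured against documented sites is roughly the known fraction times the algorithm's predicted specificity. Under that assumption the implied known fraction is the ratio of the two. The function name and the numbers are hypothetical, and the paper's own calculation (see text) may differ.

```python
def implied_known_fraction(measured_specificity: float, predicted_specificity: float) -> float:
    """Fraction f of true sites that would have to be documented so that
    measured_specificity = f * predicted_specificity (assumed relation)."""
    if not 0.0 < predicted_specificity <= 1.0:
        raise ValueError("predicted specificity must lie in (0, 1]")
    if not 0.0 <= measured_specificity <= 1.0:
        raise ValueError("measured specificity must lie in [0, 1]")
    return measured_specificity / predicted_specificity


# Hypothetical numbers: an algorithm predicts that 50% of its reported sites
# are real, but only 25% of them match documented sites.
print(implied_known_fraction(0.25, 0.5))
```

A ratio above 1 would indicate that the algorithm underestimates its own specificity, since no more than all of the true sites can be documented.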
