Review
BMC Bioinformatics. 2007 Sep 27;8 Suppl 6(Suppl 6):S4.
doi: 10.1186/1471-2105-8-S6-S4.

Finding regulatory elements and regulatory motifs: a general probabilistic framework

Erik van Nimwegen. BMC Bioinformatics.

Abstract

Over the last two decades a large number of algorithms have been developed for regulatory motif finding. Here we show how many of these algorithms, especially those that model the binding specificities of regulatory factors with position-specific weight matrices (WMs), arise naturally within a general Bayesian probabilistic framework. We discuss how WMs are constructed from sets of regulatory sites, how sites for a given WM can be discovered by scanning large sequences, how to cluster WMs, and more generally how to partition large sets of sites from different WMs into clusters. We discuss how 'regulatory modules', clusters of sites for subsets of WMs, can be found in large intergenic sequences, and we discuss different methods for ab initio motif finding, including expectation maximization (EM) algorithms and motif sampling algorithms. Finally, we discuss extensively how module finding methods and ab initio motif finding methods can be extended to take phylogenetic relations between the input sequences into account, i.e. we show how motif finding and phylogenetic footprinting can be integrated in a rigorous probabilistic framework. The article is intended for readers with a solid background in applied mathematics, and preferably with some knowledge of general Bayesian probabilistic methods. The main purpose of the article is to show that these methods are not a disconnected set of individual algorithmic recipes but different facets of a single integrated probabilistic theory.

Figures

Figure 1
A dataset D consisting of a single sequence s of length L, with a single site hypothesized immediately after position i.
Figure 2
A configuration i with 3 hypothesized sites. S_i denotes the set of hypothesized sites and B_i the background bases.
Figure 3
Illustration of equation (16). The black rectangle indicates the sum F_n of probabilities P(D|i) over all binding site configurations i for the sequence within the rectangle. Any configuration in F_n is obtained either by adding a single background base at n to any of the configurations in F_{n-1}, or by adding a site from n - l + 1 through n to any configuration in F_{n-l}.
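The recursion described in the Figure 3 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the article's equation (16) itself: the function name `forward_sums`, the dictionary-based `wm` and `background` inputs, and the omission of any prior weight for site versus background are all hypothetical choices made here for brevity.

```python
def forward_sums(seq, wm, background):
    """Return F[0..L], where F[n] sums P(D|i) over all configurations
    of non-overlapping sites in the first n bases of seq.

    Assumptions (hypothetical, not taken from the article):
      wm         -- list of l dicts, wm[k][base] = probability of base at WM column k
      background -- dict, background[base] = background probability of base
      No prior weights for site vs. background are included.
    """
    l = len(wm)
    F = [0.0] * (len(seq) + 1)
    F[0] = 1.0  # empty prefix: one (empty) configuration
    for n in range(1, len(seq) + 1):
        # Case 1: base n is background, appended to any configuration in F[n-1].
        F[n] = F[n - 1] * background[seq[n - 1]]
        # Case 2: a site covers positions n-l+1..n, appended to any configuration in F[n-l].
        if n >= l:
            site_prob = 1.0
            for k in range(l):
                site_prob *= wm[k][seq[n - l + k]]
            F[n] += F[n - l] * site_prob
    return F
```

Because each F[n] depends only on F[n-1] and F[n-l], the sum over an exponentially large set of configurations is obtained in time proportional to L times l. For long sequences one would normally rescale or work in log space to avoid underflow; that is left out of this sketch.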
Figure 4
Illustration of the steps of the Gibbs sampling algorithm. The red profile indicates the posterior probability P(i_m | D, i_-), and in the last step a new position is sampled from this distribution.
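A single resampling step of the kind shown in Figure 4 might look as follows. This is a hedged sketch, assuming exactly one site per sequence, a position-independent background, and a symmetric Dirichlet pseudocount; the names `site_posterior`, `gibbs_step`, and `pseudo` are hypothetical and are not taken from the article.

```python
import random

def site_posterior(seqs, positions, m, l, bg, pseudo=1.0):
    """Posterior over site start positions in sequence m, given the sites
    currently assigned in all other sequences (assumed: one site per sequence)."""
    alphabet = "ACGT"
    # Column-wise base counts from the sites of all sequences except m,
    # initialised with the pseudocount.
    counts = [{a: pseudo for a in alphabet} for _ in range(l)]
    for j, (s, p) in enumerate(zip(seqs, positions)):
        if j == m:
            continue
        for k in range(l):
            counts[k][s[p + k]] += 1
    total = (len(seqs) - 1) + len(alphabet) * pseudo
    s = seqs[m]
    weights = []
    for i in range(len(s) - l + 1):
        # Ratio of WM probability to background probability over the window;
        # bases outside the window contribute the same factor for every i.
        w = 1.0
        for k in range(l):
            base = s[i + k]
            w *= (counts[k][base] / total) / bg[base]
        weights.append(w)
    z = sum(weights)
    return [w / z for w in weights]

def gibbs_step(seqs, positions, m, l, bg, rng=random):
    """Resample the site position in sequence m in place; return the posterior used."""
    post = site_posterior(seqs, positions, m, l, bg)
    positions[m] = rng.choices(range(len(post)), weights=post, k=1)[0]
    return post
```

Repeating `gibbs_step` over randomly chosen sequences m yields a chain of configurations; the profile returned by `site_posterior` plays the role of the red curve in the figure.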
Figure 5
Illustration of a general configuration with varying site numbers for multiple motifs (upper left) and examples of moves used to sample all possible configurations. In 1 a randomly chosen segment is 'recolored', leaving it either blank (background), coloring with any of the existing motifs, or coloring it with a new color (new motif). In 2 a colored segment is chosen at random and moved to another location. In 3 all segments in a motif are shifted by the same amount.
Figure 6
Illustration of the move-set for binding site clustering. Starting from a configuration C with three clusters, the top sequence in the blue cluster is chosen for resampling. It is removed from its cluster to produce configuration C-. Probabilities are then calculated for all configurations that would be obtained by inserting the sequence into any of the clusters or a new cluster (gray sequences), and finally one of these (C') is sampled. In this example the sequence was placed in a new cluster. For illustration purposes we have assumed all sequences in D have precisely the length l of the hypothesized site, so that each sequence can only be aligned in one way with any cluster. In general the sequences in D will be longer than l and one would also sample over all ways that the sequence can be aligned with each of the clusters.
Figure 7
The evolution of a set of orthologous bases along a phylogenetic tree. In the left panel the expression (77) is illustrated. For notational simplicity we write P_αβ for P_αβ(w, t). The middle panel illustrates the recursion relations (78), with c and c' the children of node n, S_c the set of bases in S that descend from c, and S_{c'} the set of bases in S that descend from c'. The right panel shows expression (77) for a more complex selection pattern, with branches evolving according to the WM in red and those evolving according to the background in black.
Figure 8
Probability for an alignment block assuming a site occurs in the reference sequence. In the top right an alignment segment S[i,l] is shown for the species S. cerevisiae (the reference), S. paradoxus, S. mikatae, and S. bayanus. First we check which sequences are gaplessly aligned with the reference. In this case S. mikatae contains a gap and the background model is assigned to this sequence. The reference has the WM model assigned by default (indicated in red). On the left the probabilities of the sequences from S. paradoxus and S. bayanus are compared with the WM (shown as a logo). The S. paradoxus sequence scores better under the WM than under the background, but the S. bayanus sequence scores better under the background than under the WM because of some mismatches to the WM consensus (bases in purple). Finally, on the bottom right the phylogenetic tree is shown, with the branches that evolve according to the WM in red and those evolving according to the background in black.
Figure 9
An input data set consisting of the multiple alignments of 3 sets of orthologous intergenic regions from S. cerevisiae, S. paradoxus, S. mikatae, and S. bayanus. A binding site configuration c with sites for three motifs (red, green, and blue) is indicated. Note that each site is extended over all sequences that are locally gaplessly aligned. Most columns in the data are scored according to the background model in this configuration. On the lower right one example of an alignment column S' that is scored according to the background is shown. On the lower left the alignment S_w of sequences assigned to the red motif w is shown. A single column from this alignment consists of two independent columns, S and S̃, that derive from the multiple alignments of intergenic regions 2 and 3 respectively. The trees on the left show that under this configuration, the columns S and S̃ are both assumed to have evolved according to the same WM w, as indicated by the red branches on their phylogenetic trees T and T̃.

