Review

BMC Bioinformatics. 2007 Sep 27;8 Suppl 6(Suppl 6):S4.

doi: 10.1186/1471-2105-8-S6-S4.

Finding regulatory elements and regulatory motifs: a general probabilistic framework

Erik van Nimwegen¹

PMID: 17903285
PMCID: PMC1995539
DOI: 10.1186/1471-2105-8-S6-S4

Erik van Nimwegen. BMC Bioinformatics. 2007.

Affiliation

¹ Biozentrum, University of Basel, and Swiss Institute of Bioinformatics, Klingelbergstrasse 50/70, Basel, Switzerland. erik.vannimwegan@unibas.ch


Abstract

Over the last two decades a large number of algorithms has been developed for regulatory motif finding. Here we show how many of these algorithms, especially those that model binding specificities of regulatory factors with position specific weight matrices (WMs), naturally arise within a general Bayesian probabilistic framework. We discuss how WMs are constructed from sets of regulatory sites, how sites for a given WM can be discovered by scanning of large sequences, how to cluster WMs, and more generally how to cluster large sets of sites from different WMs into clusters. We discuss how 'regulatory modules', clusters of sites for subsets of WMs, can be found in large intergenic sequences, and we discuss different methods for ab initio motif finding, including expectation maximization (EM) algorithms, and motif sampling algorithms. Finally, we extensively discuss how module finding methods and ab initio motif finding methods can be extended to take phylogenetic relations between the input sequences into account, i.e. we show how motif finding and phylogenetic footprinting can be integrated in a rigorous probabilistic framework. The article is intended for readers with a solid background in applied mathematics, and preferably with some knowledge of general Bayesian probabilistic methods. The main purpose of the article is to elucidate that all these methods are not a disconnected set of individual algorithmic recipes, but that they are just different facets of a single integrated probabilistic theory.

Figures

**Figure 1**
A dataset D consisting of a single sequence s of length L, with a single site hypothesized immediately after position i.

**Figure 2**
**A configuration i with 3 hypothesized sites.** S_i denotes the set of hypothesized sites and B_i the background bases.

**Figure 3**
**Illustration of equation (16)**. The black rectangle indicates the sum F_n of probabilities P(D|i) for all binding site configurations i for the sequence within the rectangle. Any configuration in F_n is obtained either through adding a single background base at n to any of the configurations in F_{n - 1}, or by adding a site from n - l + 1 through n to any configuration in F_{n - l}.
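
The recursion described in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: equation (16) itself is not reproduced on this page, so the sketch assumes an independent background model with base frequencies `bg`, a single WM given as a list of per-column probability dictionaries `wm`, and no prior weights on configurations. The names `forward_sum`, `wm`, and `bg` are hypothetical.

```python
def forward_sum(seq, wm, bg):
    """Return the list F, where F[n] sums P(D|i) over all configurations i
    of non-overlapping sites within the first n bases of seq.

    Assumptions (not taken from the source): i.i.d. background with
    frequencies bg, one WM of width l = len(wm), no configuration prior.
    """
    l = len(wm)
    F = [0.0] * (len(seq) + 1)
    F[0] = 1.0  # the empty prefix has exactly one (empty) configuration
    for n in range(1, len(seq) + 1):
        # Case 1: base n is background, appended to any configuration in F[n-1].
        F[n] = F[n - 1] * bg[seq[n - 1]]
        # Case 2: a site covers positions n-l+1 .. n, appended to F[n-l].
        if n >= l:
            p_site = 1.0
            for k in range(l):
                p_site *= wm[k][seq[n - l + k]]
            F[n] += F[n - l] * p_site
    return F
```

Because each step reuses the two previously computed sums, the total over all configurations is obtained in time proportional to the sequence length times the site width, rather than by enumerating the configurations explicitly.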

**Figure 4**
**Illustration of the steps of the Gibbs sampling algorithm.** The red profile indicates the posterior probability P(i_m|D, i^-) and in the last step a new position is sampled from this distribution.
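
One step of such a sampler might look as follows. This is a hedged sketch under several assumptions that are not stated on this page: exactly one site per sequence, a uniform prior over site positions, and a point-estimate WM built from the other sequences' sites with a pseudocount (the full Bayesian treatment would integrate over the WM instead). The function name `gibbs_step` and all parameter names are hypothetical.

```python
import random

ALPHABET = "ACGT"

def gibbs_step(seqs, positions, m, l, bg, pseudo=1.0, rng=random):
    """Resample the site start in sequence m given the sites in all other
    sequences (the configuration i^-). Returns (new_position, posterior)."""
    # Count bases per WM column from the sites of every sequence except m.
    counts = [{a: pseudo for a in ALPHABET} for _ in range(l)]
    for j, (s, p) in enumerate(zip(seqs, positions)):
        if j == m:
            continue
        for k in range(l):
            counts[k][s[p + k]] += 1.0
    # Normalize the counts into a WM (point estimate; an assumption).
    wm = []
    for col in counts:
        total = sum(col.values())
        wm.append({a: c / total for a, c in col.items()})
    # Score each start position in sequence m by the WM/background ratio.
    s = seqs[m]
    weights = []
    for i in range(len(s) - l + 1):
        w = 1.0
        for k in range(l):
            w *= wm[k][s[i + k]] / bg[s[i + k]]
        weights.append(w)
    norm = sum(weights)
    posterior = [w / norm for w in weights]  # plays the role of P(i_m | D, i^-)
    new_pos = rng.choices(range(len(posterior)), weights=posterior, k=1)[0]
    return new_pos, posterior
```

Repeating this step while cycling through the sequences yields a chain of configurations; passing a seeded `random.Random` instance as `rng` makes a run reproducible.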

**Figure 5**
**Illustration of a general configuration with varying site numbers for multiple motifs (upper left) and examples of moves used to sample all possible configurations.** In 1 a randomly chosen segment is 'recolored', leaving it either blank (background), coloring with any of the existing motifs, or coloring it with a new color (new motif). In 2 a colored segment is chosen at random and moved to another location. In 3 all segments in a motif are shifted by the same amount.

**Figure 6**
**Illustration of the move-set for binding site clustering.** Starting from a configuration C with three clusters, the top sequence in the blue cluster is chosen for resampling. It is removed from its cluster to produce configuration C^-. Probabilities are then calculated for all configurations that would be obtained by inserting the sequence into any of the clusters or a new cluster (gray sequences), and finally one of these (C') is sampled. In this example the sequence was placed in a new cluster. For illustration purposes we have assumed all sequences in D have precisely the length l of the hypothesized site, so that each sequence can only be aligned in one way with any cluster. In general the sequences in D will be longer than l and one would also sample over all ways that the sequence can be aligned with each of the clusters.

**Figure 7**
**The evolution of a set of orthologous bases along a phylogenetic tree.** In the left panel the expression (77) is illustrated. For notational simplicity we write P_αβ for P_αβ(w, t). The middle panel illustrates the recursion relations (78) with c and c' the children of node n, S_c the set of bases in S that descend from c and S_c' the set of bases in S that descend from c'. The right panel shows expression (77) for a more complex selection pattern with branches evolving according to the WM in red, and those evolving according to the background in black.
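
The recursion over the children of a node can be illustrated with a short Python sketch. Expressions (77) and (78) are not reproduced on this page, so this is only an assumed form: the transition probability P_αβ(w, t) is taken to be of the Felsenstein-1981 type (the base is kept with probability e^{-t}, otherwise redrawn from the column distribution w), and the tree is encoded as nested dictionaries. The names `transition`, `subtree_likelihood`, and `column_probability`, as well as the tree encoding, are hypothetical.

```python
import math

ALPHABET = "ACGT"

def transition(a, b, w, t):
    """Assumed form of P_ab(w, t): keep base a with probability exp(-t),
    otherwise draw the new base from the distribution w."""
    stay = math.exp(-t)
    return (stay if a == b else 0.0) + (1.0 - stay) * w[b]

def subtree_likelihood(node, w):
    """Return {a: P(observed leaves below node | base a at node)}.

    A leaf is {"base": "A"}; an internal node is
    {"children": [(child, branch_length), ...]}.
    """
    if "base" in node:
        return {a: 1.0 if a == node["base"] else 0.0 for a in ALPHABET}
    result = {a: 1.0 for a in ALPHABET}
    for child, t in node["children"]:
        below = subtree_likelihood(child, w)
        for a in ALPHABET:
            # Sum over the base b at the child, as in a pruning recursion.
            result[a] *= sum(transition(a, b, w, t) * below[b] for b in ALPHABET)
    return result

def column_probability(tree, w):
    """Probability of one alignment column, with the root base drawn from w."""
    root = subtree_likelihood(tree, w)
    return sum(w[a] * root[a] for a in ALPHABET)
```

For a mixed selection pattern such as the one in the right panel, one could pass a different distribution (WM column or background) for each branch; the sketch uses a single `w` throughout to stay short.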

**Figure 8**
**Probability for an alignment block assuming a site occurs in the reference sequence.** In the top right an alignment segment S_[i,l] is shown for the species *S. cerevisiae* (the reference), *S. paradoxus*, *S. mikatae*, and *S. bayanus*. First we check which sequences are gaplessly aligned with the reference. In this case *S. mikatae* contains a gap and the background model is assigned to this sequence. The reference has the WM model assigned by default (indicated in red). On the left the probabilities of the sequences from *S. paradoxus* and *S. bayanus* are compared with the WM (shown as a logo). It turns out that the *S. paradoxus* sequence scores better under the WM than under the background, whereas the *S. bayanus* sequence scores better under the background than under the WM, because of some mismatches to the WM consensus (bases in purple). Finally, on the bottom right the phylogenetic tree is indicated with the branches that evolve according to the WM in red, and those evolving according to the background in black.

**Figure 9**
**An input data-set consisting of the multiple alignments of 3 sets of orthologous intergenic regions from *S. cerevisiae, S. paradoxus, S. mikatae*, and *S. bayanus*.** A binding site configuration c with sites for three motifs (red, green, and blue) is indicated. Note that each site is extended over all sequences that are locally gaplessly aligned. Most columns in the data are scored according to the background model in this configuration. On the lower right one example of an alignment column S' that is scored according to the background is shown. On the lower left the alignment S_w of sequences assigned to the red motif w is shown. A single column from this alignment consists of two independent columns, S and S̃, that derive from the multiple alignments of intergenic regions 2 and 3 respectively. The trees on the left show that under this configuration, the columns S and S̃ are both assumed to have evolved according to the same WM w, as indicated by the red branches on their phylogenetic trees T and T̃.


