PLoS One. 2008 Mar 26;3(3):e1820.
doi: 10.1371/journal.pone.0001820.

Probabilistic inference of transcription factor binding from multiple data sources

Harri Lähdesmäki et al. PLoS One. 2008.

Abstract

An important problem in molecular biology is to build a complete understanding of transcriptional regulatory processes in the cell. We have developed a flexible, probabilistic framework to predict TF binding from multiple data sources that differs from the standard hypothesis testing (scanning) methods in several ways. Our probabilistic modeling framework estimates the probability of binding and, thus, naturally reflects our degree of belief in binding. Probabilistic modeling also allows for easy and systematic integration of our binding predictions into other probabilistic modeling methods, such as expression-based gene network inference. The method answers the question of whether the whole analyzed promoter has a binding site, but can also be extended to estimate the binding probability at each nucleotide position. Further, we introduce an extension to model combinatorial regulation by several TFs. Most importantly, the proposed methods can make principled probabilistic inference from multiple evidence sources, such as multiple statistical models (motifs) of the TFs, evolutionary conservation, regulatory potential, CpG islands, nucleosome positioning, DNase hypersensitive sites, ChIP-chip binding segments and other (prior) sequence-based biological knowledge. We developed both a likelihood and a Bayesian method, where the latter is implemented with a Markov chain Monte Carlo algorithm. Results on a carefully constructed test set from the mouse genome demonstrate that principled data fusion can significantly improve the performance of TF binding prediction methods. We also applied the probabilistic modeling framework to all promoters in the mouse genome, and the results indicate a sparse connectivity between transcriptional regulators and their target promoters. To facilitate analysis of other sequences and additional data, we have developed an online web tool, ProbTF, which implements our probabilistic TF binding prediction method using multiple data sources.
A test data set, a web tool, source code, and supplementary data are available at: http://www.probtf.org.
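
To make the contrast with threshold-based scanning concrete, the following is a minimal sketch of how a promoter-level binding probability, and its per-position extension, could be computed. It is not the ProbTF implementation: it assumes a single motif, at most one binding site per sequence, a uniform prior over site positions, and a 0th-order background model. The names `PWM`, `BACKGROUND`, `binding_probability`, and `position_posteriors`, as well as the toy matrix values and the `prior` parameter, are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical toy motif (3 columns) and 0th-order background model.
# These values are illustrative assumptions, not data from the paper.
PWM = [
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
]
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def _likelihood_ratios(seq, pwm, background):
    """Motif-vs-background likelihood ratio for every possible site start."""
    width = len(pwm)
    n_positions = len(seq) - width + 1
    if n_positions < 1:
        raise ValueError("sequence is shorter than the motif")
    ratios = []
    for start in range(n_positions):
        # Sum logs for numerical stability, then exponentiate once.
        log_ratio = sum(
            math.log(pwm[k][seq[start + k]] / background[seq[start + k]])
            for k in range(width)
        )
        ratios.append(math.exp(log_ratio))
    return ratios


def position_posteriors(seq, pwm, background, prior=0.5):
    """Posterior probability that the single site starts at each position.

    Assumes zero or one site, with prior probability `prior` that a site
    exists and a uniform prior over its start position.
    """
    ratios = _likelihood_ratios(seq, pwm, background)
    n_positions = len(ratios)
    # Evidence relative to the pure-background likelihood (Bayes' rule).
    evidence = prior * sum(ratios) / n_positions + (1.0 - prior)
    return [prior * r / n_positions / evidence for r in ratios]


def binding_probability(seq, pwm, background, prior=0.5):
    """Posterior probability that the sequence contains a binding site."""
    return sum(position_posteriors(seq, pwm, background, prior))


if __name__ == "__main__":
    for promoter in ("TTTTACGTTTT", "TTTTTTTTTTT"):
        p = binding_probability(promoter, PWM, BACKGROUND)
        print(promoter, round(p, 4))
```

Unlike a scanning method, which reports only whether some window exceeds a score threshold, this sketch returns a probability that can be passed on to downstream probabilistic models. Additional evidence such as conservation could, under the same assumptions, be folded in by replacing the uniform position prior with a position-specific one.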

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. An illustration of four different binding site configurations for a TF that is associated with two motif models (blue and green boxes).
The diagram illustrates the upstream promoter region for a gene, where the direction of transcription is indicated by the direction of the arrows. The arrows are located at the transcription start sites.
Figure 2. The standard ROC curves for the basic likelihood-based method with varying Markovian background model orders, d∈{0,1,…,4}.
Figure 3. ROC curves for the likelihood and Bayesian probabilistic methods with varying prior strengths.
(a) M = 50. (b) M = 100. Results for (c) likelihood-based and (d) Bayesian methods for various values of M.
Figure 4. Histograms of the estimated binding probabilities for the likelihood and Bayesian methods with varying prior strengths.
(a) Likelihood M = 50. (b) Bayesian M = 50. (c) Likelihood M = 100. (d) Bayesian M = 100. x-axes correspond to the estimated binding probability and y-axes show the fraction of negative (blue) and positive (red) test cases. Histogram bin edges are located at i/10, i = 0,1,…,10, although the two histograms are shown side by side.
Figure 5. ROC curves for the likelihood-based probabilistic method (red), traditional scanning (blue), and a probabilistic scanning-based method that outputs a probability of binding (green).
The background model order is (a) d = 0 and (b) d = 1.
Figure 6. An illustration of the data fusion for TF binding prediction.
(a) Annotated binding sites for SRF on Actc1 promoter. (b) Annotated binding site for SRF on M23768 promoter. (c) Annotated binding site for SP1 on Myod1 promoter. (d) Annotated binding site for TEAD1 on Myh6 promoter. Figure keys are as follows. θ(i): motif models for each TF; Conserv.: sequence conservation probabilities computed by PhastCons; Nuc. pos.: nucleosome occupancy probabilities estimated by a yeast nucleosome model; and Reg. pot.: regulatory potential log-likelihood scores. The additional evidence values range between 0 and 1. Promoter sequence lengths are 2000 base pairs in (a), (c) and (d), and 500 base pairs in (b). See text for more details.
Figure 7. (a) ROC curves for the likelihood-based method (blue) when combined with a single additional information source: regulatory potential (red), and evolutionary conservation (green).
Histograms of the estimated binding probabilities for the likelihood-based method when combined with (b) regulatory potential and (c) evolutionary conservation.
Figure 8. (a) ROC curve for the likelihood-based method (blue) when combined with evolutionary conservation (green), regulatory potential (cyan), and a combination of evolutionary conservation and regulatory potential (red).
(b) Histogram of the estimated binding probabilities for a combination of conservation and regulatory potential.
Figure 9. ROC curves for the traditional scanning (green), traditional scanning combined with thresholded conservation information (blue), probabilistic method combined with conservation information (red), and probabilistic method (cyan).
Figure 10. (a) ROC curves for combinatorial regulation using the Bayesian method (blue) and a naive likelihood approximation (green).
Histogram of combinatorial regulation probabilities for (b) the Bayesian method and (c) naive likelihood approximation.
Figure 11. (a) ROC curves for combinatorial regulation using the Bayesian method with evolutionary conservation (blue) and a naive likelihood approximation with evolutionary conservation (green).
Histogram of combinatorial regulation probabilities for (b) the Bayesian method with evolutionary conservation and (c) a naive likelihood approximation with evolutionary conservation.
Figure 12. ROC curves for the likelihood-based method (blue) when both strands of the DNA are used and a single additional information source is available: regulatory potential (red) and evolutionary conservation (green).
Figure 13. Estimated binding probabilities at single-base-pair resolution for SRF on (a) the Actc1 and (c) M23768 promoters, (e) SP1 on the Myod1 promoter, and (g) TEAD1 on the Myh6 promoter without any additional information.
Subplots (b), (d), (f) and (h) show the same results but with evolutionary conservation as the additional data source. The blue and red graphs indicate the start of the binding sites. The annotated binding sites are shown with gray vertical bars. These results correspond to Figure 6.
Figure 14. Histogram of (a) the estimated binding probabilities, (b) maximum a posteriori (MAP) number of binding sites, and (c) the expected number of binding sites over all 5.4 million TF-promoter pairs.
The histogram frequency at bin value 10 in panel (b) (resp. value about 5 in panel (c)) includes all values that exceed 10 (resp. 5).
Figure 15. Histogram of the estimated average binding probabilities over (a) different TFs and (b) promoter sequences.
