PLoS One. 2008 Mar 26;3(3):e1820.

doi: 10.1371/journal.pone.0001820.

Probabilistic inference of transcription factor binding from multiple data sources

Harri Lähdesmäki¹, Alistair G Rust, Ilya Shmulevich

PMID: 18364997
PMCID: PMC2268002
DOI: 10.1371/journal.pone.0001820

Harri Lähdesmäki et al. PLoS One. 2008.

Affiliation

¹ Institute for Systems Biology, Seattle, Washington, United States of America.

Abstract

An important problem in molecular biology is to build a complete understanding of transcriptional regulatory processes in the cell. We have developed a flexible, probabilistic framework to predict transcription factor (TF) binding from multiple data sources that differs from the standard hypothesis testing (scanning) methods in several ways. Our probabilistic modeling framework estimates the probability of binding and thus naturally reflects our degree of belief in binding. Probabilistic modeling also allows for easy and systematic integration of our binding predictions into other probabilistic modeling methods, such as expression-based gene network inference. The method answers the question of whether the whole analyzed promoter has a binding site, but it can also be extended to estimate the binding probability at each nucleotide position. Further, we introduce an extension to model combinatorial regulation by several TFs. Most importantly, the proposed methods can make principled probabilistic inference from multiple evidence sources, such as multiple statistical models (motifs) of the TFs, evolutionary conservation, regulatory potential, CpG islands, nucleosome positioning, DNase hypersensitive sites, ChIP-chip binding segments, and other (prior) sequence-based biological knowledge. We developed both a likelihood and a Bayesian method, where the latter is implemented with a Markov chain Monte Carlo algorithm. Results on a carefully constructed test set from the mouse genome demonstrate that principled data fusion can significantly improve the performance of TF binding prediction methods. We also applied the probabilistic modeling framework to all promoters in the mouse genome, and the results indicate a sparse connectivity between transcriptional regulators and their target promoters. To facilitate analysis of other sequences and additional data, we have developed an online web tool, ProbTF, which implements our probabilistic TF binding prediction method using multiple data sources.
The test data set, web tool, source code, and supplementary data are available at: http://www.probtf.org.
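
To make the idea of "probability of binding" concrete, the following is a minimal sketch and not the authors' ProbTF implementation. It assumes a deliberately simplified model that the abstract does not specify: at most one binding site per sequence, forward strand only, a zero-order background, and a fixed prior probability of binding. The names `binding_probability`, `consensus_pwm`, `prior_bind`, and `site_weights` are hypothetical. The optional `site_weights` argument illustrates one way a position-specific evidence track (for example, conservation scores) could act as a prior over site locations.

```python
import math


def binding_probability(seq, pwm, background, prior_bind=0.5, site_weights=None):
    """Posterior probability that `seq` contains one motif instance.

    Illustrative assumptions (not taken from the paper): at most one site,
    forward strand only, zero-order background, 0 < prior_bind < 1, and
    strictly positive PWM and background probabilities.
    """
    width = len(pwm)
    n_starts = len(seq) - width + 1
    if n_starts <= 0:
        raise ValueError("sequence is shorter than the motif")
    if site_weights is None:
        site_weights = [1.0] * n_starts  # uniform prior over start positions
    if len(site_weights) != n_starts:
        raise ValueError("need one weight per candidate start position")
    total = float(sum(site_weights))
    if total <= 0:
        raise ValueError("site weights must sum to a positive value")

    # Likelihood ratio P(seq | one site) / P(seq | no site). Background terms
    # outside the site cancel, leaving a weighted sum of per-window ratios.
    ratio = 0.0
    for start in range(n_starts):
        log_lr = sum(
            math.log(pwm[k][seq[start + k]] / background[seq[start + k]])
            for k in range(width)
        )
        ratio += (site_weights[start] / total) * math.exp(log_lr)

    # Bayes' rule in odds form.
    odds = prior_bind / (1.0 - prior_bind) * ratio
    return odds / (1.0 + odds)


def consensus_pwm(consensus, match=0.97):
    """Toy PWM that prefers `consensus`; invented for this example."""
    miss = (1.0 - match) / 3.0
    return [{b: (match if b == c else miss) for b in "ACGT"} for c in consensus]


toy_pwm = consensus_pwm("TATA")
uniform_bg = {b: 0.25 for b in "ACGT"}

p_site = binding_probability("GGTATAGG", toy_pwm, uniform_bg)
p_none = binding_probability("GGGGGGGG", toy_pwm, uniform_bg)
# Hypothetical evidence track that puts all prior mass on start position 2.
p_focused = binding_probability(
    "GGTATAGG", toy_pwm, uniform_bg, site_weights=[0, 0, 1, 0, 0]
)
print(p_site, p_none, p_focused)
```

In this toy setting the sequence containing the consensus scores far higher than the one without it, and concentrating the positional prior on the true site raises the probability further. A fuller treatment along the lines the abstract describes would allow multiple sites, several motif models per TF, and higher-order Markov backgrounds.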

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Figure 1. An illustration of four different binding site configurations for a TF that is associated with two motif models (blue and green boxes).**
The diagram illustrates the upstream promoter region for a gene, where the direction of transcription is indicated by the direction of the arrows. The arrows are located at the transcription start sites.

**Figure 2. The standard ROC curves for the basic likelihood-based method with varying Markovian background model orders, d∈{0,1,…,4}.**

**Figure 3. ROC curves for the likelihood and Bayesian probabilistic methods with varying prior strengths.**
(a) M = 50. (b) M = 100. Results for (c) likelihood-based and (d) Bayesian methods for various values of M.

**Figure 4. Histograms of the estimated binding probabilities for the likelihood and Bayesian methods with varying prior strengths.**
(a) Likelihood M = 50. (b) Bayesian M = 50. (c) Likelihood M = 100. (d) Bayesian M = 100. x-axes correspond to the estimated binding probability and y-axes show the fraction of negative (blue) and positive (red) test cases. Histogram bin edges are located at i/10, i = 0,1,…,10, although the two histograms are shown side by side.

**Figure 5. ROC curves for the likelihood-based probabilistic method (red), traditional scanning (blue), and a probabilistic scanning-based method that outputs a probability of binding (green).**
The background model order is (a) d = 0 and (b) d = 1.

**Figure 6. An illustration of the data fusion for TF binding prediction.**
(a) Annotated binding sites for SRF on Actc1 promoter. (b) Annotated binding site for SRF on M23768 promoter. (c) Annotated binding site for SP1 on Myod1 promoter. (d) Annotated binding site for TEAD1 on Myh6 promoter. Figure keys are as follows. θ⁽ⁱ⁾: motif models for each TF; Conserv.: sequence conservation probabilities computed by PhastCons; Nuc. pos.: nucleosome occupancy probabilities estimated by a yeast nucleosome model; and Reg. pot.: regulatory potential log-likelihood scores. The additional evidence values range between 0 and 1. Promoter sequence lengths are 2000 base pairs in (a), (c) and (d), and 500 base pairs in (b). See text for more details.

**Figure 7. (a) ROC curves for the likelihood-based method (blue) when combined with a single additional information source: regulatory potential (red), and evolutionary conservation (green).**
Histograms of the estimated binding probabilities for the likelihood-based method when combined with (b) regulatory potential and (c) evolutionary conservation.

**Figure 8. (a) ROC curve for the likelihood-based method (blue) when combined with evolutionary conservation (green), regulatory potential (cyan), and a combination of evolutionary conservation and regulatory potential (red).**
(b) Histogram of the estimated binding probabilities for a combination of conservation and regulatory potential.

**Figure 9. ROC curves for the traditional scanning (green), traditional scanning combined with thresholded conservation information (blue), probabilistic method combined with conservation information (red), and probabilistic method (cyan).**

**Figure 10. (a) ROC curves for combinatorial regulation using the Bayesian method (blue) and a naive likelihood approximation (green).**
Histogram of combinatorial regulation probabilities for (b) the Bayesian method and (c) naive likelihood approximation.

**Figure 11. (a) ROC curves for combinatorial regulation using the Bayesian method with evolutionary conservation (blue) and a naive likelihood approximation with evolutionary conservation (green).**
Histogram of combinatorial regulation probabilities for (b) the Bayesian method with evolutionary conservation and (c) a naive likelihood approximation with evolutionary conservation.

**Figure 12. ROC curves for the likelihood-based method (blue) when both strands of the DNA are used and a single additional information source is available: regulatory potential (red) and evolutionary conservation (green).**

**Figure 13. Estimated binding probabilities at single-base-pair resolution for SRF on (a) the Actc1 and (c) M23768 promoters, (e) SP1 on the Myod1 promoter, and (g) TEAD1 on the Myh6 promoter without any additional information.**
Subplots (b), (d), (f) and (h) show the same results but with evolutionary conservation as the additional data source. The blue and red graphs indicate the start of the binding sites. The annotated binding sites are shown with gray vertical bars. These results correspond to Figure 6.

**Figure 14. Histogram of (a) the estimated binding probabilities, (b) maximum a posteriori (MAP) number of binding sites, and (c) the expected number of binding sites over all 5.4 million TF-promoter pairs.**
In panel (b), the histogram frequency at bin value 10 includes all values that exceed 10; in panel (c), the frequency at bin value about 5 includes all values that exceed 5.

**Figure 15. Histogram of the estimated average binding probabilities over (a) different TFs and (b) promoter sequences.**

Cited by

An integrative computational systems biology approach identifies differentially regulated dynamic transcriptome signatures which drive the initiation of human T helper cell differentiation.
Aijö T, Edelman SM, Lönnberg T, Larjo A, Kallionpää H, Tuomela S, Engström E, Lahesmaa R, Lähdesmäki H. BMC Genomics. 2012 Oct 30;13:572. doi: 10.1186/1471-2164-13-572. PMID: 23110343.
Mechanisms and evolution of control logic in prokaryotic transcriptional regulation.
van Hijum SA, Medema MH, Kuipers OP. Microbiol Mol Biol Rev. 2009 Sep;73(3):481-509, Table of Contents. doi: 10.1128/MMBR.00037-08. PMID: 19721087. Review.
A protein-protein interaction guided method for competitive transcription factor binding improves target predictions.
Laurila K, Yli-Harja O, Lähdesmäki H. Nucleic Acids Res. 2009 Dec;37(22):e146. doi: 10.1093/nar/gkp789. PMID: 19786498.
Epigenetic priors for identifying active transcription factor binding sites.
Cuellar-Partida G, Buske FA, McLeay RC, Whitington T, Noble WS, Bailey TL. Bioinformatics. 2012 Jan 1;28(1):56-62. doi: 10.1093/bioinformatics/btr614. Epub 2011 Nov 8. PMID: 22072382.
Increasing coverage of transcription factor position weight matrices through domain-level homology.
Bernard B, Thorsson V, Rovira H, Shmulevich I. PLoS One. 2012;7(8):e42779. doi: 10.1371/journal.pone.0042779. Epub 2012 Aug 27. PMID: 22952610.
