Integrating genomic data to predict transcription factor binding
- PMID: 16362910
Abstract
Transcription factor binding sites (TFBS) in gene promoter regions are often predicted by using position-specific scoring matrices (PSSMs), which summarize sequence patterns of experimentally determined TF binding sites. Although PSSMs are more reliable than simple consensus string matching in predicting a true binding site, they generally result in high numbers of false positive hits. This study attempts to reduce the number of false positive matches and generate new predictions by integrating various types of genomic data by two methods: a Bayesian allocation procedure and support vector machine classification. Several types of evidence are explored to strengthen the prediction of a true TFBS in the Saccharomyces cerevisiae genome: binding site degeneracy, binding site conservation, phylogenetic profiling, TF binding site clustering, gene expression profiles, GO functional annotation, and k-mer counts in promoter regions. Binding site degeneracy (or redundancy) refers to the number of times a particular transcription factor's binding motif is discovered in the upstream region of a gene. Phylogenetic conservation takes into account the number of orthologous upstream regions in other genomes that contain a particular binding site. Phylogenetic profiling refers to the presence or absence of a gene across a large set of genomes. Binding site clusters are statistically significant clusters of TF binding sites detected by the algorithm ClusterBuster. Gene expression rests on the idea that when the expression profiles of a transcription factor and a potential target gene are correlated, the gene is more likely to be a genuine target. Also, genes with highly correlated expression profiles are often regulated by the same TF(s). The GO annotation data take advantage of the idea that common transcription targets often have related functions.
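As a rough illustration of PSSM scanning and the binding site degeneracy count described above, the following is a minimal sketch and not the study's implementation. It assumes a log-odds matrix with a uniform background and a pseudocount of 1; the function names, the toy sites, the toy promoter, and the score threshold are all hypothetical and chosen only for demonstration.

```python
import math


def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PSSM from equal-length aligned sites.

    Assumptions (not from the abstract): uniform background frequency
    and a fixed pseudocount added to every base count.
    """
    length = len(sites[0])
    pssm = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        total = len(column) + 4 * pseudocount
        pssm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pssm


def score_window(pssm, window):
    """Sum the per-position log-odds scores for one sequence window."""
    return sum(column[base] for column, base in zip(pssm, window))


def count_hits(pssm, promoter, threshold):
    """Degeneracy: number of windows scoring at or above the threshold."""
    width = len(pssm)
    return sum(
        1
        for i in range(len(promoter) - width + 1)
        if score_window(pssm, promoter[i:i + width]) >= threshold
    )


# Made-up aligned sites and promoter, for illustration only.
toy_sites = ["TGACTC", "TGACTC", "TGAGTC", "TGACTA"]
toy_pssm = build_pssm(toy_sites)
toy_promoter = "AAATGACTCGGGTGACTCAAA"
hits = count_hits(toy_pssm, toy_promoter, threshold=6.0)
print(hits)
```

In this sketch the hit count for a promoter plays the role of the degeneracy feature; the threshold controls the trade-off between missed sites and false positive hits.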
Finally, the distribution of the counts of all k-mers of length 4, 5, and 6 in a gene's promoter region was examined as a means to predict TF binding. In each case the data are compared to known true positives taken from ChIP-chip data, Transfac, and the Saccharomyces Genome Database. First, degeneracy, conservation, expression, and binding site clusters were examined independently and in combination via Bayesian allocation. Then, binding sites were predicted with a support vector machine (SVM) using all methods alone and in combination. The SVM works best when all genomic data are combined, but can also identify which methods contribute the most to accurate classification. On average, the SVM can classify binding sites with high sensitivity and an accuracy of almost 80%.
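The k-mer feature mentioned above can be sketched as follows. This is an illustrative sketch under assumptions, not the study's code: it counts overlapping k-mers on a single strand, orders features lexicographically over the alphabet ACGT, and uses hypothetical function names and a made-up toy sequence. A vector like this could serve as one input to a classifier such as an SVM.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence, k):
    """Count overlapping k-mers of length k on the given strand."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_feature_vector(sequence, ks=(4, 5, 6)):
    """Concatenate counts of every possible k-mer for each k in ks.

    Feature order is an assumption: for each k, k-mers are listed in
    lexicographic order over the alphabet ACGT.
    """
    features = []
    for k in ks:
        counts = kmer_counts(sequence, k)
        for kmer in map("".join, product("ACGT", repeat=k)):
            features.append(counts[kmer])
    return features


# Made-up sequence, for illustration only.
toy_sequence = "ACGTACGTAC"
vector = kmer_feature_vector(toy_sequence)
print(len(vector))  # 4**4 + 4**5 + 4**6 features
```

For k = 4, 5, and 6 the vector has 256 + 1024 + 4096 = 5376 entries, most of them zero for any single promoter.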