Predicting gene expression from sequence: a reexamination

Yuan Yuan¹, Lei Guo, Lei Shen, Jun S Liu

PMID: 18052544
PMCID: PMC2098866
DOI: 10.1371/journal.pcbi.0030243

Comparative Study

PLoS Comput Biol. 2007 Nov;3(11):e243.

Affiliation

¹ Department of Statistics, Harvard University, Cambridge, Massachusetts, United States of America.

Abstract

Although much of the information governing gene expression is encoded in the genome, deciphering it has proved very challenging. We reexamined Beer and Tavazoie's (BT) approach to predicting the mRNA expression patterns of 2,587 genes in Saccharomyces cerevisiae from the information in their respective promoter sequences. Instead of fitting complex Bayesian network models, we trained naïve Bayes classifiers using only the sequence-motif matching scores provided by BT. Our simple models correctly predict expression patterns for 79% of the genes, based on the same criterion and the same cross-validation (CV) procedure as BT, which compares favorably with BT's 73% accuracy. The fact that our approach achieved higher prediction accuracy without using the position and orientation information of the predicted binding sites motivated us to investigate a few biological predictions made by BT. We found that some of their predictions, especially those related to motif orientations and positions, are at best circumstantial. For example, the combinatorial rules suggested by BT for the PAC and RRPE motifs are not unique to the cluster of genes from which the predictive model was inferred, and there are simpler rules that are statistically more significant than BT's. We also show that the CV procedure used by BT to estimate their method's prediction accuracy is inappropriate and may have overestimated that accuracy by about 10%.

Conflict of interest statement

Competing interests. The authors have declared that no competing interests exist.

Figures

**Figure 1. Training and Test Set Classification Accuracy for Naïve Bayes Method Using Motif Scores Only**
Classification accuracies for the training sets increase with the number of top motifs selected in the models, while test set accuracies increase only when model sizes are small. Including too many features overfits the training set and thus decreases the test set accuracies. 100 random repeats of 5-fold CV were performed, and the curves display the mean accuracies. The error bars denote the maximum and minimum accuracies achieved in the 100 random repeats.

**Figure 2. Motif Logos of M198 and RAP1**
These two TFBMs are very similar, except that M198 is one position longer than RAP1 at the right end. Compared to RAP1, M198 distinguishes genes in cluster 1 from other genes with higher statistical significance, without using any position or orientation constraints.
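
The experiment summarized in the Figure 1 caption (naïve Bayes on the top-ranked motif scores, evaluated by 5-fold CV) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' code or data: the motif scores are synthetic (`make_data`), the classifier is a Gaussian naïve Bayes with two classes, motifs are ranked by the absolute difference in class means (`rank_motifs`), and the ranking is recomputed inside each fold. The sketch runs one random 5-fold split per model size rather than the 100 repeats reported in the caption.

```python
import math
import random
from statistics import mean, pvariance


def make_data(n_genes=200, n_motifs=30, n_informative=3, seed=0):
    """Synthetic stand-in for motif scores; only the first few motifs carry signal."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_genes):
        label = rng.randint(0, 1)
        X.append([rng.gauss(1.5 * label if j < n_informative else 0.0, 1.0)
                  for j in range(n_motifs)])
        y.append(label)
    return X, y


def rank_motifs(X, y):
    """Rank motifs by absolute difference in class means (an assumed criterion)."""
    def score(j):
        pos = [x[j] for x, c in zip(X, y) if c == 1]
        neg = [x[j] for x, c in zip(X, y) if c == 0]
        return abs(mean(pos) - mean(neg))
    return sorted(range(len(X[0])), key=score, reverse=True)


def fit_nb(X, y, feats):
    """Per class: log prior plus a Gaussian (mean, variance) for each selected motif."""
    model = {}
    for c in set(y):
        rows = [x for x, lab in zip(X, y) if lab == c]
        stats = []
        for j in feats:
            col = [r[j] for r in rows]
            stats.append((mean(col), pvariance(col) + 1e-6))  # small floor avoids zero variance
        model[c] = (math.log(len(rows) / len(X)), stats)
    return model


def predict_nb(model, x, feats):
    """Return the class with the highest log posterior under feature independence."""
    def log_post(c):
        log_prior, stats = model[c]
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * v) - (x[j] - m) ** 2 / (2 * v)
            for j, (m, v) in zip(feats, stats))
    return max(model, key=log_post)


def accuracy(model, X, y, feats):
    return mean(1.0 if predict_nb(model, x, feats) == c else 0.0
                for x, c in zip(X, y))


def cv_accuracy(X, y, k_top, n_folds=5, seed=0):
    """One random k-fold CV; motif selection is redone on each training fold."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    train_acc, test_acc = [], []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        Xtr, ytr = [X[i] for i in train], [y[i] for i in train]
        Xte, yte = [X[i] for i in fold], [y[i] for i in fold]
        feats = rank_motifs(Xtr, ytr)[:k_top]  # uses training genes only
        model = fit_nb(Xtr, ytr, feats)
        train_acc.append(accuracy(model, Xtr, ytr, feats))
        test_acc.append(accuracy(model, Xte, yte, feats))
    return mean(train_acc), mean(test_acc)


X, y = make_data()
results = {k: cv_accuracy(X, y, k) for k in (1, 3, 10, 30)}
for k, (tr, te) in results.items():
    print(f"top {k:2d} motifs: train={tr:.3f} test={te:.3f}")
```

Comparing the printed training and test accuracies across model sizes gives a rough analogue of the two curves described in the caption. Selecting motifs inside each fold is a design choice of this sketch, made so that held-out genes do not influence the feature ranking; it is not a description of the procedure used by BT or by the authors.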
