Variations on probabilistic suffix trees: statistical modeling and prediction of protein families
- PMID: 11222260
- DOI: 10.1093/bioinformatics/17.1.23
Abstract
Motivation: We present a method for modeling protein families by means of probabilistic suffix trees (PSTs). The method is based on identifying significant patterns in a set of related protein sequences. The patterns can be of arbitrary length, and the input sequences need not be aligned, nor must domain boundaries be delineated. The method is automatic and can be applied with surprising success without assuming any preliminary biological information. Basic biological considerations, such as amino acid background probabilities and amino acid substitution probabilities, can be incorporated to improve performance.
Results: The PST can serve as a predictive tool for protein sequence classification and for detecting conserved patterns (possibly functionally or structurally important) within protein sequences. The method was tested on the Pfam database of protein families with more than satisfactory performance. Exhaustive evaluations show that the PST model detects many more related sequences than pairwise methods such as Gapped-BLAST and is almost as sensitive as a hidden Markov model trained from a multiple alignment of the input sequences, while being much faster.
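To make the idea of variable-length contexts concrete, the following is a minimal sketch of a PST-like variable-order Markov model in Python. It is not the authors' algorithm: the function names, the count-based pruning rule (`min_count`), the fixed `pseudocount` smoothing, and the default `max_depth` are all assumptions made for illustration. The abstract itself does not specify the pruning criteria or the smoothing scheme, so the paper's method may differ on both.

```python
import math
from collections import defaultdict

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def train_pst(sequences, max_depth=3, min_count=2):
    """Count next-symbol occurrences after every context of length 0..max_depth.

    Returns a dict mapping context string -> {next symbol: count}.
    Assumption: a context is kept only if it was seen at least min_count times.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i, symbol in enumerate(seq):
            for depth in range(min(max_depth, i) + 1):
                context = seq[i - depth:i]  # the depth symbols preceding position i
                counts[context][symbol] += 1
    pst = {
        ctx: dict(nxt)
        for ctx, nxt in counts.items()
        if ctx == "" or sum(nxt.values()) >= min_count
    }
    pst.setdefault("", {})  # the empty context (root) must always exist
    return pst


def next_symbol_prob(pst, history, symbol, max_depth=3, pseudocount=0.5):
    """Probability of symbol given the longest suffix of history stored in the tree."""
    context = ""
    for depth in range(min(max_depth, len(history)), 0, -1):
        candidate = history[len(history) - depth:]
        if candidate in pst:
            context = candidate
            break
    nxt = pst[context]
    total = sum(nxt.values()) + pseudocount * len(ALPHABET)
    return (nxt.get(symbol, 0) + pseudocount) / total


def log_likelihood_per_residue(pst, seq, max_depth=3):
    """Average log-probability per residue; higher suggests a better fit to the family."""
    if not seq:
        return 0.0
    total = sum(
        math.log(next_symbol_prob(pst, seq[:i], seq[i], max_depth))
        for i in range(len(seq))
    )
    return total / len(seq)
```

In this sketch, a query sequence would be classified by comparing its per-residue score against a threshold chosen on held-out data. No alignment of the training sequences is needed, since the model only counts which residues follow which short contexts.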