PSI-BLAST pseudocounts and the minimum description length principle
- PMID: 19088134
- PMCID: PMC2647318
- DOI: 10.1093/nar/gkn981
Abstract
Position-specific score matrices (PSSMs) are derived from multiple sequence alignments to aid in the recognition of distant protein sequence relationships. The PSI-BLAST protein database search program derives the column scores of its PSSMs with the aid of pseudocounts, which are added to the observed amino acid counts in a multiple alignment column. In the absence of theory, the number of pseudocounts used has been a completely empirical parameter. This article argues that the minimum description length principle can motivate the choice of this parameter. Specifically, for realistic alignments, the principle supports the practice of using a number of pseudocounts essentially independent of alignment size. However, it also implies that more highly conserved columns should use fewer pseudocounts, increasing the inter-column contrast of the implied PSSMs. A new method for calculating pseudocounts that significantly improves PSI-BLAST's retrieval accuracy is now employed by default.
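The effect of the pseudocount total on a column's scores can be illustrated with a minimal sketch. It is not PSI-BLAST's actual procedure: it assumes a toy four-residue alphabet, uniform background frequencies, and pseudocounts distributed in proportion to the background, whereas PSI-BLAST derives its pseudocounts from the observed residues and substitution-matrix target frequencies. The function name `column_scores` and all parameter names are hypothetical.

```python
import math


def column_scores(counts, background, pseudocounts):
    """Log-odds scores (in bits) for a single alignment column.

    counts       -- observed residue counts, e.g. {"L": 10}
    background   -- assumed background probabilities (should sum to 1)
    pseudocounts -- total pseudocount weight, spread by background (assumption)
    """
    if pseudocounts <= 0:
        raise ValueError("pseudocounts must be positive to avoid log(0)")
    n = sum(counts.values())
    scores = {}
    for residue, p in background.items():
        # Estimated frequency: observed counts plus background-weighted pseudocounts.
        q = (counts.get(residue, 0) + pseudocounts * p) / (n + pseudocounts)
        scores[residue] = math.log2(q / p)
    return scores


# Toy alphabet and uniform background (illustrative values only).
background = {"L": 0.25, "V": 0.25, "K": 0.25, "D": 0.25}
conserved = {"L": 10}  # a fully conserved column of 10 sequences

many = column_scores(conserved, background, pseudocounts=10)
few = column_scores(conserved, background, pseudocounts=1)

for residue in background:
    print(f"{residue}: many={many[residue]:+.2f}  few={few[residue]:+.2f}")
```

With fewer pseudocounts, the conserved residue scores higher and the unobserved residues score lower, which is the kind of increased contrast the abstract describes for highly conserved columns.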