Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements
- PMID: 11452024
- PMCID: PMC55814
- DOI: 10.1093/nar/29.14.2994
Abstract
PSI-BLAST is an iterative program to search a database for proteins with distant similarity to a query sequence. We investigated over a dozen modifications to the methods used in PSI-BLAST, with the goal of improving accuracy in finding true positive matches. To evaluate performance we used a set of 103 queries for which the true positives in yeast had been annotated by human experts, and a popular measure of retrieval accuracy (ROC) that can be normalized to take on values between 0 (worst) and 1 (best). The modifications we consider novel improve the ROC score from 0.758 ± 0.005 to 0.895 ± 0.003. This does not include the benefits from four modifications we included in the 'baseline' version, even though they were not implemented in PSI-BLAST version 2.0. The improvement in accuracy was confirmed on a small second test set. This test involved analyzing three protein families with curated lists of true positives from the non-redundant protein database. The modification that accounts for the majority of the improvement is the use, for each database sequence, of a position-specific scoring system tuned to that sequence's amino acid composition. The use of composition-based statistics is particularly beneficial for large-scale automated applications of PSI-BLAST.
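The normalized ROC measure mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the exact variant used, so the code below assumes the common truncated form (often written ROC_n): for each of the first n false positives in a ranked hit list, count the true positives ranked above it, sum those counts, and divide by n times the total number of true positives. The function name `roc_n`, its parameters, and the handling of lists with fewer than n false positives are illustrative assumptions, not details taken from the paper.

```python
from typing import Sequence


def roc_n(ranked_labels: Sequence[bool], total_true: int, n: int) -> float:
    """Truncated ROC score in [0, 1] for one ranked hit list (hypothetical helper).

    ranked_labels: hits sorted best-first; True marks a true positive.
    total_true:    number of true positives known to exist for the query.
    n:             number of false positives at which the ranking is cut off.
    """
    if total_true <= 0 or n <= 0:
        raise ValueError("total_true and n must be positive")

    true_seen = 0   # true positives encountered so far
    false_seen = 0  # false positives encountered so far
    area = 0        # running sum of true positives above each false positive

    for is_true in ranked_labels:
        if is_true:
            true_seen += 1
        else:
            false_seen += 1
            area += true_seen
            if false_seen == n:
                break

    # Assumption: if the list holds fewer than n false positives, the missing
    # ones are treated as ranked below every reported hit.
    area += (n - false_seen) * true_seen

    return area / (n * total_true)


if __name__ == "__main__":
    # Two true hits, a false hit, one more true hit, then a second false hit;
    # four true positives exist in total, and the list is cut at n = 2.
    hits = [True, True, False, True, False]
    print(roc_n(hits, total_true=4, n=2))
```

A score of 1 means every true positive outranks the first false positive; a score of 0 means no true positive appears before the n-th false positive.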
Similar articles
- IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices. Bioinformatics. 1999 Dec;15(12):1000-11. doi: 10.1093/bioinformatics/15.12.1000. PMID: 10745990
- Simple is beautiful: a straightforward approach to improve the delineation of true and false positives in PSI-BLAST searches. Bioinformatics. 2008 Jun 1;24(11):1339-43. doi: 10.1093/bioinformatics/btn130. Epub 2008 Apr 10. PMID: 18403442
- Composition-based statistics and translated nucleotide searches: improving the TBLASTN module of BLAST. BMC Biol. 2006 Dec 7;4:41. doi: 10.1186/1741-7007-4-41. PMID: 17156431. Free PMC article.
- Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997 Sep 1;25(17):3389-402. doi: 10.1093/nar/25.17.3389. PMID: 9254694. Free PMC article. Review.
- Identifying remote protein homologs by network propagation. FEBS J. 2005 Oct;272(20):5119-28. doi: 10.1111/j.1742-4658.2005.04947.x. PMID: 16218946. Review.
Cited by
- A phylogenetic analysis of the globins in fungi. PLoS One. 2012;7(2):e31856. doi: 10.1371/journal.pone.0031856. Epub 2012 Feb 27. PMID: 22384087. Free PMC article.
- Mycobacteriophage Alexphander Gene 94 Encodes an Essential dsDNA-Binding Protein during Lytic Infection. Int J Mol Sci. 2024 Jul 7;25(13):7466. doi: 10.3390/ijms25137466. PMID: 39000573. Free PMC article.
- Domain enhanced lookup time accelerated BLAST. Biol Direct. 2012 Apr 17;7:12. doi: 10.1186/1745-6150-7-12. PMID: 22510480. Free PMC article.
- Ensemble of Multiple Classifiers for Multilabel Classification of Plant Protein Subcellular Localization. Life (Basel). 2021 Mar 30;11(4):293. doi: 10.3390/life11040293. PMID: 33808227. Free PMC article.
- Beyond BLASTing: tertiary and quaternary structure analysis helps identify major vault proteins. Genome Biol Evol. 2013;5(1):217-32. doi: 10.1093/gbe/evs135. PMID: 23275487. Free PMC article.