. 2009 Feb;37(3):815-24.

doi: 10.1093/nar/gkn981. Epub 2008 Dec 16.

PSI-BLAST pseudocounts and the minimum description length principle

Stephen F Altschul¹, E Michael Gertz, Richa Agarwala, Alejandro A Schäffer, Yi-Kuo Yu

Affiliations

Affiliation

¹ National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health Bethesda, MD 20894, USA. altschul@ncbi.nlm.nih.gov

PMID: 19088134
PMCID: PMC2647318
DOI: 10.1093/nar/gkn981

PSI-BLAST pseudocounts and the minimum description length principle

Stephen F Altschul et al. Nucleic Acids Res. 2009 Feb.

. 2009 Feb;37(3):815-24.

doi: 10.1093/nar/gkn981. Epub 2008 Dec 16.

Authors

Stephen F Altschul¹, E Michael Gertz, Richa Agarwala, Alejandro A Schäffer, Yi-Kuo Yu

Affiliation

¹ National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health Bethesda, MD 20894, USA. altschul@ncbi.nlm.nih.gov

PMID: 19088134
PMCID: PMC2647318
DOI: 10.1093/nar/gkn981

Abstract

Position specific score matrices (PSSMs) are derived from multiple sequence alignments to aid in the recognition of distant protein sequence relationships. The PSI-BLAST protein database search program derives the column scores of its PSSMs with the aid of pseudocounts, added to the observed amino acid counts in a multiple alignment column. In the absence of theory, the number of pseudocounts used has been a completely empirical parameter. This article argues that the minimum description length principle can motivate the choice of this parameter. Specifically, for realistic alignments, the principle supports the practice of using a number of pseudocounts essentially independent of alignment size. However, it also implies that more highly conserved columns should use fewer pseudocounts, increasing the inter-column contrast of the implied PSSMs. A new method for calculating pseudocounts that significantly improves PSI-BLAST's; retrieval accuracy is now employed by default.

PubMed Disclaimer

Figures

**Figure 1.**
Frequency distribution space. (a) The frequencies for the case of three amino acids can be represented by points inside an equilateral triangle. (b) A substitution matrix maps the vertices of the simplex, each of which corresponds to the observation of a single amino acid, to target frequencies in the interior of the simplex. (c) The linear transformation implied by a substitution matrix may be applied to the whole of frequency distribution space, mapping any vector of observed frequencies f to a pseudocount frequency vector g.

**Figure 2.**
Linear transformations of frequency distribution space. A substitution matrix imposes a linear transformation M that maps each observed frequency vector f to a pseudocount vector g, and all of frequency distribution space to the smallest simplex shown. For values of α between 0 and 1, the use of pseudocounts imposes a linear transformation that maps f to a point q on the line between f and g, and the frequency distribution space to the intermediate simplex shown.

formula image — **Figure 2.**
Linear transformations of frequency distribution space. A substitution matrix imposes a linear transformation M that maps each observed frequency vector f to a pseudocount vector g, and all of frequency distribution space to the smallest simplex shown. For values of α between 0 and 1, the use of pseudocounts imposes a linear transformation that maps f to a point q on the line between f and g, and the frequency distribution space to the intermediate simplex shown.

**Figure 3.**
The effect of pseudocounts on the number of independent theories. A theory describing the frequency distribution of three amino acids can be represented as a point within an equilateral triangle. The density of independent theories, which is proportional to (∏p_i)^{− 1/2}, is represented by the density of points within the triangle, and increases as one moves away from its center. Using pseudocounts confines the theories one may consider to points within a simplex inside the frequency distribution space. This simplex has a smaller volume than the complete space, and also has a smaller average density of independent theories.

**Figure 4.**
Decrease in model description length as a result of using pseudocounts implied by the BLOSUM-62 substitution matrix. For large n, one may calculate the decrease in the description length of the model for α between 0 and 1, compared to the description length of the model at . The total decrease can be decomposed into a decrease in simplex volume, and a decrease in independent theory density. Half as many independent theories corresponds to a decrease of one bit. Positive values on the y-axis represent decreases in model description length.

**Figure 5.**
Selecting an optimal proportion of pseudocounts using the MDL principle. For n = 500 and the observed frequencies f listed in Table 3, we apply pseudocounts as implied by the BLOSUM-62 substitution matrix. We use Equation (5) to compute the change in the description length of the data, when compared to the description length of the data at , for α between 0 and 0.1. The dot-dashed curve (in red) shows the increase in the description length of the data. The dashed curve (in blue) shows the decrease in the description length of the model. The total decrease in the description length, shown by the solid curve (in black), is maximized at , which corresponds to 19.4 pseudocounts.

**Figure 6.**
Optimal number of pseudocounts, m, as a function of the number of independent observations, n. Using the data listed in Table 3 and the method illustrated in Figure 5, we found the optimal number of pseudocounts for varying n. The method cannot be valid for n < 212 (vertical dotted line), because the calculated decrease in model description length for is greater than the description length of the model at , but it is not possible for a model to have a negative description length. For n between 212 and 1000, the calculation suggests we use a nearly constant number m of pseudocounts, roughly 19.4. In the limit of very large n, the MDL principle suggests the number of pseudocounts should grow proportionately to n^1/3.

**Figure 7.**
The relationship between the pseudocount proportion α implied by the MDL principle and column relative entropy. Each point represents a multiple alignment column constructed by PSI-BLAST from the aravind103 query set (18,33) run against SWISS-PROT (34). Only columns with independent observations are considered. The x-axis represents the relative entropy D(f′ || p), where f′ is the observed frequency vector of the column after the addition of m₀ = 5.5 pseudocounts, and p is the background amino acid frequency vector implicit in BLOSUM-62. The y-axis represents the pseudocount proportion α calculated from the MDL theory. The upper diagonal line (shown in red) represents the best power-law fit to the data. The lower diagonal line (shown in green) represents the power-law relationship of α to D(f′ || p) that empirically yields the optimal retrieval on the training set. Note that the background frequency vector p is the fixed point of the linear transformation M. Therefore, if f′= p, the increase in the description length of the data is identically zero for all α, implying that the MDL is optimized at . For any finite n, vectors f′ close enough to p also imply an optimal α of 1. A small number of such points are seen at the upper left of this graph.

See this image and copyright information in PMC

References

1. Smith TF, Waterman MS. Identification of common molecular subsequences. J. Mol. Biol. 1981;147:195–197. - PubMed
1. Karlin S, Altschul SF. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl Acad. Sci. USA. 1990;87:2264–2268. - PMC - PubMed
1. Altschul SF. Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol. 1991;219:555–565. - PMC - PubMed
1. Dayhoff MO, Schwartz RM, Orcutt BC. A model of evolutionary change in proteins. In: Dayhoff MO, editor. Atlas of Protein Sequence and Structure. Vol. 5. Washington, DC: National Biomedical Research Foundation; 1978. pp. 345–352.
1. Schwartz RM, Dayhoff MO. Matrices for detecting distant relationships. In: Dayhoff MO, editor. Atlas of Protein Sequence and Structure. Vol. 5. Washington, DC: National Biomedical Research Foundation; 1978. pp. 353–358.

Publication types

Actions

MeSH terms

Actions
Actions
Actions

Grants and funding

Intramural NIH HHS/United States

LinkOut - more resources

Full Text Sources
Research Materials
- NCI CPTC Antibody Characterization Program

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

PSI-BLAST pseudocounts and the minimum description length principle

Affiliation

PSI-BLAST pseudocounts and the minimum description length principle

Authors

Affiliation

Abstract

Figures

References

Publication types

MeSH terms

Grants and funding

LinkOut - more resources

Full Text Sources

Research Materials