Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2009 Feb;37(3):815-24.
doi: 10.1093/nar/gkn981. Epub 2008 Dec 16.

PSI-BLAST pseudocounts and the minimum description length principle

Affiliations

PSI-BLAST pseudocounts and the minimum description length principle

Stephen F Altschul et al. Nucleic Acids Res. 2009 Feb.

Abstract

Position specific score matrices (PSSMs) are derived from multiple sequence alignments to aid in the recognition of distant protein sequence relationships. The PSI-BLAST protein database search program derives the column scores of its PSSMs with the aid of pseudocounts, added to the observed amino acid counts in a multiple alignment column. In the absence of theory, the number of pseudocounts used has been a completely empirical parameter. This article argues that the minimum description length principle can motivate the choice of this parameter. Specifically, for realistic alignments, the principle supports the practice of using a number of pseudocounts essentially independent of alignment size. However, it also implies that more highly conserved columns should use fewer pseudocounts, increasing the inter-column contrast of the implied PSSMs. A new method for calculating pseudocounts that significantly improves PSI-BLAST's; retrieval accuracy is now employed by default.

PubMed Disclaimer

Figures

Figure 1.
Figure 1.
Frequency distribution space. (a) The frequencies for the case of three amino acids can be represented by points inside an equilateral triangle. (b) A substitution matrix maps the vertices of the simplex, each of which corresponds to the observation of a single amino acid, to target frequencies in the interior of the simplex. (c) The linear transformation implied by a substitution matrix may be applied to the whole of frequency distribution space, mapping any vector of observed frequencies f to a pseudocount frequency vector g.
Figure 2.
Figure 2.
Linear transformations of frequency distribution space. A substitution matrix imposes a linear transformation M that maps each observed frequency vector f to a pseudocount vector g, and all of frequency distribution space to the smallest simplex shown. For values of α between 0 and 1, the use of pseudocounts imposes a linear transformation formula image that maps f to a point q on the line between f and g, and the frequency distribution space to the intermediate simplex shown.
Figure 3.
Figure 3.
The effect of pseudocounts on the number of independent theories. A theory describing the frequency distribution of three amino acids can be represented as a point within an equilateral triangle. The density of independent theories, which is proportional to (∏pi)− 1/2, is represented by the density of points within the triangle, and increases as one moves away from its center. Using pseudocounts confines the theories one may consider to points within a simplex inside the frequency distribution space. This simplex has a smaller volume than the complete space, and also has a smaller average density of independent theories.
Figure 4.
Figure 4.
Decrease in model description length as a result of using pseudocounts implied by the BLOSUM-62 substitution matrix. For large n, one may calculate the decrease in the description length of the model for α between 0 and 1, compared to the description length of the model at formula image. The total decrease can be decomposed into a decrease in simplex volume, and a decrease in independent theory density. Half as many independent theories corresponds to a decrease of one bit. Positive values on the y-axis represent decreases in model description length.
Figure 5.
Figure 5.
Selecting an optimal proportion of pseudocounts using the MDL principle. For n = 500 and the observed frequencies f listed in Table 3, we apply pseudocounts as implied by the BLOSUM-62 substitution matrix. We use Equation (5) to compute the change in the description length of the data, when compared to the description length of the data at formula image, for α between 0 and 0.1. The dot-dashed curve (in red) shows the increase in the description length of the data. The dashed curve (in blue) shows the decrease in the description length of the model. The total decrease in the description length, shown by the solid curve (in black), is maximized at formula image, which corresponds to 19.4 pseudocounts.
Figure 6.
Figure 6.
Optimal number of pseudocounts, m, as a function of the number of independent observations, n. Using the data listed in Table 3 and the method illustrated in Figure 5, we found the optimal number of pseudocounts for varying n. The method cannot be valid for n < 212 (vertical dotted line), because the calculated decrease in model description length for formula image is greater than the description length of the model at formula image, but it is not possible for a model to have a negative description length. For n between 212 and 1000, the calculation suggests we use a nearly constant number m of pseudocounts, roughly 19.4. In the limit of very large n, the MDL principle suggests the number of pseudocounts should grow proportionately to n1/3.
Figure 7.
Figure 7.
The relationship between the pseudocount proportion α implied by the MDL principle and column relative entropy. Each point represents a multiple alignment column constructed by PSI-BLAST from the aravind103 query set (18,33) run against SWISS-PROT (34). Only columns with formula image independent observations are considered. The x-axis represents the relative entropy D(f′ || p), where f′ is the observed frequency vector of the column after the addition of m0 = 5.5 pseudocounts, and p is the background amino acid frequency vector implicit in BLOSUM-62. The y-axis represents the pseudocount proportion α calculated from the MDL theory. The upper diagonal line (shown in red) represents the best power-law fit to the data. The lower diagonal line (shown in green) represents the power-law relationship of α to D(f′ || p) that empirically yields the optimal retrieval on the training set. Note that the background frequency vector p is the fixed point of the linear transformation M. Therefore, if f′= p, the increase in the description length of the data is identically zero for all α, implying that the MDL is optimized at formula image. For any finite n, vectors f′ close enough to p also imply an optimal α of 1. A small number of such points are seen at the upper left of this graph.

Similar articles

Cited by

References

    1. Smith TF, Waterman MS. Identification of common molecular subsequences. J. Mol. Biol. 1981;147:195–197. - PubMed
    1. Karlin S, Altschul SF. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl Acad. Sci. USA. 1990;87:2264–2268. - PMC - PubMed
    1. Altschul SF. Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol. 1991;219:555–565. - PMC - PubMed
    1. Dayhoff MO, Schwartz RM, Orcutt BC. A model of evolutionary change in proteins. In: Dayhoff MO, editor. Atlas of Protein Sequence and Structure. Vol. 5. Washington, DC: National Biomedical Research Foundation; 1978. pp. 345–352.
    1. Schwartz RM, Dayhoff MO. Matrices for detecting distant relationships. In: Dayhoff MO, editor. Atlas of Protein Sequence and Structure. Vol. 5. Washington, DC: National Biomedical Research Foundation; 1978. pp. 353–358.

Publication types

MeSH terms