BMC Bioinformatics. 2010 Jan 4;11:4.
doi: 10.1186/1471-2105-11-4.

Amino acid "little Big Bang": representing amino acid substitution matrices as dot products of Euclidian vectors

Karel Zimmermann et al. BMC Bioinformatics.

Abstract

Background: Sequence comparisons make use of a one-letter representation for amino acids, the necessary quantitative information being supplied by the substitution matrices. This paper deals with the problem of finding a representation that provides a comprehensive description of amino acid intrinsic properties consistent with the substitution matrices.

Results: We present a Euclidian vector representation of the amino acids, obtained by the singular value decomposition of the substitution matrices. The substitution matrix entries correspond to the dot product of amino acid vectors. We apply this vector encoding to the study of the relative importance of various amino acid physicochemical properties upon the substitution matrices. We also characterize and compare the PAM and BLOSUM series substitution matrices.
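To make the dot-product idea concrete, here is a minimal sketch, not the authors' implementation. It assumes a made-up symmetric 4x4 matrix for a hypothetical four-letter alphabet (the values are not BLOSUM or PAM entries) and a hypothetical helper `embed`. For a symmetric matrix the singular value decomposition coincides with the eigendecomposition up to signs, so the sketch uses `numpy.linalg.eigh` and keeps only the positive eigenvalues, since dot products of real vectors can only reproduce the positive semi-definite part of a matrix.

```python
import numpy as np

# Toy symmetric "substitution matrix" for a hypothetical 4-letter alphabet.
# The values are invented for illustration; they are NOT BLOSUM or PAM entries.
LETTERS = ["A", "B", "C", "D"]
S = np.array([
    [ 4.0, -1.0,  0.0, -2.0],
    [-1.0,  5.0, -3.0,  1.0],
    [ 0.0, -3.0,  6.0, -1.0],
    [-2.0,  1.0, -1.0,  7.0],
])


def embed(matrix):
    """Return one row vector per letter so that dot products approximate `matrix`.

    Assumption: `matrix` is symmetric. Components tied to non-positive
    eigenvalues are dropped, so the result is in general an approximation.
    """
    eigvals, eigvecs = np.linalg.eigh(matrix)
    keep = eigvals > 1e-12
    # Scale each kept eigenvector by sqrt(eigenvalue): V = U * sqrt(Lambda).
    return eigvecs[:, keep] * np.sqrt(eigvals[keep])


vectors = embed(S)            # shape: (number of letters, kept dimensions)
approx = vectors @ vectors.T  # matrix of pairwise dot products

for letter, vec in zip(LETTERS, vectors):
    print(letter, np.round(vec, 3))
print("max abs error:", np.abs(approx - S).max())
```

This toy matrix happens to be positive definite, so the dot products recover it up to rounding error. A real substitution matrix may have negative eigenvalues, in which case the dot products only approximate the entries.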

Conclusions: This vector encoding introduces a Euclidian metric in the amino acid space, consistent with substitution matrices. Such a numerical description is useful whenever the intrinsic properties of amino acids are needed, for instance when building sequence profiles or finding consensus sequences with machine learning algorithms such as support vector machines and neural networks.
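As one illustration of how a Euclidean encoding could support consensus finding, the sketch below averages the vectors in each alignment column and picks the letter nearest to that mean. This is an assumed, simplified procedure, not the method described in the paper. The 2-D vectors, the three-letter alphabet, and the names `VECTORS`, `consensus_letter` and `consensus_sequence` are all hypothetical.

```python
import numpy as np

# Hypothetical 2-D vectors for a toy 3-letter alphabet (invented values,
# not derived from any real substitution matrix).
VECTORS = {
    "A": np.array([1.0, 0.0]),
    "B": np.array([0.8, 0.6]),
    "C": np.array([-1.0, 0.2]),
}


def consensus_letter(column):
    """Return the letter whose vector is closest to the column's mean vector."""
    mean = np.mean([VECTORS[ch] for ch in column], axis=0)
    return min(VECTORS, key=lambda ch: np.linalg.norm(VECTORS[ch] - mean))


def consensus_sequence(alignment):
    """Build a consensus from equal-length, gap-free aligned sequences."""
    return "".join(consensus_letter(col) for col in zip(*alignment))


alignment = ["ABC", "BBC", "ABA"]
result = consensus_sequence(alignment)
print(result)
```

Because the choice is made in a metric space, a column mixing two similar letters can still resolve to a letter close to both, which a plain majority vote over one-letter codes cannot express.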

Figures

Figure 1
Top panel: the blue curve plots the substitution matrix elements (the 210 elements of the lower triangular BLOSUM62, non-rounded, expressed in bit units) sorted by increasing value; the red curve plots their approximations, obtained as the dot products of the raw, non-centered vectors. Bottom panel: the blue curve is the same as above but with centered matrix elements (i.e., the mean of the shifted BLOSUM62 matrix is zero); the red curve is the approximation computed with the centered vectors, as described in the text. The x-axis corresponds to the sorted 210 lower triangular matrix elements; e.g., the 210th element is the diagonal element for tryptophan, s_WW, the largest element in the BLOSUM62 matrix. The y-axis corresponds to the values of the matrix elements. Notice that the correlation coefficients are very similar in both cases (0.989 for the curves of the top panel vs 0.998 for the curves of the bottom panel).
Figure 2
Plot of the matrix mean (blue), matrix relative entropy (red) and amino acid galaxy radius, Rg (black), for the BLOSUM matrix series (solid lines for rounded matrices, dashed lines for non-rounded matrices). The x-axis corresponds to BLOSUM matrix indices, from 30 to 100 in increments of 5; the y-axis corresponds to the values.
Figure 3
Three-dimensional projection of the (non-rounded) BLOSUM62 amino acid galaxy together with its physicochemical characteristics. Property vectors are projected on the left, bottom and rear faces of the parallelepiped. The values on the X, Y, Z axes correspond to the first 3 components of the 20 amino acid vectors.
Figure 4
Plot of the matrix mean (blue), matrix relative entropy (red) and amino acid galaxy radius, Rg (black), for the PAM matrix series. As explained in the text, the observed lack of monotonicity of the matrix mean and galaxy radius curves is probably due to the use of rounded PAM matrices. The x-axis corresponds to PAM matrix indices, from 10 to 500 in increments of 10; the y-axis corresponds to the values.
