Comparative Study
J Mol Evol. 2013 Oct;77(4):159-69.
doi: 10.1007/s00239-013-9565-0. Epub 2013 Jun 7.

Unearthing the root of amino acid similarity

James D Stephenson et al. J Mol Evol. 2013 Oct.

Abstract

Similarities and differences between amino acids define the rates at which they substitute for one another within protein sequences and the patterns by which these sequences form protein structures. However, there exist many ways to measure similarity, whether one considers the molecular attributes of individual amino acids, the roles that they play within proteins, or some nuanced contribution of each. One popular approach to representing these relationships is to divide the 20 amino acids of the standard genetic code into groups, thereby forming a simplified amino acid alphabet. Here, we develop a method to compare or combine different simplified alphabets, and apply it to 34 simplified alphabets from the scientific literature. We use this method to show that while different suggestions vary and agree in non-intuitive ways, they combine to reveal a consensus view of amino acid similarity that is clearly rooted in physico-chemistry.

Figures

Fig. 1
Simplified amino acid alphabets colored according to the method by which they were derived. Dendrogram derived by least squares from the relative similarities of 34 published simplified amino acid alphabets, labeled by Stephenson. Longer branch lengths indicate lower similarity between two alphabets; colors represent method by which each simplified alphabet was derived as described in Table 1
Fig. 2
Principal components 1 and 2 of the 34 × 34 simplified alphabet similarity matrix colored by derivation method. (a) Simplified alphabets are shown as spheres and labeled according to the alphabet ID numbering in Table 1. (b) Variance contribution of the first five principal components of this analysis
Fig. 3
Consensus amino acid similarity dendrogram from 34 alphabets. Dendrogram constructed by least squares using the similarity data from all 34 simplified amino acid alphabets. Long branches indicate that an amino acid is rarely grouped with any other as part of a simplification scheme. Short path lengths between amino acids suggest high similarity between them
Fig. 4
Amino acid similarity relationships defined by analysis of proteins closely resemble those derived from analysis of individual amino acid chemistry. Dendrograms constructed by least squares using the similarity data from (a) 29 studies which considered amino acid residues within protein sequences and structures, versus (c) 5 simplified alphabets which were derived from individual amino acid physico-chemistry. Long branches indicate that an amino acid is rarely grouped with any other as part of a simplification scheme. Short path lengths between amino acids suggest high similarity between them. Comparing both dendrograms with (b) a redrawn version of a commonly used chemical property Venn diagram, adapted from Livingstone and Barton (1993), uncovers the physico-chemical basis for many of the dendrogram features. The hydrophobic (blue), polar (red), and both hydrophobic and polar (purple) amino acids are colored to highlight this principal basis of organization within each of the dendrograms
Fig. 5
Distance between matrices when considering amino acids within proteins and when considering their individual amino acid physico-chemical properties against a background of randomized matrices. Frequency distribution of inter-matrix distances between the “individual chemistry” matrix calculated in this study and 1,000,000 random matrices (randomizing rows only) generated from real matrix seeds. The distance between the two matrices (Table 2a, b) was 0.1339
Fig. 6
Illustration of the method used to compare simplified amino acid alphabets using a fictional 6-letter alphabet for clarity of example. The groupings described by three simplifications, named studies 1–3, for a fictional 6-letter alphabet are initially described as comma-delimited text (shown above each of the green matrices, left). The contents of the green matrices thus represent each simplified alphabet: within each matrix, a value of 1 indicates that two amino acids are grouped as “similar”; a value of 0 indicates otherwise. The blue matrices are constructed by comparing the green matrices pairwise, element by element. This time, a match between the corresponding cells of two green matrices results in a 1 within the blue matrix (0 represents a mismatch). Summing the matched values from a blue matrix gives an overall similarity value, as shown in the final rows of the “line total” column. These similarity values can be assembled in a similarity matrix, shown in red, which records all pairwise inter-alphabet similarities. In this example, alphabets from studies 1 and 3 are the most similar and from 2 and 3 are the least similar
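
The comparison procedure described in the Fig. 6 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The function names and the three example groupings are hypothetical (they are not the groupings shown in the figure), and counting only the off-diagonal upper-triangle cells is an assumption, since the caption does not say which cells contribute to the total.

```python
from itertools import combinations

LETTERS = "ABCDEF"  # hypothetical 6-letter alphabet, as in the caption's toy setting


def grouping_matrix(spec, letters=LETTERS):
    """Build the binary ("green") matrix for one comma-delimited alphabet.

    A cell is 1 when the two letters fall in the same group, 0 otherwise.
    """
    group_of = {}
    for idx, group in enumerate(spec.split(",")):
        for letter in group.strip():
            group_of[letter] = idx
    return [[1 if group_of[a] == group_of[b] else 0 for b in letters]
            for a in letters]


def similarity(m1, m2):
    """Count matching cells between two green matrices (the "blue" step).

    Assumption: only unordered pairs of distinct letters are counted.
    """
    n = len(m1)
    return sum(1 for i, j in combinations(range(n), 2) if m1[i][j] == m2[i][j])


def similarity_matrix(specs):
    """Assemble all pairwise inter-alphabet similarities (the "red" matrix)."""
    mats = [grouping_matrix(s) for s in specs]
    return [[similarity(a, b) for b in mats] for a in mats]


# Made-up groupings for three illustrative "studies".
studies = ["AB,CD,EF", "AC,BD,EF", "AB,CD,E,F"]
sim = similarity_matrix(studies)
for row in sim:
    print(row)
```

With these made-up groupings, studies 1 and 3 agree on 14 of the 15 letter pairs and studies 2 and 3 on 10, so the ordering happens to mirror the one described in the caption. A matrix of this kind could then serve as input to the clustering or principal component analyses described for Figs. 1 and 2.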

References

    1. Albayrak A, Otu HH, Sezerman UO. Clustering of protein families into functional subtypes using relative complexity measure with reduced amino acid alphabets. BMC Bioinformatics. 2010;11:428. doi: 10.1186/1471-2105-11-428.
    2. Andersen CAF, Brunak S. Representation of protein-sequence information by amino acid subalphabets. AI Magazine. 2004;25:97–104.
    3. Benner SA, Cohen MA, Gonnet GH. Amino acid substitution during functionally divergent evolution of protein sequences. Protein Eng. 1994;7:1323–1332. doi: 10.1093/protein/7.11.1323.
    4. Betts MJ, Russell RB. Amino acid properties and consequences of substitutions. In: Bioinformatics for geneticists. New York: Wiley; 2003.
    5. Cannata N, Toppo S, Romualdi C, Valle G. Simplifying amino acid alphabets by means of a branch and bound algorithm and substitution matrices. Bioinformatics. 2002;18:1102–1108. doi: 10.1093/bioinformatics/18.8.1102.