Comparative Study
Proc Natl Acad Sci U S A. 2005 May 3;102(18):6395-400.
doi: 10.1073/pnas.0408677102. Epub 2005 Apr 25.

Solving the protein sequence metric problem

William R Atchley et al. Proc Natl Acad Sci U S A.

Abstract

Biological sequences are composed of long strings of alphabetic letters rather than arrays of numerical values. Lack of a natural underlying metric for comparing such alphabetic data significantly inhibits sophisticated statistical analyses of sequences, modeling structural and functional aspects of proteins, and related problems. Herein, we use multivariate statistical analyses on almost 500 amino acid attributes to produce a small set of highly interpretable numeric patterns of amino acid variability. These high-dimensional attribute data are summarized by five multidimensional patterns of attribute covariation that reflect polarity, secondary structure, molecular volume, codon diversity, and electrostatic charge. Numerical scores for each amino acid then transform amino acid sequences for statistical analyses. Relationships between transformed data and amino acid substitution matrices show significant associations for polarity and codon diversity scores. Transformed alphabetic data are used in analysis of variance and discriminant analysis to study DNA binding in the basic helix-loop-helix proteins. The transformed scores offer a general solution for analyzing a wide variety of sequence analysis problems.
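The transformation step described in the abstract, replacing each residue letter with its five numeric scores, can be sketched as below. This is a minimal illustration, not the authors' implementation: the function names `encode_sequence` and `factor_profile` are hypothetical, and the table `PLACEHOLDER_SCORES` holds made-up numbers that stand in for the published per-residue factor scores, which are not reproduced on this page.

```python
from typing import Dict, List, Sequence

# Factor order follows the abstract: polarity, secondary structure,
# molecular volume, codon diversity, electrostatic charge.
FACTOR_NAMES = (
    "polarity",
    "secondary_structure",
    "molecular_volume",
    "codon_diversity",
    "electrostatic_charge",
)


def encode_sequence(seq: str, scores: Dict[str, Sequence[float]]) -> List[List[float]]:
    """Turn a residue string into one row of factor scores per position."""
    rows = []
    for pos, aa in enumerate(seq.upper(), start=1):
        if aa not in scores:
            raise KeyError(f"no scores for residue {aa!r} at position {pos}")
        vec = list(scores[aa])
        if len(vec) != len(FACTOR_NAMES):
            raise ValueError(f"residue {aa!r} needs {len(FACTOR_NAMES)} scores")
        rows.append(vec)
    return rows


def factor_profile(encoded: List[List[float]], factor: str) -> List[float]:
    """Extract one factor along the sequence, e.g. as input to an ANOVA."""
    idx = FACTOR_NAMES.index(factor)
    return [row[idx] for row in encoded]


# HYPOTHETICAL placeholder values, NOT the published scores.
# Replace with the real per-residue table before any analysis.
PLACEHOLDER_SCORES = {
    "A": (0.1, 0.2, 0.3, 0.4, 0.5),
    "K": (1.1, 1.2, 1.3, 1.4, 1.5),
    "E": (2.1, 2.2, 2.3, 2.4, 2.5),
}

if __name__ == "__main__":
    encoded = encode_sequence("KEA", PLACEHOLDER_SCORES)
    print(factor_profile(encoded, "polarity"))
```

Once every aligned sequence is encoded this way, each alignment site yields five numeric columns, one per factor, which is the form that analysis of variance and discriminant analysis expect.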

Figures

Fig. 1. Plot of scores on Factors I-III for 20 amino acids.
Fig. 2. Unweighted pair group method with arithmetic mean cluster analysis of distances computed from the scores from Factors I-IV. Factor V was omitted to conserve space.
Fig. 3. Analysis of variance of Factor I scores for amino acid sites 1-13 for five DNA-binding groups in the bHLH of proteins. Circled values do not differ significantly.
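
The cluster analysis named in the Fig. 2 caption (unweighted pair group method with arithmetic mean, UPGMA) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions: Euclidean distance between score vectors is assumed, since the caption does not name the distance measure, the function name `upgma` is hypothetical, and the demo points are toy values rather than real factor scores.

```python
from itertools import combinations
from math import dist


def upgma(labels, points):
    """Cluster points by UPGMA; return (nested-tuple tree, list of merges).

    Each merge is (left_subtree, right_subtree, height), where height is
    half the average distance between the two merged clusters.
    """
    # cluster id -> (subtree, number of leaves)
    clusters = {i: (labels[i], 1) for i in range(len(labels))}
    # Assumption: Euclidean distance between score vectors.
    d = {
        frozenset((i, j)): dist(points[i], points[j])
        for i, j in combinations(range(len(labels)), 2)
    }
    next_id = len(labels)
    merges = []
    while len(clusters) > 1:
        pair = min(d, key=d.get)  # closest pair of clusters
        i, j = sorted(pair)
        height = d[pair] / 2
        (ti, ni), (tj, nj) = clusters.pop(i), clusters.pop(j)
        # Distance from the merged cluster to each remaining cluster is the
        # size-weighted mean, i.e. the average over all leaf pairs.
        new_d = {
            frozenset((next_id, k)): (
                ni * d[frozenset((i, k))] + nj * d[frozenset((j, k))]
            ) / (ni + nj)
            for k in clusters
        }
        d = {key: v for key, v in d.items() if i not in key and j not in key}
        d.update(new_d)
        clusters[next_id] = ((ti, tj), ni + nj)
        merges.append((ti, tj, height))
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree, merges


if __name__ == "__main__":
    # Toy 2-D points, NOT real factor scores.
    tree, merges = upgma(["a", "b", "c"], [(0, 0), (0, 2), (0, 10)])
    print(tree)
    for left, right, height in merges:
        print(left, right, height)
```

To approximate the analysis in the caption, `points` would hold each amino acid's scores on Factors I-IV and `labels` the 20 residue letters.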
