Solving the protein sequence metric problem
- PMID: 15851683
- PMCID: PMC1088356
- DOI: 10.1073/pnas.0408677102
Abstract
Biological sequences are composed of long strings of alphabetic letters rather than arrays of numerical values. Lack of a natural underlying metric for comparing such alphabetic data significantly inhibits sophisticated statistical analyses of sequences, modeling structural and functional aspects of proteins, and related problems. Herein, we use multivariate statistical analyses on almost 500 amino acid attributes to produce a small set of highly interpretable numeric patterns of amino acid variability. These high-dimensional attribute data are summarized by five multidimensional patterns of attribute covariation that reflect polarity, secondary structure, molecular volume, codon diversity, and electrostatic charge. Numerical scores for each amino acid then transform amino acid sequences for statistical analyses. Relationships between transformed data and amino acid substitution matrices show significant associations for polarity and codon diversity scores. Transformed alphabetic data are used in analysis of variance and discriminant analysis to study DNA binding in the basic helix-loop-helix proteins. The transformed scores offer a general solution for analyzing a wide variety of sequence analysis problems.
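The transformation step described in the abstract, replacing each amino acid letter with its five numerical scores, can be sketched as a simple table lookup. This is a minimal illustration, not the authors' implementation: the names `TOY_SCORES`, `encode_sequence`, and `column_means` are hypothetical, and the numbers in the table are made-up placeholders rather than the published factor scores, which must be taken from the paper itself.

```python
from typing import Dict, List, Sequence

# The five patterns named in the abstract, in the order listed there.
FACTOR_NAMES = (
    "polarity",
    "secondary_structure",
    "molecular_volume",
    "codon_diversity",
    "electrostatic_charge",
)

# PLACEHOLDER values for illustration only. These are NOT the published
# scores; substitute the real per-residue table from the paper.
TOY_SCORES: Dict[str, Sequence[float]] = {
    "A": (0.1, 0.2, 0.3, 0.4, 0.5),
    "K": (1.0, -0.5, 0.2, 0.0, 1.5),
    "E": (0.9, -0.8, 0.1, -0.2, -1.5),
    "L": (-1.0, -0.9, 0.4, 0.6, 0.0),
}


def encode_sequence(
    seq: str, scores: Dict[str, Sequence[float]]
) -> List[List[float]]:
    """Map each residue letter to its row of five numeric scores."""
    encoded = []
    for pos, residue in enumerate(seq):
        if residue not in scores:
            raise KeyError(f"no scores for residue {residue!r} at position {pos}")
        encoded.append(list(scores[residue]))
    return encoded


def column_means(encoded: List[List[float]]) -> List[float]:
    """Average each score column over all positions of one sequence."""
    n = len(encoded)
    return [sum(row[j] for row in encoded) / n for j in range(len(encoded[0]))]


if __name__ == "__main__":
    matrix = encode_sequence("AKEL", TOY_SCORES)
    for name, mean in zip(FACTOR_NAMES, column_means(matrix)):
        print(f"{name}: {mean:.3f}")
```

Once sequences are encoded this way, each alignment position becomes a set of ordinary numeric variables, which is what makes methods such as analysis of variance or discriminant analysis applicable.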
Similar articles
- Molecular architecture of the DNA-binding region and its relationship to classification of basic helix-loop-helix proteins. Mol Biol Evol. 2007 Jan;24(1):192-202. doi: 10.1093/molbev/msl143. Epub 2006 Oct 13. PMID: 17041153
- Spectral analysis of sequence variability in basic-helix-loop-helix (bHLH) protein domains. Evol Bioinform Online. 2007 Feb 9;2:187-96. PMID: 19455213. Free PMC article.
- Detecting compensatory covariation signals in protein evolution using reconstructed ancestral sequences. J Mol Biol. 2002 Jun 7;319(3):729-43. doi: 10.1016/S0022-2836(02)00239-5. PMID: 12054866
- Computational tools for protein modeling. Curr Protein Pept Sci. 2000 Jul;1(1):1-21. doi: 10.2174/1389203003381469. PMID: 12369918. Review.
- [Sequence variation of HIV and bioinformatics]. Uirusu. 2004 Jun;54(1):33-8. doi: 10.2222/jsv.54.33. PMID: 15449902. Review. Japanese.
Cited by
- Evolutionary pattern in the OXT-OXTR system in primates: coevolution and positive selection footprints. Proc Natl Acad Sci U S A. 2015 Jan 6;112(1):88-93. doi: 10.1073/pnas.1419399112. Epub 2014 Dec 22. PMID: 25535371. Free PMC article.
- The role of insulin C-peptide in the coevolution analyses of the insulin signaling pathway: a hint for its functions. PLoS One. 2012;7(12):e52847. doi: 10.1371/journal.pone.0052847. Epub 2012 Dec 27. PMID: 23300796. Free PMC article.
- Evolution of substrate recognition sites (SRSs) in cytochromes P450 from Apiaceae exemplified by the CYP71AJ subfamily. BMC Evol Biol. 2015 Jun 26;15:122. doi: 10.1186/s12862-015-0396-z. PMID: 26111527. Free PMC article.
- Ab initio detection of fuzzy amino acid tandem repeats in protein sequences. BMC Bioinformatics. 2012 Mar 21;13 Suppl 3(Suppl 3):S8. doi: 10.1186/1471-2105-13-S3-S8. PMID: 22536906. Free PMC article.
- Prediction and Analysis of Post-Translational Pyruvoyl Residue Modification Sites from Internal Serines in Proteins. PLoS One. 2013 Jun 21;8(6):e66678. doi: 10.1371/journal.pone.0066678. Print 2013. PMID: 23805260. Free PMC article.