Automated alphabet reduction for protein datasets
- PMID: 19126227
- PMCID: PMC2646702
- DOI: 10.1186/1471-2105-10-6
Automated alphabet reduction for protein datasets
Abstract
Background: We investigate automated and generic alphabet reduction techniques for protein structure prediction datasets. Reducing alphabet cardinality without losing key biochemical information opens the door to potentially faster machine learning, data mining and optimization applications in structural bioinformatics. Furthermore, reduced but informative alphabets often result in, e.g., more compact and human-friendly classification/clustering rules. In this paper we propose a robust and sophisticated alphabet reduction protocol based on mutual information and state-of-the-art optimization techniques.
Results: We applied this protocol to the prediction of two protein structural features: contact number and relative solvent accessibility. For both features we generated alphabets of two, three, four and five letters. The five-letter alphabets gave prediction accuracies statistically similar to that obtained using the full amino acid alphabet. Moreover, the automatically designed alphabets were compared against other reduced alphabets taken from the literature or human-designed, outperforming them. The differences between our alphabets and the alphabets taken from the literature were quantitatively analyzed. All the above process had been performed using a primary sequence representation of proteins. As a final experiment, we extrapolated the obtained five-letter alphabet to reduce a, much richer, protein representation based on evolutionary information for the prediction of the same two features. Again, the performance gap between the full representation and the reduced representation was small, showing that the results of our automated alphabet reduction protocol, even if they were obtained using a simple representation, are also able to capture the crucial information needed for state-of-the-art protein representations.
Conclusion: Our automated alphabet reduction protocol generates competent reduced alphabets tailored specifically for a variety of protein datasets. This process is done without any domain knowledge, using information theory metrics instead. The reduced alphabets contain some unexpected (but sound) groups of amino acids, thus suggesting new ways of interpreting the data.
Figures



Similar articles
-
Reduced alphabet for protein folding prediction.Proteins. 2015 Apr;83(4):631-9. doi: 10.1002/prot.24762. Epub 2015 Feb 5. Proteins. 2015. PMID: 25641420
-
Reduced amino acid alphabets exhibit an improved sensitivity and selectivity in fold assignment.Bioinformatics. 2009 Jun 1;25(11):1356-62. doi: 10.1093/bioinformatics/btp164. Epub 2009 Apr 7. Bioinformatics. 2009. PMID: 19351620 Free PMC article.
-
A genetic approach for building different alphabets for peptide and protein classification.BMC Bioinformatics. 2008 Jan 24;9:45. doi: 10.1186/1471-2105-9-45. BMC Bioinformatics. 2008. PMID: 18218100 Free PMC article.
-
State-of-the-art bioinformatics protein structure prediction tools (Review).Int J Mol Med. 2011 Sep;28(3):295-310. doi: 10.3892/ijmm.2011.705. Epub 2011 May 23. Int J Mol Med. 2011. PMID: 21617841 Review.
-
Research progress of reduced amino acid alphabets in protein analysis and prediction.Comput Struct Biotechnol J. 2022 Jul 4;20:3503-3510. doi: 10.1016/j.csbj.2022.07.001. eCollection 2022. Comput Struct Biotechnol J. 2022. PMID: 35860409 Free PMC article. Review.
Cited by
-
Folding by numbers: primary sequence statistics and their use in studying protein folding.Int J Mol Sci. 2009 Apr 8;10(4):1567-1589. doi: 10.3390/ijms10041567. Int J Mol Sci. 2009. PMID: 19468326 Free PMC article. Review.
-
A Data Adaptive Biological Sequence Representation for Supervised Learning.J Healthc Inform Res. 2018 Oct 26;2(4):448-471. doi: 10.1007/s41666-018-0038-5. eCollection 2018 Dec. J Healthc Inform Res. 2018. PMID: 35415416 Free PMC article.
-
Generation of tactile maps for artificial skin.PLoS One. 2011;6(11):e26561. doi: 10.1371/journal.pone.0026561. Epub 2011 Nov 10. PLoS One. 2011. PMID: 22102863 Free PMC article.
-
Lambda: the local aligner for massive biological data.Bioinformatics. 2014 Sep 1;30(17):i349-55. doi: 10.1093/bioinformatics/btu439. Bioinformatics. 2014. PMID: 25161219 Free PMC article.
-
Prediction of bacterial E3 ubiquitin ligase effectors using reduced amino acid peptide fingerprinting.PeerJ. 2019 Jun 7;7:e7055. doi: 10.7717/peerj.7055. eCollection 2019. PeerJ. 2019. PMID: 31211016 Free PMC article.
References
-
- Dill KA. Theory for the folding and stability of globular proteins. Biochemistry. 1985;24:1501–1509. - PubMed
-
- Krasnogor N, Blackburne B, Burke E, Hirst J. Multimeme Algorithms for Protein Structure Prediction. Proceedings of the Parallel Problem Solving from Nature VII Lecture Notes in Computer Science. 2002;2439:769–778.
-
- Stout M, Bacardit J, Hirst JD, Krasnogor N, Blazewicz J. Applications of Evolutionary Computing, EvoWorkshops 2006. Springer LNCS 3907; 2006. From HP Lattice Models to Real Proteins: Coordination Number Prediction Using Learning Classifier Systems; pp. 208–220.
Publication types
MeSH terms
Substances
LinkOut - more resources
Full Text Sources
Research Materials
Miscellaneous