BMC Bioinformatics. 2009 Jan 6;10:6.

doi: 10.1186/1471-2105-10-6.

Automated alphabet reduction for protein datasets

Jaume Bacardit¹, Michael Stout, Jonathan D Hirst, Alfonso Valencia, Robert E Smith, Natalio Krasnogor

Affiliation

¹ ASAP research group, School of Computer Science, University of Nottingham, Jubilee Campus, Wollaton Road, Nottingham, NG8 1BB, UK. jaume.bacardit@nottingham.ac.uk

PMID: 19126227
PMCID: PMC2646702
DOI: 10.1186/1471-2105-10-6

Jaume Bacardit et al. BMC Bioinformatics. 2009.
Abstract

Background: We investigate automated and generic alphabet reduction techniques for protein structure prediction datasets. Reducing alphabet cardinality without losing key biochemical information opens the door to potentially faster machine learning, data mining and optimization applications in structural bioinformatics. Furthermore, reduced but informative alphabets often yield more compact and human-friendly classification and clustering rules. In this paper we propose a robust and sophisticated alphabet reduction protocol based on mutual information and state-of-the-art optimization techniques.
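As a rough illustration of how mutual information can score a candidate reduced alphabet, the sketch below maps residues to group indices and measures the empirical mutual information between the reduced letter and a class label. It is a minimal sketch, not the protocol from the paper: the single-residue scoring, the function names `reduce_sequence` and `mutual_information`, the toy sequence, the labels, and both groupings are assumptions made up for this example, and the optimization step that searches over groupings is not shown.

```python
from collections import Counter
from math import log2


def reduce_sequence(seq, groups):
    """Map each residue to the index of the group that contains it."""
    letter_to_group = {aa: i for i, group in enumerate(groups) for aa in group}
    return [letter_to_group[aa] for aa in seq]


def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y), in bits, of two equal-length lists."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


# Toy data, invented for illustration only (not taken from the paper).
residues = "AVLKDEAVKD"
labels = [1, 1, 1, 0, 0, 0, 1, 1, 0, 0]  # hypothetical two-class structural label

# Two hypothetical two-letter alphabets over the residues used above.
grouping_a = ["AVL", "KDE"]
grouping_b = ["AKD", "VLE"]

mi_a = mutual_information(reduce_sequence(residues, grouping_a), labels)
mi_b = mutual_information(reduce_sequence(residues, grouping_b), labels)
print(f"grouping_a: {mi_a:.3f} bits, grouping_b: {mi_b:.3f} bits")
```

On this toy data the first grouping keeps more information about the label than the second, which is the kind of comparison an optimizer could use as its objective when searching the space of groupings.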

Results: We applied this protocol to the prediction of two protein structural features: contact number and relative solvent accessibility. For both features we generated alphabets of two, three, four and five letters. The five-letter alphabets gave prediction accuracies statistically similar to those obtained using the full amino acid alphabet. Moreover, the automatically designed alphabets outperformed other reduced alphabets, both human-designed and taken from the literature. The differences between our alphabets and those taken from the literature were analyzed quantitatively. All of the above was performed using a primary sequence representation of proteins. As a final experiment, we used the five-letter alphabet to reduce a much richer protein representation, based on evolutionary information, for the prediction of the same two features. Again, the performance gap between the full and the reduced representation was small, showing that the alphabets produced by our automated protocol, although obtained using a simple representation, also capture the crucial information needed by state-of-the-art protein representations.
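One way to picture the final experiment is to collapse the 20 amino acid columns of a position-specific profile into one column per alphabet group. The abstract does not say how columns are merged, so the averaging below is an assumption, as are the function name `reduce_pssm`, the column order, and the grouping `PLACEHOLDER_GROUPS`, which is an arbitrary stand-in and not an alphabet reported by the authors.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed column order of the profile

# Arbitrary five-group placeholder; NOT an alphabet from the paper.
PLACEHOLDER_GROUPS = ["AVLIMC", "FWY", "KRH", "DENQ", "GPST"]


def reduce_pssm(pssm, groups, order=AMINO_ACIDS):
    """Collapse each row (one sequence position, one score per amino acid)
    to one score per group by averaging the group's columns (assumed rule)."""
    col = {aa: i for i, aa in enumerate(order)}
    return [
        [sum(row[col[aa]] for aa in group) / len(group) for group in groups]
        for row in pssm
    ]


# Toy profile with a single position whose scores are 0..19.
profile = [list(range(20))]
reduced = reduce_pssm(profile, PLACEHOLDER_GROUPS)
print(len(profile[0]), "->", len(reduced[0]), "columns per position")
```

Under this assumed rule each position shrinks from 20 values to 5, so a windowed input to a predictor shrinks by the same factor.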

Conclusion: Our automated alphabet reduction protocol generates competent reduced alphabets tailored specifically for a variety of protein datasets. The process uses no domain knowledge, relying on information-theoretic metrics instead. The reduced alphabets contain some unexpected (but sound) groups of amino acids, suggesting new ways of interpreting the data.

Figures

**Figure 1**
**Alphabet reductions for the contact number (CN) feature**. Groups are separated by '/'. Solid rectangle marks amino acids that remain in the same group for all four alphabets.

**Figure 2**
**Alphabet reductions for the Solvent Accessibility feature**. Groups are separated by '/'. Solid rectangle marks amino acids that remain in the same group for all four alphabets.

**Figure 3**
**Alphabet Reduction process adapted to the Position-Specific Scoring Matrix residue representation**.

Cited by

Folding by numbers: primary sequence statistics and their use in studying protein folding.
Wathen B, Jia Z. Int J Mol Sci. 2009 Apr 8;10(4):1567-1589. doi: 10.3390/ijms10041567. PMID: 19468326. Review.
A Data Adaptive Biological Sequence Representation for Supervised Learning.
Cakin H, Gorgulu B, Baydogan MG, Zou N, Li J. J Healthc Inform Res. 2018 Oct 26;2(4):448-471. doi: 10.1007/s41666-018-0038-5. PMID: 35415416.
Generation of tactile maps for artificial skin.
McGregor S, Polani D, Dautenhahn K. PLoS One. 2011;6(11):e26561. doi: 10.1371/journal.pone.0026561. PMID: 22102863.
Lambda: the local aligner for massive biological data.
Hauswedell H, Singer J, Reinert K. Bioinformatics. 2014 Sep 1;30(17):i349-55. doi: 10.1093/bioinformatics/btu439. PMID: 25161219.
Prediction of bacterial E3 ubiquitin ligase effectors using reduced amino acid peptide fingerprinting.
McDermott JE, Cort JR, Nakayasu ES, Pruneda JN, Overall C, Adkins JN. PeerJ. 2019 Jun 7;7:e7055. doi: 10.7717/peerj.7055. PMID: 31211016.
