BMC Bioinformatics. 2009 Jan 6:10:6.
doi: 10.1186/1471-2105-10-6.

Automated alphabet reduction for protein datasets

Jaume Bacardit et al. BMC Bioinformatics.

Abstract

Background: We investigate automated and generic alphabet reduction techniques for protein structure prediction datasets. Reducing alphabet cardinality without losing key biochemical information opens the door to potentially faster machine learning, data mining and optimization applications in structural bioinformatics. Furthermore, reduced but informative alphabets often result in, e.g., more compact and human-friendly classification/clustering rules. In this paper we propose a robust and sophisticated alphabet reduction protocol based on mutual information and state-of-the-art optimization techniques.

Results: We applied this protocol to the prediction of two protein structural features: contact number and relative solvent accessibility. For both features we generated alphabets of two, three, four and five letters. The five-letter alphabets gave prediction accuracies statistically similar to those obtained using the full amino acid alphabet. Moreover, the automatically designed alphabets outperformed other reduced alphabets, both those taken from the literature and human-designed ones. The differences between our alphabets and the alphabets taken from the literature were quantitatively analyzed. All of the above was performed using a primary sequence representation of proteins. As a final experiment, we extrapolated the obtained five-letter alphabet to reduce a much richer protein representation, based on evolutionary information, for the prediction of the same two features. Again, the performance gap between the full representation and the reduced representation was small, showing that the results of our automated alphabet reduction protocol, even though they were obtained using a simple representation, also capture the crucial information needed for state-of-the-art protein representations.

Conclusion: Our automated alphabet reduction protocol generates competent reduced alphabets tailored specifically for a variety of protein datasets. This process is done without any domain knowledge, using information theory metrics instead. The reduced alphabets contain some unexpected (but sound) groups of amino acids, thus suggesting new ways of interpreting the data.
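To make the idea of scoring a reduced alphabet with an information theory metric concrete, the following minimal Python sketch maps residues onto group letters and estimates the mutual information between the reduced letter and a class label. It is an illustration only, not the protocol from the paper: the function names, the single-residue (no window) setting, the two-group hydrophobic/polar split and the toy labels are all assumptions made for this example.

```python
from collections import Counter
from math import log2


def reduce_sequence(residues, groups):
    """Replace each residue by the index of the group that contains it.

    `groups` is a list of strings, one string of one-letter codes per group.
    """
    letter_of = {aa: i for i, group in enumerate(groups) for aa in group}
    return [letter_of[aa] for aa in residues]


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired observations."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    return sum(
        (c / n) * log2(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )


# Toy data (hypothetical): six residues with a binary label,
# 0 = buried, 1 = exposed.
residues = list("AVLKDE")
labels = [0, 0, 0, 1, 1, 1]

# Hypothetical two-letter alphabet, not one of the alphabets from the paper.
groups = ["AVL", "KDE"]

mi_full = mutual_information(residues, labels)
mi_reduced = mutual_information(reduce_sequence(residues, groups), labels)
print(mi_full, mi_reduced)
```

Merging letters can never increase the mutual information with the label, so a search over groupings would look for the partition that loses the least. In this toy case the grouping matches the labels exactly, so nothing is lost.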

Figures

Figure 1
Alphabet reductions for the CN feature. Groups are separated by '/'. Solid rectangle marks amino acids that remain in the same group for all four alphabets.
Figure 2
Alphabet reductions for the Solvent Accessibility feature. Groups are separated by '/'. Solid rectangle marks amino acids that remain in the same group for all four alphabets.
Figure 3
Alphabet Reduction process adapted to the Position-Specific Scoring Matrix residue representation.
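The Figure 3 caption does not say how the scores of grouped amino acids are combined when a reduced alphabet is applied to a Position-Specific Scoring Matrix. As a rough sketch only, the Python below assumes that the columns of all amino acids in a group are averaged into a single column. The averaging rule, the `reduce_pssm` name, the column order and the example grouping are assumptions made for illustration, not details taken from the paper.

```python
# Assumed column order of the 20 amino acid scores in each PSSM row.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def reduce_pssm(pssm, groups, column_order=AMINO_ACIDS):
    """Collapse each PSSM row to one score per alphabet group.

    `pssm` is a list of rows, one row per sequence position, each holding
    one score per amino acid in `column_order`. Each group's score is the
    mean of its members' scores (an assumed combination rule).
    """
    column = {aa: i for i, aa in enumerate(column_order)}
    return [
        [sum(row[column[aa]] for aa in group) / len(group) for group in groups]
        for row in pssm
    ]


# Hypothetical two-group alphabet and a one-position toy matrix.
groups = ["ARND", "CQEGHILKMFPSTWYV"]
toy_pssm = [list(range(20))]
reduced = reduce_pssm(toy_pssm, groups)
```

With a five-letter alphabet, each position would shrink from 20 scores to 5 under this assumption.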
