2008 Nov 18;9:487.
doi: 10.1186/1471-2105-9-487.

Use of machine learning algorithms to classify binary protein sequences as highly-designable or poorly-designable

Myron Peto et al. BMC Bioinformatics.

Abstract

Background: By using a standard Support Vector Machine (SVM) trained with the Sequential Minimal Optimization (SMO) method, Naïve Bayes, and other machine learning algorithms, we are able to distinguish between two classes of protein sequences: those folding to highly-designable conformations and those folding to poorly- or non-designable conformations.

Results: First, we generate all possible compact lattice conformations for the specified shape (a hexagon or a triangle) on the 2D triangular lattice. Then we generate all possible binary hydrophobic/polar (H/P) sequences and, using a specified energy function, thread them through all of these compact conformations. If, for a given sequence, the lowest energy is obtained for one particular lattice conformation, we assume that the sequence folds to that conformation. Highly-designable conformations have many H/P sequences folding to them, while poorly-designable conformations have few or none. We classify sequences as folding to either highly- or poorly-designable conformations. We randomly selected subsets of the sequences belonging to highly-designable and poorly-designable conformations and used them to train several standard machine learning algorithms.
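The threading procedure described above can be sketched on a much smaller shape. The example below is a minimal illustration, not the authors' code: it uses a 7-site hexagon on the triangular lattice, and it assumes a simple H/P energy of -1 per non-bonded H-H contact, because the abstract does not state the energy function. Conformations with identical contact maps are merged, which is a simplification standing in for "unrelated by shape symmetries". All function names are hypothetical.

```python
from collections import Counter
from itertools import product

# Six neighbour offsets on the triangular lattice, in axial coordinates (q, r).
NEIGHBOURS = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, -1), (-1, 1)]


def hexagon_sites(radius):
    """Lattice sites within `radius` steps of the origin (a hexagonal shape)."""
    span = range(-radius, radius + 1)
    return {(q, r) for q in span for r in span if abs(q + r) <= radius}


def compact_walks(sites):
    """Yield every self-avoiding walk that visits each site exactly once."""
    def extend(walk, visited):
        if len(walk) == len(sites):
            yield tuple(walk)
            return
        q, r = walk[-1]
        for dq, dr in NEIGHBOURS:
            nxt = (q + dq, r + dr)
            if nxt in sites and nxt not in visited:
                walk.append(nxt)
                visited.add(nxt)
                yield from extend(walk, visited)
                walk.pop()
                visited.remove(nxt)

    for start in sorted(sites):
        yield from extend([start], {start})


def contact_map(walk):
    """Pairs (i, j) of chain positions that are lattice neighbours but not bonded."""
    index = {site: i for i, site in enumerate(walk)}
    contacts = set()
    for (q, r), i in index.items():
        for dq, dr in NEIGHBOURS:
            j = index.get((q + dq, r + dr))
            if j is not None and j > i + 1:
                contacts.add((i, j))
    return frozenset(contacts)


def energy(seq, contacts):
    """ASSUMED energy function: -1 for every non-bonded H-H contact."""
    return -sum(1 for i, j in contacts if seq[i] == "H" and seq[j] == "H")


def designability(sites):
    """Count, per conformation, the sequences with that unique ground state."""
    conformations = list({contact_map(w) for w in compact_walks(sites)})
    counts = Counter()
    for seq in product("HP", repeat=len(sites)):
        energies = [energy(seq, c) for c in conformations]
        lowest = min(energies)
        # A sequence "folds" only if exactly one conformation has the lowest energy.
        if energies.count(lowest) == 1:
            counts[energies.index(lowest)] += 1
    return conformations, counts


conformations, counts = designability(hexagon_sites(1))
print(len(conformations), "distinct contact maps")
print(sum(counts.values()), "sequences with a unique ground state")
```

Conformations with large values in `counts` play the role of highly-designable conformations; those with small or zero counts are poorly-designable. The shapes in the paper are far larger, so a brute-force loop like this one is meant only to make the definitions concrete.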

Conclusion: By using these machine learning algorithms with ten-fold cross-validation, we are able to classify the two classes of sequences with high accuracy, in some cases exceeding 95%.
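As a rough illustration of ten-fold cross-validation with a Naïve Bayes classifier, the sketch below uses only the Python standard library. It is not the authors' pipeline: the sequences are random toy data, the label rule (whether the sequence contains "HHH") is invented purely so that there is something to learn, the folds are not stratified, and the features (presence of each H/P tripeptide) are an assumption inspired by the tripeptide segments mentioned in the figure captions.

```python
import math
import random
from itertools import product

# All eight H/P tripeptides, used as binary presence/absence features.
TRIPEPTIDES = ["".join(t) for t in product("HP", repeat=3)]


def features(seq):
    """Binary vector: does each tripeptide occur anywhere in the sequence?"""
    return [int(t in seq) for t in TRIPEPTIDES]


def train_nb(X, y):
    """Bernoulli Naive Bayes with Laplace smoothing."""
    model = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior = math.log(len(rows) / len(X))
        probs = [(sum(col) + 1) / (len(rows) + 2) for col in zip(*rows)]
        model[c] = (log_prior, probs)
    return model


def predict_nb(model, x):
    """Return the class with the highest posterior log-score."""
    def score(c):
        log_prior, probs = model[c]
        return log_prior + sum(
            math.log(p if v else 1 - p) for p, v in zip(probs, x)
        )
    return max(model, key=score)


def k_fold_indices(n, k, rng):
    """Shuffle 0..n-1 and deal the indices into k folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(X, y, k=10, seed=0):
    """Train on k-1 folds, test on the held-out fold, and pool the accuracy."""
    folds = k_fold_indices(len(X), k, random.Random(seed))
    correct = 0
    for held_out in folds:
        test = set(held_out)
        train_X = [X[i] for i in range(len(X)) if i not in test]
        train_y = [y[i] for i in range(len(X)) if i not in test]
        model = train_nb(train_X, train_y)
        correct += sum(predict_nb(model, X[i]) == y[i] for i in held_out)
    return correct / len(X)


# Toy data only: random sequences with an INVENTED label rule.
rng = random.Random(1)
seqs = ["".join(rng.choice("HP") for _ in range(12)) for _ in range(200)]
labels = [int("HHH" in s) for s in seqs]
X = [features(s) for s in seqs]
accuracy = cross_validate(X, labels)
print(f"10-fold cross-validated accuracy on toy data: {accuracy:.3f}")
```

Each sequence is used for testing exactly once and for training nine times, so the pooled accuracy estimates performance on sequences the classifier has not seen.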

Figures

Figure 1
The hexagonal and the triangular shapes used for the designability studies. There are 20,843 different compact conformations unrelated by shape symmetries for this hexagon and 22,104 for this triangle.
Figure 2
The dependence of the logarithm of the number of conformations Nconf on the number NS of sequences folding to them. a) corresponds to data for the hexagonal shape and b) is for the triangular shape.
Figure 3
The most designable conformations for a) the hexagonal and b) the triangular shape. Conformation a) has 54 sequences folding to it and 11 peptide bonds connecting the protein interior with exterior; conformation b) has 423 sequences folding to it and 9 interior-exterior spanning peptide bonds.
Figure 4
Average energy difference between the ground state and the next lowest energy state for different values of designability NS for the hexagonal (a) and triangular (b) shapes. Although there is a strong visible trend towards a higher energy gap as the conformations become more designable, there are exceptions, particularly for the most designable conformations (those with the largest NS), which in both cases have average energy gaps below the maximum.
Figure 5
The average number of sequences folding to conformations having the specified number of covalent bonds connecting protein interior with exterior for a) hexagonal and b) triangular shapes.
Figure 6
ROC curve for the Naïve Bayes classifier. Tripeptide segments are used to classify binary sequences folding to highly- and poorly-designable conformations of the hexagonal shape. The diagonal line y = x, which we would expect if we used a classifier that randomly guessed which class to assign to a sequence, has been added for clarification.
Figure 7
ROC curve for the Decision Tree (J48) classifier. Tripeptide segments are used to classify binary sequences folding to highly- and poorly-designable conformations for both the hexagonal and triangular shapes. The line x = y, expected for the random case, is shown for comparison.
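The ROC curves in Figures 6 and 7 plot the true-positive rate against the false-positive rate as the decision threshold is varied. The sketch below shows one way such a curve and its area can be computed from classifier scores. The scores and labels are made-up toy values, not results from the paper, and the function names are hypothetical.

```python
def roc_points(scores, labels):
    """Return (false-positive rate, true-positive rate) points of the ROC curve.

    `labels` holds 1 for the positive class and 0 for the negative class;
    a higher score means the classifier leans towards the positive class.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Emit a point only after all samples tied at this score are counted.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Toy scores and labels, for illustration only.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
toy_labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(toy_scores, toy_labels)
print(curve)
print(auc(curve))
```

A classifier that guesses at random traces the diagonal from (0, 0) to (1, 1), whose area is 0.5; the further a curve bows above that diagonal, the better the classifier separates the two classes.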
