PLoS Comput Biol. 2009 Feb;5(2):e1000281.

doi: 10.1371/journal.pcbi.1000281. Epub 2009 Feb 6.

Predicting peptide structures in native proteins from physical simulations of fragments

Vincent A Voelz¹, M Scott Shell, Ken A Dill

PMID: 19197352
PMCID: PMC2629132
DOI: 10.1371/journal.pcbi.1000281

Vincent A Voelz et al. PLoS Comput Biol. 2009 Feb.

Affiliation

¹ Department of Chemistry, Stanford University, Stanford, CA, USA. vvoelz@stanford.edu

Abstract

It has long been proposed that much of the information encoding how a protein folds is contained locally in the peptide chain. Here we present a large-scale simulation study designed to examine the extent to which conformations of peptide fragments in water predict native conformations in proteins. We perform replica exchange molecular dynamics (REMD) simulations of 872 8-mer, 12-mer, and 16-mer peptide fragments from 13 proteins using the AMBER 96 force field and the OBC implicit solvent model. To analyze the simulations, we compute various contact-based metrics, such as contact probability, and then apply Bayesian classifier methods to infer which metastable contacts are likely to be native vs. non-native. We find that a simple measure, the observed contact probability, is largely more predictive of a peptide's native structure in the protein than combinations of metrics or multi-body components. Our best classification model is a logistic regression model that can achieve up to 63% correct classifications for 8-mers, 71% for 12-mers, and 76% for 16-mers. We validate these results on fragments of a protein outside our training set. We conclude that local structure provides information to solve some but not all of the conformational search problem. These results help improve our understanding of folding mechanisms, and have implications for improving physics-based conformational sampling and structure prediction using all-atom molecular simulations.
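As a rough illustration of the central quantity in the abstract, the sketch below computes an observed contact probability from a set of simulation frames and passes it through a one-variable logistic model. It is a minimal sketch, not the authors' pipeline: the 8 Å distance cutoff, the toy coordinates, and the coefficients `beta0` and `beta1` are all assumptions chosen for illustration.

```python
import math

def contact_probability(frames, i, j, cutoff=8.0):
    """Fraction of frames in which sites i and j lie within `cutoff` (in Å).

    `frames` is a list of conformations; each conformation is a list of
    (x, y, z) tuples, one per residue. The cutoff value is an assumption.
    """
    made = sum(1 for coords in frames if math.dist(coords[i], coords[j]) < cutoff)
    return made / len(frames)

def native_probability(cprob, beta0, beta1):
    """One-variable logistic model: P(native) = 1 / (1 + exp(-(beta0 + beta1 * cprob)))."""
    logit = beta0 + beta1 * cprob
    return 1.0 / (1.0 + math.exp(-logit))

# Toy data: four frames of a three-residue fragment (hypothetical coordinates).
frames = [
    [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (6.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (12.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (9.0, 0.0, 0.0)],
]

cprob = contact_probability(frames, 0, 2)
# Hypothetical coefficients; a real model would fit them to training data.
p_native = native_probability(cprob, beta0=-2.0, beta1=4.0)
print(cprob, p_native)
```

In practice the frames would come from the REMD trajectories and the coefficients from a fit against known native contacts; here they only show how a contact frequency maps to a native/non-native probability.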

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Figure 1. Cluster conformations from fragment simulations sample native-like states.**
(A) For each target sequence and fragment length, the C-alpha RMSD-to-native values (in Å) for all representative cluster conformations along the target sequence are shown. Each line on the plot corresponds to a cluster conformation, color-coded by native secondary structure: alpha-helix (yellow), beta-hairpin (cyan), or other turn types (magenta). The relative shading of the lines is proportional to the population fraction. The horizontal axis is the sequence position along the protein chain. (B) The fraction of cluster conformations that sample within a particular RMSD-to-native, across all fragment simulations of a given chain length. For comparison, the black line shows the results for a random distribution of C-alpha RMSD values calculated from native protein structures (see Methods).

**Figure 2. A summary of the contact metrics examined in this study.**
Each metric is calculated on a per-contact basis from the simulation data. Further details are in Methods.

**Figure 3. The model relevances for each contact metric in the best 8-mer, 12-mer, and 16-mer logistic regression models.**
The values show that contact probability (CPROB) is the most important metric in predicting whether a contact observed in the computer simulations is likely to be in the native structure of the protein. The model relevance of a contact metric is defined as $\beta_i \sigma_i$, where $\beta_i$ is the logistic regression coefficient for the metric, and $\sigma_i$ is the standard deviation of the metric.
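The relevance measure in this caption can be sketched as follows, assuming it is the regression coefficient scaled by the metric's standard deviation (a standardized coefficient). The metric name `OTHER`, the sample values, and the coefficients are hypothetical; only CPROB is named in the caption.

```python
import statistics

def model_relevance(coefficients, metric_values):
    """Scale each regression coefficient by the spread of its metric.

    Assumption: relevance = coefficient * population standard deviation,
    so metrics measured on different scales become comparable.
    """
    return {
        name: coefficients[name] * statistics.pstdev(metric_values[name])
        for name in coefficients
    }

# Hypothetical per-contact metric values and fitted coefficients.
metric_values = {
    "CPROB": [0.0, 0.5, 1.0, 0.5],
    "OTHER": [1.0, 1.0, 3.0, 3.0],
}
coefficients = {"CPROB": 4.0, "OTHER": 0.5}

relevance = model_relevance(coefficients, metric_values)
for name, value in sorted(relevance.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {value:.3f}")
```

Scaling by the standard deviation keeps a metric with a large numeric range from looking unimportant simply because its raw coefficient is small.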

**Figure 4. Testing and training curves for the logistic regression models.**
Results are shown for models built from the (A) 8-mer simulation data, (B) 12-mer data, and (C) 16-mer data. For each contact definition we tested (including sidechain-centroid), shown is the model quality (Q) for a series of models, calculated from the training data (dotted) and the testing data (solid) (see Methods for details). The larger the value, the more predictive the model. From left to right, the model quality (Q) for the best 1-, 2-, 3-, 4-, and 5-metric regression models is plotted, labeled with the sequence of additional metrics that increasingly improve the model quality.

**Figure 5. Contact prediction success for all proteins in the test set.**
Predictions were made using the best logistic regression models built from the 8-mer, 12-mer, and 16-mer simulations.

**Figure 6. A contact map showing the results of the best 16-mer regression model for an example target, T0363.**
Above the diagonal, the grayscale values at each contact position correspond to ‘logit’ values given by the best logistic regression model trained on all the 16-mer simulation data. The background gray value corresponds to contacts not sampled by the fragment simulations, and is colored according to the logit value threshold used for the classification criterion; logit values above the threshold are classified as native and appear darker, while logit values below the threshold are classified as non-native and appear lighter. Below the diagonal are shown the native contacts in the range sampled by the fragment simulations. (8-mer, 12-mer, and 16-mer predictions for all targets are shown in Text S1.)
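The thresholding described in this caption can be illustrated with a short sketch. It assumes a threshold of 0 on the logit (equivalent to a predicted probability of 0.5), which may differ from the threshold actually used; the contact pairs, logit values, and native set are hypothetical.

```python
def classify_contacts(logits, threshold=0.0):
    """Return the set of contacts whose logit exceeds the threshold (predicted native)."""
    return {pair for pair, value in logits.items() if value > threshold}

def fraction_correct(logits, native_contacts, threshold=0.0):
    """Fraction of sampled contacts whose native/non-native label is predicted correctly."""
    predicted = classify_contacts(logits, threshold)
    correct = sum(
        1 for pair in logits if (pair in predicted) == (pair in native_contacts)
    )
    return correct / len(logits)

# Hypothetical logit values for four sampled contacts (residue index pairs).
logits = {(1, 5): 2.0, (2, 6): -1.0, (3, 7): 0.5, (4, 8): -0.3}
native_contacts = {(1, 5), (4, 8)}

predicted_native = classify_contacts(logits)
success = fraction_correct(logits, native_contacts)
print(sorted(predicted_native), success)
```

Only contacts that were sampled in the fragment simulations receive a logit, so the success rate here is computed over sampled contacts alone.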

**Figure 7. A target from CASP6 (1whz) used to test the classification model.**
The ribbon diagram of the X-ray crystal structure was made with PyMOL.

**Figure 8. Logit values and prediction successes given by the best classification models for fragment simulations of 1whz.**
The upper diagonal shows the logit scores with prediction success rates. The lower diagonal shows native contacts in the range sampled by the fragment simulations. As the fragment simulations increase in length, clear signals of predicted secondary structures begin to emerge. For comparison (bottom row) are shown the logit values and prediction scores given by the best regression model trained only on contact probability. The similarity of the two models shows that most of the predictive power comes directly from the frequency of contacts observed in the simulation data.

**Figure 9. RMSD-to-native of cluster conformations plotted versus cluster conformation scores for all cluster conformations extracted from 16-mer fragment simulations of 1whz.**
Each dot represents a cluster conformation, color-coded according to its region along the protein sequence: residues 1–20 (cyan), residues 12–39 (magenta), residues 28–53 (yellow), and residues 42–70 (cyan). On the left (residues 1–20 and 28–53) are examples of high conformational cluster scores predicting native structures, while on the right (residues 12–39 and 42–70) are examples of high-scoring decoy structures.

**Figure 10. Examples of pairwise stability and pairwise cooperativity used in calculating mutual stability and cooperativity scores.**
For a particular pair of contacts, each contact has an indicator variable: 1 if the contact is made, and 0 if the contact is not made. The pairwise distribution represents the joint probability of the two contacts being made or not. Pairwise *stability* is at a maximum when both contacts are made with a probability of 1. Pairwise *cooperativity* is maximized when the two contacts are formed in an all-or-nothing way, so as to maximize the mutual information between their indicator variables.
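To make the two pairwise quantities concrete, here is a minimal sketch that builds the joint distribution of two contact indicators and computes their mutual information. Using the joint probability that both contacts are made as the stability value is an assumption for illustration, as are the function names and the toy indicator series; the exact scoring used in the study is given in its Methods.

```python
import math

def joint_distribution(a, b):
    """Empirical joint distribution of two equal-length 0/1 indicator series."""
    n = len(a)
    joint = {(x, y): 0.0 for x in (0, 1) for y in (0, 1)}
    for x, y in zip(a, b):
        joint[(x, y)] += 1.0 / n
    return joint

def mutual_information(joint):
    """Mutual information (in bits) between the two indicator variables."""
    px = {x: joint[(x, 0)] + joint[(x, 1)] for x in (0, 1)}
    py = {y: joint[(0, y)] + joint[(1, y)] for y in (0, 1)}
    mi = 0.0
    for (x, y), p in joint.items():
        if p > 0.0:
            mi += p * math.log2(p / (px[x] * py[y]))
    return mi

# All-or-nothing pair: both contacts form together or not at all.
cooperative = joint_distribution([1, 1, 0, 0], [1, 1, 0, 0])
# Independent pair: forming one contact says nothing about the other.
independent = joint_distribution([1, 1, 0, 0], [1, 0, 1, 0])
# Always-formed pair: maximal stability, but no information shared.
always = joint_distribution([1, 1, 1, 1], [1, 1, 1, 1])

mi_cooperative = mutual_information(cooperative)
mi_independent = mutual_information(independent)
mi_always = mutual_information(always)
stability_always = always[(1, 1)]  # assumed stand-in for pairwise stability
print(mi_cooperative, mi_independent, mi_always, stability_always)
```

The always-formed pair shows why the two measures are distinct: both contacts are present in every frame, yet the mutual information is zero because neither indicator varies.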
