PLoS Comput Biol. 2009 Feb;5(2):e1000281.
doi: 10.1371/journal.pcbi.1000281. Epub 2009 Feb 6.

Predicting peptide structures in native proteins from physical simulations of fragments


Vincent A Voelz et al. PLoS Comput Biol. 2009 Feb.

Abstract

It has long been proposed that much of the information encoding how a protein folds is contained locally in the peptide chain. Here we present a large-scale simulation study designed to examine the extent to which conformations of peptide fragments in water predict native conformations in proteins. We perform replica exchange molecular dynamics (REMD) simulations of 872 8-mer, 12-mer, and 16-mer peptide fragments from 13 proteins using the AMBER 96 force field and the OBC implicit solvent model. To analyze the simulations, we compute various contact-based metrics, such as contact probability, and then apply Bayesian classifier methods to infer which metastable contacts are likely to be native vs. non-native. We find that a simple measure, the observed contact probability, is largely more predictive of a peptide's native structure in the protein than combinations of metrics or multi-body components. Our best classification model is a logistic regression model that can achieve up to 63% correct classifications for 8-mers, 71% for 12-mers, and 76% for 16-mers. We validate these results on fragments of a protein outside our training set. We conclude that local structure provides information to solve some but not all of the conformational search problem. These results help improve our understanding of folding mechanisms, and have implications for improving physics-based conformational sampling and structure prediction using all-atom molecular simulations.
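The observed contact probability highlighted in the abstract can be illustrated with a minimal sketch. It assumes that a contact is counted whenever two residue positions lie within a distance cutoff in a given simulation snapshot; the 8 Å cutoff, the minimum sequence separation of 3, and the function name `contact_probability` are illustrative assumptions, not values taken from the study.

```python
import math

def contact_probability(frames, cutoff=8.0, min_separation=3):
    """Return {(i, j): fraction of frames in which residues i and j are in contact}.

    frames: list of snapshots, each a list of (x, y, z) positions, one per residue.
    cutoff and min_separation are hypothetical defaults chosen for illustration.
    Pairs that are never in contact are omitted from the result.
    """
    n_res = len(frames[0])
    counts = {}
    for coords in frames:
        for i in range(n_res):
            # Skip near neighbors along the chain, which are trivially close.
            for j in range(i + min_separation, n_res):
                if math.dist(coords[i], coords[j]) <= cutoff:
                    counts[(i, j)] = counts.get((i, j), 0) + 1
    return {pair: c / len(frames) for pair, c in counts.items()}

# Two toy snapshots of a 5-residue fragment: one extended, one folded back.
extended = [(3.8 * k, 0.0, 0.0) for k in range(5)]
hairpin = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.7, 3.3, 0.0),
           (3.8, 6.6, 0.0), (0.0, 6.6, 0.0)]
probs = contact_probability([extended, hairpin])
```

In this toy input the contact between residues 0 and 4 is formed only in the folded-back snapshot, so its observed probability is one half.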


Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1. Cluster conformations from fragment simulations sample native-like states.
(A) For each target sequence and fragment length, the C-alpha RMSD-to-native values (in Å) for all representative cluster conformations along the target sequence are shown. Each line on the plot corresponds to a cluster conformation, color-coded by native secondary structure: alpha-helix (yellow), beta-hairpin (cyan), or other turn types (magenta). The relative shading of the lines is proportional to the population fraction. The horizontal axis is the sequence position along the protein chain. (B) The fraction of cluster conformations that sample within a particular RMSD-to-native, across all fragment simulations of a given chain length. For comparison, the black line shows the results for a random distribution of C-alpha RMSD values calculated from native protein structures (see Methods).
Figure 2. A summary of the contact metrics examined in this study.
Each metric is calculated on a per-contact basis from the simulation data. Further details are in Methods.
Figure 3. The model relevances for each contact metric in the best 8-mer, 12-mer, and 16-mer logistic regression models.
The relevance values show that contact probability (CPROB) is the most important metric in predicting whether a contact observed in the computer simulations is likely to be in the native structure of the protein. The model relevance of a contact metric is defined in terms of the logistic regression coefficient for that metric and the standard deviation of the metric.
Figure 4. Testing and training curves for the logistic regression models.
Results are shown for models built from the (A) 8-mer simulation data, (B) 12-mer data, and (C) 16-mer data. For each contact definition we tested (including sidechain-centroid), the model quality (Q) is shown for a series of models, calculated from the training data (dotted) and the testing data (solid) (see Methods for details). The larger the Q value, the more predictive the model. From left to right, the model quality (Q) for the best 1-, 2-, 3-, 4-, and 5-metric regression models is plotted, labeled with the sequence of additional metrics that increasingly improve the model quality.
Figure 5. Contact prediction success for all proteins in the test set.
Predictions were made using the best logistic regression models built from the 8-mer, 12-mer, and 16-mer simulations.
Figure 6. A contact map showing the results of the best 16-mer regression model for an example target, T0363.
Above the diagonal, the grayscale values at each contact position correspond to the ‘logit’ values given by the best logistic regression model trained on all the 16-mer simulation data. The background gray value corresponds to contacts not sampled by the fragment simulations, and is colored according to the logit threshold used for the classification criterion; logit values above the threshold are classified as native and appear darker, while logit values below the threshold are classified as non-native and appear lighter. Below the diagonal are shown the native contacts in the range sampled by the fragment simulations. (8-mer, 12-mer, and 16-mer predictions for all targets are shown in Text S1.)
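The classification step behind such a contact map can be sketched as follows. This is a generic logistic regression scoring rule, not the fitted model from the study: the coefficient for CPROB, the intercept, the threshold of 0, and the function names are all hypothetical values chosen for illustration.

```python
import math

def logit_score(metrics, coefficients, intercept):
    """Linear predictor of a logistic regression: intercept + sum(coef * metric)."""
    return intercept + sum(coefficients[name] * value
                           for name, value in metrics.items())

def native_probability(logit):
    """Logistic (sigmoid) transform of the logit into a probability."""
    return 1.0 / (1.0 + math.exp(-logit))

def classify(logit, threshold=0.0):
    """Label a contact as native when its logit exceeds the threshold."""
    return "native" if logit > threshold else "non-native"

# Hypothetical single-metric model using only contact probability (CPROB).
coefficients = {"CPROB": 4.0}
intercept = -2.0

frequent = logit_score({"CPROB": 0.9}, coefficients, intercept)
rare = logit_score({"CPROB": 0.2}, coefficients, intercept)
labels = [classify(frequent), classify(rare)]
```

With these made-up parameters, a contact seen in 90% of snapshots scores above the threshold and a contact seen in 20% of snapshots scores below it. A logit threshold of 0 corresponds to a predicted probability of 0.5.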
Figure 7. A target from CASP6 (1whz) used to test the classification model.
The ribbon diagram of the X-ray crystal structure was made with PyMOL.
Figure 8. Logit values and prediction successes given by the best classification models for fragment simulations of 1whz.
The upper diagonal shows the logit scores with prediction success rates. The lower diagonal shows native contacts in the range sampled by the fragment simulations. As the fragment simulations increase in length, clear signals of predicted secondary structures begin to emerge. For comparison, the bottom row shows the logit values and prediction scores given by the best regression model trained only on contact probability. The similarity of the two models shows that most of the predictive power comes directly from the frequency of contacts observed in the simulation data.
Figure 9. RMSD-to-native of cluster conformations plotted versus cluster conformation scores for all cluster conformations extracted from 16-mer fragment simulations of 1whz.
Each dot represents a cluster conformation, color-coded according to its region along the protein sequence: residues 1–20 (cyan), residues 12–39 (magenta), residues 28–53 (yellow), and residues 42–70 (cyan). On the left (residues 1–20 and 28–53) are examples of high conformational cluster scores predicting native structures, while on the right (residues 12–39 and 42–70) are examples of high-scoring decoy structures.
Figure 10. Examples of pairwise stability and pairwise cooperativity used in calculating mutual stability and cooperativity scores.
For a particular pair of contacts, each contact has an indicator variable: 1 if the contact is made, and 0 if the contact is not made. The pairwise distribution represents the joint probability of the two contacts being made or not. Pairwise stability is at a maximum when both contacts are made with a probability of 1. Pairwise cooperativity is maximized when the two contacts are formed in an all-or-nothing way, so as to maximize the mutual information between their indicator variables.
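The mutual information underlying pairwise cooperativity can be illustrated with a short sketch. It computes the standard mutual information, in bits, between two binary contact indicators from their joint distribution; the function names are hypothetical, and the exact cooperativity score used in the study may be normalized or combined differently.

```python
import math
from collections import Counter

def joint_distribution(a_series, b_series):
    """Estimate P(a, b) from two equal-length series of 0/1 contact indicators."""
    counts = Counter(zip(a_series, b_series))
    n = len(a_series)
    return {state: c / n for state, c in counts.items()}

def mutual_information(joint):
    """Mutual information (bits) between two binary indicators.

    joint maps (a, b) states to probabilities; missing states count as 0.
    """
    p_a = {a: sum(p for (x, _), p in joint.items() if x == a) for a in (0, 1)}
    p_b = {b: sum(p for (_, y), p in joint.items() if y == b) for b in (0, 1)}
    mi = 0.0
    for (a, b), p in joint.items():
        if p > 0:
            mi += p * math.log2(p / (p_a[a] * p_b[b]))
    return mi

# All-or-nothing: both contacts form together or not at all.
cooperative = mutual_information(joint_distribution([1, 1, 0, 0], [1, 1, 0, 0]))
# Independent: knowing one contact says nothing about the other.
independent = mutual_information(joint_distribution([1, 1, 0, 0], [1, 0, 1, 0]))
# Always formed: maximal stability, but no shared information.
always_on = mutual_information({(1, 1): 1.0})
```

The all-or-nothing pair reaches the maximum of 1 bit, while the independent pair and the always-formed pair both give 0. The last case shows why stability and cooperativity are treated as separate quantities.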

