PLoS Comput Biol. 2014 Oct 9;10(10):e1003847.
doi: 10.1371/journal.pcbi.1003847. eCollection 2014 Oct.

Improving contact prediction along three dimensions

Christoph Feinauer et al. PLoS Comput Biol.

Abstract

Correlation patterns in multiple sequence alignments of homologous proteins can be exploited to infer information on the three-dimensional structure of their members. The typical pipeline to address this task, which we in this paper refer to as the three dimensions of contact prediction, is to (i) filter and align the raw sequence data representing the evolutionarily related proteins; (ii) choose a predictive model to describe a sequence alignment; (iii) infer the model parameters and interpret them in terms of structural properties, such as an accurate contact map. We show here that all three dimensions are important for overall prediction success. In particular, we show that it is possible to improve significantly along the second dimension by going beyond the pair-wise Potts models from statistical physics, which have hitherto been the focus of the field. These (simple) extensions are motivated by multiple sequence alignments often containing long stretches of gaps which, as a data feature, would be rather untypical for independent samples drawn from a Potts model. Using a large test set of proteins we show that the combined improvements along the three dimensions are as large as any reported to date.
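For orientation, the pair-wise Potts model referred to above is conventionally written as below. The notation is the standard one from the statistical-physics literature and is an assumption here, since the abstract itself defines no symbols: \sigma_i is the residue (or gap) at alignment column i, N is the alignment length, h_i are single-site fields, J_{ij} are pair-wise couplings, and Z is the normalising constant.

```latex
P(\sigma_1, \dots, \sigma_N)
  = \frac{1}{Z} \exp\!\left(
      \sum_{i=1}^{N} h_i(\sigma_i)
      + \sum_{1 \le i < j \le N} J_{ij}(\sigma_i, \sigma_j)
    \right),
\qquad
Z = \sum_{\sigma_1, \dots, \sigma_N} \exp\!\left(
      \sum_{i=1}^{N} h_i(\sigma_i)
      + \sum_{1 \le i < j \le N} J_{ij}(\sigma_i, \sigma_j)
    \right)
```

Under this form, each aligned sequence is treated as an independent sample, which is why long contiguous stretches of gaps are an awkward fit for the model.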

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1. Examples of qualitative contact prediction improvement.
Gray squares: contacts observed in the crystal structure; ovals: predicted contacts (green: correctly predicted, red: incorrectly predicted). Predicted very short-range contacts (not considered in the assessment) are drawn in pale colors. Top row: comparison of plmDCA and gplmDCA; bottom row: plmDCA and plmDCA20. Left panels: contact prediction maps built by plmDCA and gplmDCA/plmDCA20 using protein sequences homologous to 1JFU:A, as explained in Methods. For this protein plmDCA predicts a number of strong couplings at both the N-terminus and the C-terminus, which arise from the high sequence variability at both ends of proteins homologous to 1JFU:A and the many gaps in the multiple sequence alignments at these positions. In gplmDCA these gaps lead to an adjustment of gap parameters rather than to contact predictions; in plmDCA20 these couplings are not included in contact scoring, which has an analogous effect. Right panels: analogous results using protein sequences homologous to 1ATZ, where gplmDCA and plmDCA20 remove strong spurious couplings at the C-terminus.
Figure 2. Prediction precision (PPV), average over all proteins in the main test data set.
For PSICOV, plmDCA, gplmDCA and plmDCA20, the curves show the number of correct predictions among the n highest-scoring pairs divided by n, averaged over proteins. Left panel: PPV for absolute contact index; the horizontal axis shows n. gplmDCA and plmDCA20 yield higher absolute PPV than plmDCA for all n. PSICOV is more often right than plmDCA in its prediction of the first few (strongest) contacts (n = 1), but is inferior to both plmDCA20 and gplmDCA for this test set. Right panel: PPV for relative contact index (fraction of protein length); the horizontal axis shows n/N.
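The PPV definition in this caption (correct predictions among the n highest-scoring pairs, divided by n) can be sketched for a single protein as follows. This is only an illustration: the function name `ppv_curve`, the dictionary layout of the scores, and the `min_separation` cutoff used to skip very short-range pairs are assumptions, not details taken from the paper.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[int, int]  # residue index pair (i, j) with i < j


def ppv_curve(
    scores: Dict[Pair, float],
    true_contacts: Set[Pair],
    min_separation: int = 5,  # hypothetical cutoff for "very short-range" pairs
) -> List[float]:
    """Return [PPV(1), PPV(2), ...] for one protein.

    PPV(n) is the number of true contacts among the n highest-scoring
    pairs divided by n. Pairs closer than min_separation along the
    chain are left out of the ranking.
    """
    ranked = sorted(
        (pair for pair in scores if abs(pair[0] - pair[1]) >= min_separation),
        key=lambda pair: scores[pair],
        reverse=True,
    )
    curve: List[float] = []
    correct = 0
    for n, pair in enumerate(ranked, start=1):
        if pair in true_contacts:
            correct += 1
        curve.append(correct / n)
    return curve


# Toy example: (3, 4) has the top score but is too short-range to be ranked.
example_scores = {(0, 6): 0.9, (1, 8): 0.8, (2, 9): 0.7, (3, 4): 0.95}
example_contacts = {(0, 6), (2, 9)}
example_curve = ppv_curve(example_scores, example_contacts)
```

Averaging such curves over all proteins at each n (or at each fraction n/N of the protein length N) would give curves of the kind plotted in the left and right panels.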
Figure 3. Contact prediction accuracy (mean absolute PPV) for proteins in the main test set by plmDCA (abscissa) vs gplmDCA (ordinate) in the left plot and plmDCA vs plmDCA20 in the right plot.
Most of the points fall above the diagonal, indicating that gplmDCA is more accurate than plmDCA for most of the proteins in the test set. The data points can be fitted with a straight line by ordinary least squares (OLS) regression, with slope 1.0764±0.005 (R² = 0.987), indicating that the advantage of gplmDCA over plmDCA tends to grow the more accurate plmDCA itself is. The slope of the OLS regression line for plmDCA20 is 1.106±0.004 (R² = 0.992).
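As a rough illustration of the straight-line fit described in this caption, the sketch below computes an OLS slope, intercept and R² for paired per-protein accuracies. Whether the authors fitted with or without an intercept is not stated here, so the intercept term, the function name `ols_line`, and the toy numbers are all assumptions.

```python
from typing import Sequence, Tuple


def ols_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float]:
    """Fit y = slope * x + intercept by ordinary least squares.

    Returns (slope, intercept, r_squared).
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
    return slope, intercept, r_squared


# Toy data (not from the paper): the second method is uniformly 10% better.
baseline_ppv = [0.1, 0.2, 0.4, 0.8]
improved_ppv = [1.1 * value for value in baseline_ppv]
fit_slope, fit_intercept, fit_r2 = ols_line(baseline_ppv, improved_ppv)
```

A slope above 1 with points above the diagonal corresponds to the pattern the caption describes: the gain is larger, in absolute terms, for proteins on which the baseline is already accurate.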
Figure 4. Contact prediction accuracy for proteins in the test set by plmDCA20, gplmDCA and plmDCA vs number of homology-reduced sequences in the alignment (maximum 90% sequence identity), when considering the top 10%, 25% (top row), 50% and 100% (bottom row) contacts, 100% being the same number of contacts as the number of amino acids in the protein.
The advantage of gplmDCA and plmDCA20 is particularly interesting in the ranges highlighted by vertical dotted lines. For the top 10% and top 25% (top row) these ranges are approximately 60–2500 and 250–23000 sequences, while for the top 50% and top 100% (bottom row) they extend from about 250 sequences in the alignment upwards. PSICOV outperforms both plmDCA and gplmDCA when there are fewer than about 100 sequences in the alignment.
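The horizontal axis in this figure counts sequences after homology reduction at a maximum of 90% identity. One simple way to obtain such a count is a greedy filter, sketched below. The procedure actually used by the authors is not given in this excerpt, so the greedy strategy, the identity definition (fraction of identical columns, gaps included) and the function names are assumptions.

```python
from typing import List, Sequence


def sequence_identity(a: str, b: str) -> float:
    """Fraction of alignment columns at which two aligned sequences agree."""
    if len(a) != len(b) or not a:
        raise ValueError("aligned sequences must be non-empty and equally long")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def homology_reduce(alignment: Sequence[str], max_identity: float = 0.9) -> List[str]:
    """Greedily keep sequences no more than max_identity identical to any kept one."""
    kept: List[str] = []
    for seq in alignment:
        if all(sequence_identity(seq, other) <= max_identity for other in kept):
            kept.append(seq)
    return kept


# Toy alignment: the second row duplicates the first and is dropped.
toy_alignment = ["ACDEFGHIKL", "ACDEFGHIKL", "ACDEFGHIVV", "WWWWWWWWWW"]
reduced = homology_reduce(toy_alignment)
reduced_count = len(reduced)
```

The greedy result depends on the order of the input sequences, and the all-against-kept comparison is quadratic in the worst case, so this is suited only to illustrating the idea on small alignments.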
Figure 5. Prediction performance as assessed by relative PPV and criterion for gplmDCA, plmDCA20 and plmDCA run on Pfam and HHblits alignments in the reduced test data set.
The reduced test data set comprises the proteins in the main test data set where a comparison can be made to Pfam alignments, as described in Methods.
Figure 6. Scatter plots of prediction by absolute PPV and criterion for individual proteins in the reduced test data set.
Top row shows, analogously to Figure 3 (in Results, for the main data set), gplmDCA vs plmDCA for Pfam alignments (left panel) and for HHblits alignments (right panel). Center row shows analogous data, but for plmDCA vs plmDCA20 comparison. Bottom row shows prediction for HHblits alignments vs Pfam alignments using plmDCA (left panel), gplmDCA (central panel) and plmDCA20 (right panel).
Figure 7. Difference in contact prediction between plmDCA and gplmDCA for the sensor domain of histidine kinase DcuS from E. coli (pdbid:3BY8:A).
Left figure: protein structure, with some of the contacts uniquely predicted by gplmDCA marked by dashed lines. Center and right: contact maps, with the region of interest marked in faint blue. Predictions by plmDCA20 and gplmDCA differ slightly, but both maintain the same accuracy and uncover additional contacts that are important for protein structure prediction.
Figure 8. Mispredictions.
Among the 729 proteins plotted in Figure 3, fewer than 5% are prominent outliers where plmDCA (the model with no gap parameters) clearly does better than gplmDCA (the model with gap parameters). The upper row depicts gplmDCA predictions, the lower row plmDCA20 predictions. Left panels: contact maps of protein S, where gplmDCA wrongly predicts a number of spurious contacts between the N- and C-termini. Right panels: contact maps of transcription elongation factor Spt4. The prediction artifacts of gplmDCA are not detectable in the plmDCA20 predictions. For further discussion, see main text.
