Proc Natl Acad Sci U S A. 2013 Sep 24;110(39):15674-9.
doi: 10.1073/pnas.1314045110. Epub 2013 Sep 5.

Assessing the utility of coevolution-based residue-residue contact predictions in a sequence- and structure-rich era

Hetunandan Kamisetty et al. Proc Natl Acad Sci U S A. 2013.

Erratum in

  • Proc Natl Acad Sci U S A. 2013 Nov 12;110(46):18734

Abstract

Recently developed methods have shown considerable promise in predicting residue-residue contacts in protein 3D structures using evolutionary covariance information. However, these methods require large numbers of evolutionarily related sequences to robustly assess the extent of residue covariation, and the larger the protein family, the more likely that contact information is unnecessary because a reasonable model can be built based on the structure of a homolog. Here we describe a method that integrates sequence coevolution and structural context information using a pseudolikelihood approach, allowing more accurate contact predictions from fewer homologous sequences. We rigorously assess the utility of predicted contacts for protein structure prediction using large and representative sequence and structure databases from recent structure prediction experiments. We find that contact predictions are likely to be accurate when the number of aligned sequences (with sequence redundancy reduced to 90%) is greater than five times the length of the protein, and that accurate predictions are likely to be useful for structure modeling if the aligned sequences are more similar to the protein of interest than to the closest homolog of known structure. These conditions are currently met by 422 of the protein families collected in the Pfam database.

Keywords: Markov random field; maximum-entropy model; protein coevolution.
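
The alignment-depth criterion stated in the abstract (more than five redundancy-reduced sequences per residue) can be illustrated with a small sketch. It assumes the alignment has already been filtered to 90% sequence redundancy by some other tool; the function names, and the handling of ratios that fall exactly on 1 or 5, are hypothetical choices made for illustration and are not taken from the paper.

```python
def sequences_per_length(n_nonredundant_seqs: int, protein_length: int) -> float:
    """Ratio of aligned sequences (after 90% redundancy reduction) to protein length."""
    if protein_length <= 0:
        raise ValueError("protein_length must be positive")
    return n_nonredundant_seqs / protein_length


def likely_accurate(n_nonredundant_seqs: int, protein_length: int,
                    threshold: float = 5.0) -> bool:
    """True when the alignment depth exceeds `threshold` sequences per residue.

    The default of 5 follows the abstract; treating the comparison as strict
    ('greater than') is an assumption about the boundary case.
    """
    return sequences_per_length(n_nonredundant_seqs, protein_length) > threshold


def depth_bin(n_nonredundant_seqs: int, protein_length: int) -> str:
    """Assign a family to one of three depth groups: below 1, 1 to 5, above 5.

    Which group a ratio of exactly 1 or exactly 5 belongs to is an assumption.
    """
    ratio = sequences_per_length(n_nonredundant_seqs, protein_length)
    if ratio < 1:
        return "less than 1"
    if ratio <= 5:
        return "between 1 and 5"
    return "greater than 5"


if __name__ == "__main__":
    # A hypothetical 100-residue protein with 600 non-redundant aligned sequences.
    print(likely_accurate(600, 100), depth_bin(600, 100))
```

This covers only the depth condition. The abstract's second condition, that the aligned sequences be more similar to the protein of interest than to the closest homolog of known structure, depends on a profile comparison that is not sketched here.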

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Fig. 1.
Accuracy of contact prediction. Comparison of GREMLIN with DCA (A), PSICOV (B), MIc (C), and GREMLIN when no prior information is used (D). Each point corresponds to a protein; the axes indicate the accuracy of the top-ranked L/2 contacts predicted by the indicated methods. (E) Average accuracy for varying numbers of predictions (solid lines) and fraction of targets where GREMLIN was more accurate than the indicated method (broken lines). Dependence of accuracy of the top L/5 (F) and L/2 (G) predictions on the alignment depth for a subset of 75 targets with deep alignments.
Fig. 2.
Improved contact prediction by integrating coevolution and predicted structure-feature information. Accuracy of the top L/2 predictions between positions at least 12 residues apart, with and without priors, on a dataset of 73 proteins that do not have homologs of known structure (Dataset S2; results between positions at least 24 residues apart are included in the SI Appendix, Fig. S7 A–D). (A) Using secondary structure and sequence separation priors, GREMLIN achieves higher accuracy than PLMDCA. (C) SVMCON and GREMLIN predictive accuracy are not highly correlated. (B and D) Integrating a Support Vector Machine (SVM)-based prior into GREMLIN improves upon both methods alone.
Fig. 3.
Utility of contact prediction for structure modeling. (A) Ranking of alternate models by GREMLINΔ. Three scenarios are illustrated; each represents a distinct protein target. Black dots indicate alternate models; red dots indicate native structures. (Left) GREMLINΔ is not useful in selecting the closest model and does not correctly discriminate between native (target pdb:4hwnA) and homology models; (Middle) GREMLINΔ ranks homology models correctly (top five models within 0.05 of best five on average; R² between GREMLIN score and fraction of native contacts > 0.8) but adds no additional information (target pdb:4fn4D); (Right) GREMLINΔ discriminates between best model and native structure (target pdb:4hxtA). In an additional 6% of the targets, GREMLINΔ correctly discriminated the native from the homology models, but there were not enough models to reliably establish accuracy of ranking. (B) [formula image] predicts GREMLINΔ: GREMLINΔ versus structural similarity of homolog to native structure computed by TM-align (14) (for homologs of all targets with high-resolution crystal structures < 2.1 Å). When [formula image] (blue bars), GREMLINΔ is rarely better than random (green bars, constructed by pooling 100 permutations of predicted scores for each target). When [formula image] (red bars), GREMLINΔ is significantly positive and contact scores successfully discriminate between native and homology model even when the homolog is likely to be from the same fold (similarity [formula image]). Error bars show mean and SD of distributions in all cases.
Fig. 4.
Frequency of utility of contact prediction. The protein families in the Pfam database were divided into three groups based on the HHsearch P value of the closest protein of known structure (Left, HHsearch P value > 10^−6.5; Middle, HHsearch P value between 10^−40 and 10^−6.5; Right, HHsearch P value < 10^−40). Within each group, the numbers of families with sequences/length less than 1, between 1 and 5, and greater than 5 are shown in blue, red, and green, respectively (Upper bars). For families with > 5 sequences per position (Upper green bars), the distribution of [formula image] to the closest protein of known structure is shown in the lower panel. In cases where the difference in profiles is large ([formula image]: right bar in each group, Lower), these predictions are likely to improve on comparative models.
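
The accuracy measure used in the captions for Figs. 1 and 2 (fraction of the top-ranked L/2 or L/5 predictions that are true contacts, optionally restricted to positions at least 12 residues apart) can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function name, the dictionary and set input formats, and the assumption that native contacts are supplied as a precomputed set of index pairs are all hypothetical. How a native contact is defined is not stated in this excerpt.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[int, int]


def top_contact_accuracy(scores: Dict[Pair, float],
                         native_contacts: Set[Pair],
                         length: int,
                         fraction: float = 0.5,
                         min_separation: int = 12) -> float:
    """Fraction of the top `length * fraction` predicted pairs that are native contacts.

    scores          -- predicted contact score per residue pair (i, j)
    native_contacts -- true contacts as (i, j) pairs with i < j (assumed input format)
    fraction        -- 0.5 for top L/2, 0.2 for top L/5
    min_separation  -- only pairs with |i - j| >= this value are ranked
    """
    # Keep only pairs that are far enough apart in sequence.
    candidates = [(score, i, j) for (i, j), score in scores.items()
                  if abs(i - j) >= min_separation]
    # Rank by predicted score, highest first.
    candidates.sort(key=lambda item: item[0], reverse=True)
    n_top = max(1, int(length * fraction))
    top = candidates[:n_top]
    if not top:
        return 0.0
    # Count ranked pairs found in the native set, normalising pair order.
    hits = sum(1 for _, i, j in top
               if (min(i, j), max(i, j)) in native_contacts)
    return hits / len(top)
```

If fewer than L/2 pairs survive the separation filter, the sketch divides by the number of pairs actually ranked; dividing by L/2 instead would be an equally defensible convention.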

References

    1. Tress ML, Valencia A. Predicted residue–residue contacts can help the scoring of 3D models. Proteins Struct Funct Bioinf. 2010;78(8):1980–1991.
    2. Morcos F, et al. Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proc Natl Acad Sci USA. 2011;108(49):E1293–E1301.
    3. Jones DT, Buchan DWA, Cozzetto D, Pontil M. PSICOV: Precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments. Bioinformatics. 2012;28(2):184–190.
    4. Marks DS, et al. Protein 3D structure computed from evolutionary sequence variation. PLoS ONE. 2011;6(12):e28766.
    5. Balakrishnan S, Kamisetty H, Carbonell JG, Lee SI, Langmead CJ. Learning generative models for protein fold families. Proteins Struct Funct Bioinf. 2011;79(4):1061–1078.
