PLoS Comput Biol. 2010 Jan;6(1):e1000633.
doi: 10.1371/journal.pcbi.1000633. Epub 2010 Jan 1.

Disentangling direct from indirect co-evolution of residues in protein alignments

Lukas Burger et al. PLoS Comput Biol. 2010 Jan.

Abstract

Predicting protein structure from primary sequence is one of the ultimate challenges in computational biology. Given the large amount of available sequence data, the analysis of co-evolution, i.e., statistical dependency, between columns in multiple alignments of protein domain sequences remains one of the most promising avenues for predicting residues that are contacting in the structure. A key impediment to this approach is that strong statistical dependencies are also observed for many residue pairs that are distal in the structure. Using a comprehensive analysis of protein domains with available three-dimensional structures we show that co-evolving contacts very commonly form chains that percolate through the protein structure, inducing indirect statistical dependencies between many distal pairs of residues. We characterize the distributions of length and spatial distance traveled by these co-evolving contact chains and show that they explain a large fraction of observed statistical dependencies between structurally distal pairs. We adapt a recently developed Bayesian network model into a rigorous procedure for disentangling direct from indirect statistical dependencies, and we demonstrate that this method not only successfully accomplishes this task, but also allows contacts with weak statistical dependency to be detected. To illustrate how additional information can be incorporated into our method, we incorporate a phylogenetic correction, and we develop an informative prior that takes into account that the probability for a pair of residues to contact depends strongly on their primary-sequence distance and the amount of conservation that the corresponding columns in the multiple alignment exhibit. We show that our model including these extensions dramatically improves the accuracy of contact prediction from multiple sequence alignments.
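
The abstract treats co-evolution as statistical dependency between columns of a multiple alignment. As a rough illustration, the sketch below computes mutual information between two columns, the baseline measure compared against in Figure 7. It is a minimal sketch, not the statistic the authors develop: the function name `column_mutual_information` and the toy alignment are assumptions made for this example, and no pseudocounts, gap handling, or sequence weighting are applied.

```python
from collections import Counter
from math import log


def column_mutual_information(alignment, i, j):
    """Mutual information (in nats) between columns i and j.

    `alignment` is a list of equal-length strings (hypothetical toy input).
    Frequencies are raw counts, with no pseudocounts or sequence weights.
    """
    n = len(alignment)
    pair_counts = Counter((seq[i], seq[j]) for seq in alignment)
    left_counts = Counter(seq[i] for seq in alignment)
    right_counts = Counter(seq[j] for seq in alignment)
    mi = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n
        # p_ab / (p_a * p_b), written in terms of counts
        ratio = count * n / (left_counts[a] * right_counts[b])
        mi += p_ab * log(ratio)
    return mi


# Toy alignment: columns 0 and 1 covary perfectly, column 2 is independent.
toy_alignment = ["AKL", "AKV", "GEL", "GEV"]
mi_covarying = column_mutual_information(toy_alignment, 0, 1)
mi_independent = column_mutual_information(toy_alignment, 0, 2)
```

In the toy alignment, the perfectly covarying pair yields log 2 nats and the independent pair yields zero.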

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1. Statistical dependencies of structurally close and distal residue pairs.
Left panel: Reverse-cumulative distribution of formula image Z-values (horizontal axis) for structurally close (red) and distal (blue) residue pairs. Right panel: The fraction of all residue pairs that are distal in the structure as a function of their statistical dependency (Z-value).
Figure 2. Statistical dependencies between pairs of residues reflect both direct and indirect interactions.
The five letters (A through E) represent five residues, and their distances in the figure reflect their distances in the three-dimensional structure. We assume that the pairs A–B, B–C, and D–E are in contact and interact directly. The thickness of the edges between pairs of nodes reflects the statistical dependencies between the corresponding columns in the multiple alignment.
Figure 3. Illustration of a chain that explains the dependency between two distant residues.
The distance between the nodes illustrates the spatial separation and the thickness of the edges represents the strength of the dependence. Nodes formula image and formula image can be connected indirectly via a chain of contacts (formula image) through nodes formula image and formula image (in blue) whose edges all have higher dependency (i.e. formula image, formula image and formula image).
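
The idea in this caption can be sketched as a graph search: a distal pair is "explained" if its two residues are linked by a path of contacts, each of which has a stronger dependency than the pair itself. The code below is a minimal sketch under that reading only. The function name `explained_by_chain`, the residue labels, and the dependency values are hypothetical, and the chain score used in the paper (see Figure 4) is not reproduced here.

```python
from collections import deque


def explained_by_chain(dependency, contacts, i, j):
    """Return a chain of contacts from i to j whose edges all have a higher
    dependency than the pair (i, j) itself, or None if no such chain exists.

    `dependency` maps frozenset({a, b}) to a dependency value;
    `contacts` is an iterable of residue pairs that touch in the structure.
    """
    threshold = dependency[frozenset((i, j))]
    neighbours = {}
    for a, b in contacts:
        # Keep only contact edges stronger than the pair being explained.
        if dependency[frozenset((a, b))] > threshold:
            neighbours.setdefault(a, []).append(b)
            neighbours.setdefault(b, []).append(a)
    # Breadth-first search returns a chain with the fewest steps.
    previous = {i: None}
    queue = deque([i])
    while queue:
        node = queue.popleft()
        if node == j:
            chain = []
            while node is not None:
                chain.append(node)
                node = previous[node]
            return chain[::-1]
        for nxt in neighbours.get(node, []):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None


# Hypothetical toy data: i and j are distal, linked through k and l.
toy_contacts = [("i", "k"), ("k", "l"), ("l", "j")]
toy_dependency = {
    frozenset(("i", "j")): 1.0,
    frozenset(("i", "k")): 3.0,
    frozenset(("k", "l")): 2.5,
    frozenset(("l", "j")): 2.0,
}
chain = explained_by_chain(toy_dependency, toy_contacts, "i", "j")
# chain == ['i', 'k', 'l', 'j']
```

If any link in the chain were weaker than the dependency of the pair itself, the search would return `None` for this toy input.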
Figure 4. Most distal co-evolving pairs can be explained by chains of co-evolving contacts.
Left panel: Cumulative distributions for the number of distal pairs formula image (formula image) that co-evolve (formula image) that can be explained by chains of co-evolving contacts as a function of the score formula image of the best chain (see text). The blue line shows the distribution for the true data and the red curve for the randomized data. Right panel: Ratio (fold-enrichment) of the fraction of distal co-evolving pairs that can be explained by chains versus the fraction that can be explained by chains from the randomized data. The vertical axis is shown on a logarithmic scale.
Figure 5. Statistics of co-evolving contact chains.
Left panel: Reverse-cumulative distribution of the spatial distances between co-evolving pairs that can be explained by chains of co-evolving contacts of score formula image. The vertical axis is shown on a logarithmic scale. The dotted line shows a fit to an exponential distribution formula image. Right panel: Number of steps in the shortest co-evolving contact chain as a function of the spatial distance of the co-evolving pair. The blue line shows the mean distance and the red dotted lines show mean plus and minus one standard deviation. The black dotted line shows a linear fit, the fitted slope of which corresponds to an increase in distance by formula imageÅ per additional contact in the chain.
Figure 6. Illustration of the calculation of the posterior probability.
For the sake of simplicity, we here show an example for an alignment with only formula image columns. The posterior probability for edge formula image is the statistical weight of all spanning trees that contain this edge relative to the weight of all possible spanning trees.
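
The ratio described in this caption can be computed without enumerating trees. The sketch below assumes that the weight of a spanning tree is the product of its edge weights, which the caption does not state explicitly. It then uses the weighted matrix-tree theorem: the total weight of all spanning trees equals a cofactor of the weighted graph Laplacian. The function names and the four-column toy weights are choices made for this example, not the authors' code.

```python
def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-300:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def total_tree_weight(weights):
    """Sum over all spanning trees of the product of their edge weights."""
    n = len(weights)
    laplacian = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            if a != b:
                laplacian[a][b] = -weights[a][b]
                laplacian[a][a] += weights[a][b]
    # Delete the first row and column; any cofactor gives the same value.
    minor = [row[1:] for row in laplacian[1:]]
    return determinant(minor)


def edge_posterior(weights, i, j):
    """Weight of spanning trees containing edge (i, j) over the total weight."""
    without_edge = [row[:] for row in weights]
    without_edge[i][j] = without_edge[j][i] = 0.0
    total = total_tree_weight(weights)
    # Trees containing the edge = all trees minus trees that avoid it.
    return 1.0 - total_tree_weight(without_edge) / total


# Toy example: four columns with uniform pairwise weights.
uniform = [[0.0 if a == b else 1.0 for b in range(4)] for a in range(4)]
posterior_uniform = edge_posterior(uniform, 0, 1)  # approximately 0.5
```

With uniform weights on four nodes there are 16 spanning trees and every edge appears in 8 of them, so each edge has posterior one half, which gives a quick sanity check on the sketch.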
Figure 7. Accuracy of contact predictions for all alignments.
Shown are the performances of mutual information (black), formula image (blue), and the posterior probabilities (red). The vertical axis shows mean positive predictive value (PPV, solid line) plus and minus one standard error (dashed lines) as a function of sensitivity (horizontal axis, shown on a logarithmic scale). The left panel shows predictions for all residue pairs, the middle using only predictions for residues separated by at least formula image positions in the primary sequence, and the right panel for pairs separated by at least formula image positions.
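
The curves in this figure plot positive predictive value against sensitivity as predictions are accepted in order of decreasing score. A minimal sketch of that calculation for a single alignment follows. The function name and toy inputs are hypothetical; it assumes at least one true contact, breaks ties arbitrarily, and omits the averaging over alignments and the standard errors shown in the figure.

```python
def ppv_sensitivity_curve(scores, is_contact):
    """Return (sensitivity, PPV) points as predictions are accepted by rank.

    `scores` are prediction scores for residue pairs and `is_contact` the
    matching booleans from the known structure (hypothetical inputs).
    """
    total_contacts = sum(1 for flag in is_contact if flag)
    ranked = sorted(zip(scores, is_contact), key=lambda p: p[0], reverse=True)
    curve = []
    true_positives = 0
    for rank, (_, flag) in enumerate(ranked, start=1):
        if flag:
            true_positives += 1
        sensitivity = true_positives / total_contacts  # recovered contacts
        ppv = true_positives / rank                    # correct predictions
        curve.append((sensitivity, ppv))
    return curve


toy_scores = [0.9, 0.8, 0.3, 0.1]
toy_contacts_flags = [True, False, True, False]
toy_curve = ppv_sensitivity_curve(toy_scores, toy_contacts_flags)
```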
Figure 8. Posteriors reflect the extent to which co-evolving pairs can be explained by contact chains.
Shown are the reverse cumulative distributions of the posteriors of distal co-evolving pairs (formula image) that can be explained by contact chains of scores formula image (red), formula image (dark blue), formula image (light blue), and for all distal co-evolving pairs (green). For comparison the reverse cumulative distribution of posteriors for co-evolving contacts (formula image) is also shown (black).
Figure 9. The posterior predicts structurally close pairs independent of their direct statistical dependence.
The structural distance distribution (vertical axis) is shown for all pairs (blue) and for pairs with posterior probability larger than formula image (red) as a function of the Z-value of the formula image statistic (horizontal axis). The solid lines show the medians of the distributions and the dashed lines the formula imageth and formula imageth percentiles.
Figure 10. Improved accuracy of contact predictions when a phylogenetic correction is included.
In blue, we show the performance of the phylogenetically-corrected posterior probabilities, in black the performance of the predictions based on the average-product corrected (APC) mutual information, and in red the performance of the posterior probability without phylogenetic correction. Curves were calculated as in Figure 7.
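
For readers unfamiliar with the APC baseline named in this caption, the sketch below applies the standard average-product correction to a symmetric matrix of mutual information values: each entry is reduced by the product of its two column means divided by the overall mean. The function name and the toy matrix are assumptions for illustration, and the sketch assumes the overall mean is non-zero.

```python
def average_product_correction(mi):
    """Average-product correction of a symmetric mutual information matrix.

    corrected[i][j] = mi[i][j] - mean_i * mean_j / overall_mean, where the
    means run over off-diagonal entries. Assumes overall_mean is non-zero.
    """
    n = len(mi)
    column_mean = [
        sum(mi[i][k] for k in range(n) if k != i) / (n - 1) for i in range(n)
    ]
    overall_mean = sum(
        mi[i][j] for i in range(n) for j in range(n) if i != j
    ) / (n * (n - 1))
    corrected = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                background = column_mean[i] * column_mean[j] / overall_mean
                corrected[i][j] = mi[i][j] - background
    return corrected


# Toy matrix: uniform background of 1.0, with pair (0, 1) raised to 3.0.
toy_mi = [[0.0 if a == b else 1.0 for b in range(4)] for a in range(4)]
toy_mi[0][1] = toy_mi[1][0] = 3.0
toy_apc = average_product_correction(toy_mi)
```

In the toy matrix, the raised pair keeps the largest corrected value, while pairs that merely share a column with it are pushed below the untouched background pair.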
Figure 11. Occurrence of contacts and co-evolution as a function of primary sequence separation.
Left panel: The fraction of residue pairs that are in contact in the structure as a function of primary sequence separation formula image. The solid blue line shows the mean, the dashed blue lines the mean ± one standard error. The dashed black line shows the function formula image. Right panel: The Z-value distribution of the formula image statistics for all contacting pairs at different primary sequence separations. The blue line represents the median and the red lines represent the formula imageth, formula imageth, formula imageth and formula imageth percentiles, respectively. The Z-value was calculated with respect to the mean and standard deviation of the formula image distribution of all pairs (including distal ones). In both panels only sequence separations up to formula image residues are shown, as the curves become very noisy for larger sequence separations.
Figure 12. Contact-degree and co-evolution as a function of positional entropy.
Left panel: Average number of contacts of a residue (solid line) as a function of the entropy of its alignment column. The dashed lines denote the mean ± one standard error. Right panel: The Z-value distribution of both formula image (blue) and formula image (red) for all contacting pairs versus the sum of entropies of the corresponding columns. The solid lines denote the medians and the dashed lines the 25th and 75th percentiles.
Figure 13. Improved accuracy of contact prediction when an informative prior is included.
In blue, we show the performance of the posterior probabilities that take primary-sequence separation and column entropy into account. For comparison we show in red the performance of the posteriors with phylogenetic correction but uniform prior, which are the same as the blue lines in Figure 10.
Figure 14. Estimation of prior probabilities.
The left panel shows the dependence between the fraction of pairs that are in contact and primary sequence separation for all pairs (in blue) as well as for pairs whose sum of entropies lies in a given entropy bin (formula image in red, formula image in green, formula image in black and formula image in magenta). For the sake of clarity, only a few selected entropy bins across the entire range are shown. The right panel shows the estimated function formula image, which describes how the probability of an edge to be a contact depends on the sum of entropies of the corresponding columns of the alignment (see text).
