PLoS Comput Biol. 2010 Jan;6(1):e1000633.

doi: 10.1371/journal.pcbi.1000633. Epub 2010 Jan 1.

Disentangling direct from indirect co-evolution of residues in protein alignments

Lukas Burger¹, Erik van Nimwegen

PMID: 20052271
PMCID: PMC2793430
DOI: 10.1371/journal.pcbi.1000633

Lukas Burger et al. PLoS Comput Biol. 2010 Jan.

Affiliation

¹ Biozentrum, University of Basel, and Swiss Institute of Bioinformatics, Basel, Switzerland.

Abstract

Predicting protein structure from primary sequence is one of the ultimate challenges in computational biology. Given the large amount of available sequence data, the analysis of co-evolution, i.e., statistical dependency, between columns in multiple alignments of protein domain sequences remains one of the most promising avenues for predicting residues that are contacting in the structure. A key impediment to this approach is that strong statistical dependencies are also observed for many residue pairs that are distal in the structure. Using a comprehensive analysis of protein domains with available three-dimensional structures we show that co-evolving contacts very commonly form chains that percolate through the protein structure, inducing indirect statistical dependencies between many distal pairs of residues. We characterize the distributions of length and spatial distance traveled by these co-evolving contact chains and show that they explain a large fraction of observed statistical dependencies between structurally distal pairs. We adapt a recently developed Bayesian network model into a rigorous procedure for disentangling direct from indirect statistical dependencies, and we demonstrate that this method not only successfully accomplishes this task, but also allows contacts with weak statistical dependency to be detected. To illustrate how additional information can be incorporated into our method, we incorporate a phylogenetic correction, and we develop an informative prior that takes into account that the probability for a pair of residues to contact depends strongly on their primary-sequence distance and the amount of conservation that the corresponding columns in the multiple alignment exhibit. We show that our model including these extensions dramatically improves the accuracy of contact prediction from multiple sequence alignments.
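
As a rough illustration of what "statistical dependency between columns" can mean in practice, the sketch below computes the mutual information between two alignment columns from empirical frequencies. Mutual information is one of the measures compared in the figures below; it is not the Bayesian network model of the paper. The function name and the toy alignment are hypothetical, and gaps are assumed to be treated like any other symbol.

```python
from collections import Counter
from math import log

def mutual_information(col_a, col_b):
    """Mutual information (in nats) between two alignment columns of equal length."""
    if len(col_a) != len(col_b):
        raise ValueError("columns must contain the same number of sequences")
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        # p_ab * log(p_ab / (p_a * p_b)), written with raw counts
        mi += (n_ab / n) * log(n_ab * n / (count_a[a] * count_b[b]))
    return mi

# Toy alignment, one string per sequence (hypothetical data).
alignment = ["AKL", "AKV", "GEL", "GEV"]
columns = list(zip(*alignment))
mi_coupled = mutual_information(columns[0], columns[1])      # columns vary together
mi_independent = mutual_information(columns[0], columns[2])  # columns vary independently
```

In this toy case the first two columns are perfectly coupled and the first and third are independent, so the first value is the largest possible for two equally frequent states and the second is zero.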

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Figure 1. Statistical dependencies of structurally close and distal residue pairs.**
Left panel: Reverse-cumulative distribution of Z-values (horizontal axis) for structurally close (red) and distal (blue) residue pairs. Right panel: The fraction of all residue pairs that are distal in the structure as a function of their statistical dependency (Z-value).

**Figure 2. Statistical dependencies between pairs of residues reflect both direct and indirect interactions.**
The letters (A through E) represent residues and their distances in the figure reflect their distances in the three-dimensional structure. We assume that the pairs A–B, B–C, and D–E are in contact and interact directly. The thickness of the edges between pairs of nodes reflects the statistical dependencies between the corresponding columns in the multiple alignment.

**Figure 3. Illustration of a chain that explains the dependency between two distant residues.**
The distance between the nodes illustrates the spatial separation and the thickness of the edges represents the strength of the dependence. The two distant nodes can be connected indirectly via a chain of contacts through intermediate nodes (in blue) whose edges all have higher dependency than the direct dependency between the two distant nodes.
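
The chain criterion described in this caption can be sketched as a graph search: keep only those contacts whose dependency exceeds that of the distal pair, then ask whether the two residues are connected. This is a minimal sketch under stated assumptions, not the paper's chain-scoring procedure (which is defined in the full text). The function name, the dictionary layout, and the toy dependency values are all hypothetical; the residue labels follow Figure 2.

```python
from collections import deque

def explained_by_chain(dependency, contacts, start, end):
    """Return a chain of contacts from start to end in which every edge has a
    higher dependency than the direct start-end pair, or None if none exists."""
    threshold = dependency[frozenset((start, end))]
    # Adjacency list restricted to contacts stronger than the direct pair.
    neighbours = {}
    for a, b in contacts:
        if dependency[frozenset((a, b))] > threshold:
            neighbours.setdefault(a, []).append(b)
            neighbours.setdefault(b, []).append(a)
    # Breadth-first search finds a chain with the fewest steps.
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == end:
            chain = []
            while node is not None:
                chain.append(node)
                node = previous[node]
            return chain[::-1]
        for nxt in neighbours.get(node, []):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None

# Hypothetical dependencies for the residues of Figure 2.
contacts = [("A", "B"), ("B", "C"), ("D", "E")]
dependency = {
    frozenset("AB"): 5.0, frozenset("BC"): 4.0, frozenset("DE"): 2.0,
    frozenset("AC"): 3.0,  # distal pair, weaker than both contacts on the chain
    frozenset("AD"): 1.0,  # distal pair with no connecting chain
}
chain_ac = explained_by_chain(dependency, contacts, "A", "C")
chain_ad = explained_by_chain(dependency, contacts, "A", "D")
```

Here the A–C dependency is explained by the chain A, B, C, while no chain of contacts links A and D.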

**Figure 4. Most distal co-evolving pairs can be explained by chains of co-evolving contacts.**
Left panel: Cumulative distributions for the number of distal, co-evolving pairs that can be explained by chains of co-evolving contacts, as a function of the score of the best chain (see text). The blue line shows the distribution for the true data and the red curve for the randomized data. Right panel: Ratio (fold-enrichment) of the fraction of distal co-evolving pairs that can be explained by chains versus the fraction that can be explained by chains from the randomized data. The vertical axis is shown on a logarithmic scale.

**Figure 5. Statistics of co-evolving contact chains.**
Left panel: Reverse-cumulative distribution of the spatial distances between co-evolving pairs that can be explained by chains of co-evolving contacts at a fixed chain-score cutoff. The vertical axis is shown on a logarithmic scale. The dotted line shows a fit to an exponential distribution. Right panel: Number of steps in the shortest co-evolving contact chain as a function of the spatial distance of the co-evolving pair. The blue line shows the mean distance and the red dotted lines show the mean plus and minus one standard deviation. The black dotted line shows a linear fit, whose slope gives the increase in distance (in Å) per additional contact in the chain.

**Figure 6. Illustration of the calculation of the posterior probability.**
For the sake of simplicity, we here show an example for an alignment with only a small number of columns. The posterior probability for a given edge is the statistical weight of all spanning trees that contain this edge relative to the weight of all possible spanning trees.
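
The ratio described in this caption can be made concrete by brute force for a tiny example. The sketch below enumerates every spanning tree over a handful of columns, takes the weight of a tree to be the product of its edge weights, and divides the weight of the trees containing an edge by the total. This is an illustrative sketch only: the function names and the edge weights are hypothetical, and how the paper derives its edge weights is not stated here. Enumeration is feasible only for a few nodes; Kirchhoff's matrix-tree theorem expresses the same sums as determinants and would be the natural choice for larger alignments.

```python
from itertools import combinations
from math import prod

def spanning_trees(n_nodes, edges):
    """Yield every set of n_nodes - 1 edges that contains no cycle
    (such a set necessarily connects all nodes)."""
    for subset in combinations(edges, n_nodes - 1):
        parent = list(range(n_nodes))  # union-find forest

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        acyclic = True
        for a, b in subset:
            root_a, root_b = find(a), find(b)
            if root_a == root_b:  # edge would close a cycle
                acyclic = False
                break
            parent[root_a] = root_b
        if acyclic:
            yield subset

def edge_posteriors(n_nodes, weights):
    """weights maps each edge (i, j) with i < j to a positive weight."""
    edges = list(weights)
    total = 0.0
    containing = {edge: 0.0 for edge in edges}
    for tree in spanning_trees(n_nodes, edges):
        tree_weight = prod(weights[e] for e in tree)
        total += tree_weight
        for e in tree:
            containing[e] += tree_weight
    return {e: containing[e] / total for e in edges}

# Four columns, all pairs allowed; edge (0, 1) gets a larger (hypothetical) weight.
weights = {(0, 1): 4.0, (0, 2): 1.0, (0, 3): 1.0,
           (1, 2): 1.0, (1, 3): 1.0, (2, 3): 1.0}
posteriors = edge_posteriors(4, weights)
```

Because every spanning tree on four nodes has exactly three edges, the posteriors always sum to three; with the weights above, the strongly weighted edge receives a posterior of 0.8.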

**Figure 7. Accuracy of contact predictions for all alignments.**
Shown are the performances of mutual information (black), the pairwise dependency statistic (blue), and the posterior probabilities (red). The vertical axis shows mean positive predictive value (PPV, solid line) plus and minus one standard error (dashed lines) as a function of sensitivity (horizontal axis, shown on a logarithmic scale). The left panel shows predictions for all residue pairs; the middle and right panels use only predictions for residue pairs separated by at least a minimum number of positions in the primary sequence, with a different minimum in each panel.
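
A curve of this kind can be computed by ranking residue pairs by score and recording, after each prediction, the fraction of true contacts recovered (sensitivity) and the fraction of predictions that are true contacts (PPV). The following is a minimal sketch with hypothetical names and toy data; the `min_separation` parameter is an assumed way to mimic the panels that exclude pairs close in primary sequence, and averaging over alignments is left out.

```python
def ppv_sensitivity_curve(scores, is_contact, min_separation=0):
    """Return a list of (sensitivity, PPV) points, one per ranked prediction.

    scores and is_contact are dicts keyed by residue-index pairs (i, j).
    Pairs with |i - j| < min_separation are ignored entirely.
    """
    pairs = [p for p in scores if abs(p[0] - p[1]) >= min_separation]
    total_contacts = sum(bool(is_contact[p]) for p in pairs)
    if total_contacts == 0:
        raise ValueError("no contacts among the selected pairs")
    ranked = sorted(pairs, key=scores.get, reverse=True)
    curve = []
    true_positives = 0
    for rank, pair in enumerate(ranked, start=1):
        true_positives += bool(is_contact[pair])
        curve.append((true_positives / total_contacts, true_positives / rank))
    return curve

# Toy predictions (hypothetical scores and contact labels).
scores = {(0, 1): 0.9, (0, 2): 0.8, (0, 3): 0.5, (1, 2): 0.1}
is_contact = {(0, 1): True, (0, 2): False, (0, 3): True, (1, 2): False}
curve_all = ppv_sensitivity_curve(scores, is_contact)
curve_far = ppv_sensitivity_curve(scores, is_contact, min_separation=2)
```

With `min_separation=2` only the pairs (0, 2) and (0, 3) remain, so both the ranking and the number of contacts counted in the sensitivity change.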

**Figure 8. Posteriors reflect the extent to which co-evolving pairs can be explained by contact chains.**
Shown are the reverse cumulative distributions of the posteriors of distal co-evolving pairs that can be explained by contact chains at three different score levels (red, dark blue, and light blue), and of all distal co-evolving pairs (green). For comparison, the reverse cumulative distribution of posteriors for co-evolving contacts is also shown (black).

**Figure 9. The posterior predicts structurally close pairs independent of their direct statistical dependence.**
The structural distance distribution (vertical axis) is shown for all pairs (blue) and for pairs whose posterior probability exceeds a threshold (red) as a function of the Z-value of the dependency statistic (horizontal axis). The solid lines show the medians of the distributions and the dashed lines a lower and an upper percentile.

**Figure 10. Improved accuracy of contact predictions when a phylogenetic correction is included.**
In blue, we show the performance of the phylogenetically-corrected posterior probabilities, in black the performance of the predictions based on the average-product corrected (APC) mutual information, and in red the performance of the posterior probability without phylogenetic correction. Curves were calculated as in Figure 7.
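
For readers unfamiliar with the APC baseline named in this caption, the sketch below applies the average product correction as it is commonly defined: each pairwise mutual information value is reduced by the product of the two columns' mean mutual information divided by the overall mean. The function name and the toy matrix are hypothetical, and means are assumed to be taken over off-diagonal entries only.

```python
def average_product_correction(mi):
    """Apply APC to a symmetric matrix of pairwise mutual information.

    mi is a square list of lists; diagonal entries are ignored and the
    diagonal of the result is left at zero.
    """
    n = len(mi)
    if n < 2:
        raise ValueError("need at least two columns")
    # Mean MI of each column with all other columns.
    col_mean = [sum(mi[i][k] for k in range(n) if k != i) / (n - 1)
                for i in range(n)]
    # Mean MI over all distinct pairs.
    overall = sum(mi[i][j] for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    if overall == 0:
        raise ValueError("overall mean mutual information is zero")
    corrected = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                corrected[i][j] = mi[i][j] - col_mean[i] * col_mean[j] / overall
    return corrected

# Toy matrix (hypothetical values): pair (0, 1) stands out from the background.
mi = [[0.0, 0.5, 0.1],
      [0.5, 0.0, 0.1],
      [0.1, 0.1, 0.0]]
apc = average_product_correction(mi)
```

A matrix in which every pair has the same mutual information is corrected to zero everywhere, which reflects the intent of removing a shared background signal.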

**Figure 11. Occurrence of contacts and co-evolution as a function of primary sequence separation.**
Left panel: The fraction of residue pairs that are in contact in the structure as a function of primary sequence separation. The solid blue line shows the mean, the dashed blue lines the mean ± one standard error. The dashed black line shows a function of the sequence separation. Right panel: The Z-value distribution of the dependency statistic for all contacting pairs at different primary sequence separations. The blue line represents the median and the red lines represent four further percentiles. The Z-value was calculated with respect to the mean and standard deviation of the distribution of all pairs (including distal ones). In both panels only a limited range of sequence separations is shown, as the curves become very noisy for larger sequence separations.

**Figure 12. Contact-degree and co-evolution as a function of positional entropy.**
Left panel: Average number of contacts of a residue (solid line) as a function of the entropy of its alignment column. The dashed lines denote the mean ± one standard error. The right panel shows the Z-value distributions of two dependency statistics (blue and red) for all contacting pairs versus the sum of entropies of the corresponding columns. The solid lines denote the medians and the dashed lines the 25th and 75th percentiles.
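
The column entropy used on the horizontal axes here can be sketched as the Shannon entropy of the empirical residue frequencies in a column. This is a minimal sketch with hypothetical names and toy data; it assumes natural logarithms and treats a gap character like any other symbol, neither of which is specified in this record.

```python
from collections import Counter
from math import log

def column_entropy(column):
    """Shannon entropy (in nats) of the symbols in one alignment column."""
    n = len(column)
    return -sum((count / n) * log(count / n) for count in Counter(column).values())

def entropy_sum(col_a, col_b):
    """Sum of the entropies of two columns, as used for a residue pair."""
    return column_entropy(col_a) + column_entropy(col_b)

conserved = "AAAA"  # fully conserved column
variable = "AAGG"   # two equally frequent residues
pair_entropy = entropy_sum(conserved, variable)
```

A fully conserved column has zero entropy, so the pair above has the entropy of the variable column alone.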

**Figure 13. Improved accuracy of contact prediction when an informative prior is included.**
In blue, we show the performance of the posterior probabilities that take primary-sequence separation and column entropy into account. For comparison we show in red the performance of the posteriors with phylogenetic correction but uniform prior, which are the same as the blue lines in Figure 10.

**Figure 14. Estimation of prior probabilities.**
The left panel shows the dependence between the fraction of pairs that are in contact and primary sequence separation for all pairs (in blue) as well as for pairs whose sum of entropies lies in a given entropy bin (four bins, shown in red, green, black, and magenta). For the sake of clarity, only a few selected entropy bins across the entire range are shown. The right panel shows the estimated function that describes how the probability of an edge to be a contact depends on the sum of entropies of the corresponding columns of the alignment (see text).
