Mol Biol Evol. 2015 Sep;32(9):2456-68.
doi: 10.1093/molbev/msv109. Epub 2015 May 4.

Covariation Is a Poor Measure of Molecular Coevolution

David Talavera et al. Mol Biol Evol. 2015 Sep.

Abstract

Recent developments in the analysis of amino acid covariation are leading to breakthroughs in protein structure prediction, protein design, and prediction of the interactome. It is assumed that observed patterns of covariation are caused by molecular coevolution, where substitutions at one site affect the evolutionary forces acting at neighboring sites. Our theoretical and empirical results cast doubt on this assumption. We demonstrate that the strongest coevolutionary signal is a decrease in evolutionary rate and that unfeasibly long times are required to produce coordinated substitutions. We find that covarying substitutions are mostly found on different branches of the phylogenetic tree, indicating that they are independent events that may or may not be attributable to coevolution. These observations undermine the hypothesis that molecular coevolution is the primary cause of the covariation signal. In contrast, we find that the pairs of residues with the strongest covariation signal tend to have low evolutionary rates, and that it is this low rate that gives rise to the covariation signal. Slowly evolving residue pairs are disproportionately located in the protein's core, which explains covariation methods' ability to detect pairs of residues that are close in three dimensions. These observations lead us to propose the "coevolution paradox": The strength of coevolution required to cause coordinated changes means the evolutionary rate is so low that such changes are highly unlikely to occur. As modern covariation methods may lead to breakthroughs in structural genomics, it is critical to recognize their biases and limitations.

Keywords: coevolution; covariation; molecular evolution.
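To make the notion of a covariation signal concrete, the following is a minimal sketch of mutual information (MI) between two alignment columns, estimated from observed state frequencies. It is not the authors' implementation; the function name `mutual_information` and the toy binary columns are illustrative assumptions, and no correction for phylogeny, gaps, or sample size is applied.

```python
from collections import Counter
from math import log2


def mutual_information(col_x, col_y):
    """Return the MI (in bits) between two equal-length alignment columns.

    Each column is a sequence of states, one per aligned sequence.
    Frequencies are plain empirical counts (an assumption of this sketch).
    """
    if len(col_x) != len(col_y):
        raise ValueError("columns must have the same length")
    n = len(col_x)
    if n == 0:
        raise ValueError("columns must not be empty")

    count_x = Counter(col_x)              # marginal counts at site X
    count_y = Counter(col_y)              # marginal counts at site Y
    count_xy = Counter(zip(col_x, col_y)) # joint counts for the pair

    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) / (p(x) * p(y)) simplifies to c * n / (count_x * count_y)
        mi += (c / n) * log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Toy binary sites, in the spirit of two sites with states {0, 1}
perfectly_covarying = mutual_information("0011", "0011")
independent = mutual_information("0011", "0101")
one_site_invariant = mutual_information("0000", "0011")
```

In this toy setting the perfectly covarying columns give 1 bit of MI, while the independent pair and the pair with an invariant site give 0. MI is computed from the observed states at the leaves only, so it carries no information about which branches the underlying substitutions occurred on.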

Figures

Fig. 1.
Different evolutionary scenarios for two binary sites. Each site can have {0,1} states. Thus, the pair of sites can have {00,01,10,11} states. We show in bold font the observed states at the leaves of the tree, and in italic font the states after the first split (all scenarios are identical up to that point). Disks represent unobserved true evolutionary changes: Half-disks are single changes; full-disks are double changes. Also shown is the MI for each pair of scenarios. No other covariation measures were included because they are meaningless in two-site scenarios.
Fig. 2.
Effect of coevolutionary selective pressure, S, in a binary model on the relative rate of coevolution (gray line), and the relative frequency of single observable changes at different evolutionary times, t, in units of expected numbers of substitutions (black dashed lines).
Fig. 3.
Information and parsimony basis of methods when analyzing a “well-defined” phylogenetic scenario. (A, B) Median MI for the selected pairs. (C, D) Median number of single changes occurring in the selected pairs. (E, F) Median number of double changes occurring in the selected pairs. Lines are as follows: Black-dashed, MI; red, χ²; yellow, MIH(XY); green, MIp; cyan, MIadj; blue, PSICOV; purple, DI. Shaded area shows the confidence interval of the expected value for a specific number of predictions.
Fig. 4.
Information and parsimony basis of methods when analyzing big data sets. For each alignment in the benchmark data set, 1) we calculated MI, and the number of single and double changes for all the pairs; and 2) we calculated the median value for the selected pairs, and the median value for a control sample. (A) Median MI of the selected pairs compared with the median MI for the top-informative pairs (equivalent-size set of pairs with the highest MI) for the proteins in the PSICOV benchmark. (B) Median number of single changes occurring in the selected pairs compared with the median number of single changes occurring in the rest of pairs for the proteins in the PSICOV benchmark. (C) Median number of double changes occurring in the selected pairs compared with the median number of double changes occurring in the rest of pairs for the proteins in the PSICOV benchmark.
Fig. 5.
Evolutionary basis of methods. In the “well-defined” phylogenetic scenarios, we calculated the mean rate for each pair as the average of the two single evolutionary rates. Then, we calculated the median of the sample of mean rates. (A, B) Median of the averaged evolutionary rate for the selected pairs. Lines are as follows: Black-dashed, MI; red, χ²; yellow, MIH(XY); green, MIp; cyan, MIadj; blue, PSICOV; purple, DI. Shaded area shows the expected mean rate for a specific number of predictions. (C) Median entropy H of the selected pairs compared with the median H for the top-entropic pairs for the proteins in the PSICOV benchmark.
Fig. 6.
Precision of evolution-based metrics. Lines are as follows: Black-dashed, best covariation performance; red, mean of evolutionary rates; yellow, difference (variance) between evolutionary rates; green, MPind; blue, MPdep; purple, branches with single substitutions/branches with double substitutions. Shaded area shows the random precision for a specific number of predictions.
Fig. 7.
Structural basis of methods. (A, B) Median of the weighted average accessibility of the selected pairs. Lines are as follows: Black-dashed, MI; red, χ²; yellow, MIH(XY); green, MIp; cyan, MIadj; blue, PSICOV; purple, DI. Shaded area shows the expected mean accessibility for a specific number of predictions. (C) Median weighted average accessibility in the selected pairs compared with the median weighted average accessibility in the rest of pairs for the proteins in the PSICOV benchmark.
Fig. 8.
Diagram summarizing the data processing pipeline for the “well-defined” phylogenetic data sets. We iteratively aligned sequences (blue box), and calculated the ML evolutionary tree (orange box) in order to remove distant homologs and conflicting sites. We had different checkpoints (CK) in order to filter the sequences and sites. The “well-defined” scenario consists of the final alignment and phylogenetic tree.
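
The pair-rate summary described in the Fig. 5 caption (mean of the two single-site rates for each pair, then the median over the selected pairs) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function name `median_pair_rate`, the per-site rates, and the selected pairs are all hypothetical, and how site rates are estimated is outside the scope of the sketch.

```python
from statistics import median


def median_pair_rate(site_rates, pairs):
    """Median, over the given pairs, of the mean of the two site rates.

    site_rates: per-site evolutionary rates indexed by site position.
    pairs: iterable of (i, j) site-index pairs, e.g. top-scoring predictions.
    """
    mean_rates = [(site_rates[i] + site_rates[j]) / 2 for i, j in pairs]
    if not mean_rates:
        raise ValueError("at least one pair is required")
    return median(mean_rates)


# Hypothetical rates for four sites and three selected pairs
example_rates = [0.2, 1.0, 0.4, 2.0]
example_pairs = [(0, 1), (0, 2), (1, 3)]
summary = median_pair_rate(example_rates, example_pairs)
```

Comparing this summary for the pairs selected by a covariation method against the same summary for a control sample of pairs is one way to check whether the selected pairs evolve more slowly than expected.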

