PLoS One. 2010 Jun 1;5(6):e10779.
doi: 10.1371/journal.pone.0010779.

Validation of coevolving residue algorithms via pipeline sensitivity analysis: ELSC and OMES and ZNMI, oh my!

Christopher A Brown et al. PLoS One. 2010.

Abstract

Correlated amino acid substitution algorithms attempt to discover groups of residues that co-fluctuate due to either structural or functional constraints. Although these algorithms could inform both ab initio protein folding calculations and evolutionary studies, their utility for these purposes has been hindered by a lack of confidence in their predictions due to hard-to-control sources of error. To complicate matters further, naive users are confronted with a multitude of methods to choose from, in addition to the mechanics of assembling and pruning a dataset. We first introduce a new pair scoring method, called ZNMI (Z-scored-product Normalized Mutual Information), which drastically improves the performance of mutual information for co-fluctuating residue prediction. Second, and more importantly, we recast the process of finding coevolving residues in proteins as a data-processing pipeline inspired by the medical imaging literature. We construct an ensemble of alignment partitions that can be used in a cross-validation scheme to assess the effects of choices made during the procedure on the resulting predictions. This pipeline sensitivity study gives a measure of reproducibility (how similar are the predictions given perturbations to the pipeline?) and accuracy (are residue pairs with large couplings on average close in tertiary structure?). We choose a handful of published methods, along with ZNMI, and compare their reproducibility and accuracy on three diverse protein families. We find that (i) of the algorithms tested, while none appears to be both highly reproducible and accurate, ZNMI is one of the most accurate by far and (ii) while users should be wary of predictions drawn from a single alignment, considering an ensemble of sub-alignments can help to determine couplings that are both highly accurate and reproducible. Our cross-validation approach should be of interest both to developers and end users of algorithms that try to detect correlated amino acid substitutions.
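
The two measures named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact definitions, so everything here is an assumption made for illustration: accuracy is taken as the mean structural distance of the k highest-scoring pairs, reproducibility as the Jaccard overlap of the top-k pairs from two independent splits, and the function names and the `scores`/`distances` dictionaries (keyed by residue-index pairs) are hypothetical.

```python
def top_pairs(scores, k):
    """Return the k highest-scoring residue pairs as a set."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


def accuracy(scores, distances, k):
    """Mean structural distance of the k strongest couplings (lower is better)."""
    pairs = top_pairs(scores, k)
    return sum(distances[p] for p in pairs) / len(pairs)


def reproducibility(scores_a, scores_b, k):
    """Jaccard overlap between the top-k pairs of two independent splits."""
    a, b = top_pairs(scores_a, k), top_pairs(scores_b, k)
    return len(a & b) / len(a | b)


# Toy data (made up for illustration): pair scores from two splits and
# pairwise distances in angstroms.
split_a = {(1, 2): 0.9, (1, 3): 0.5, (2, 3): 0.1}
split_b = {(1, 2): 0.8, (1, 3): 0.2, (2, 3): 0.6}
dist = {(1, 2): 4.0, (1, 3): 6.0, (2, 3): 12.0}

acc_a = accuracy(split_a, dist, k=2)
rep = reproducibility(split_a, split_b, k=2)
```

Both functions expect k of at least 1 and non-empty score dictionaries. Averaging these quantities over many split pairs would give a distribution rather than a single number, which is closer in spirit to the ensemble view the abstract argues for.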

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Improving upon mutual information by removing column bias.
Mutual information and normalized mutual information are shown for the PDZ dataset. A. The distribution of mutual information is shown for each column in the multiple sequence alignment. Mutual information is highly correlated with both the product of the mean column mutual information (scatter plot, upper inset) and the product of the standard deviation of column mutual information (scatter plot, lower inset). B. The distribution of normalized mutual information (i.e. mutual information normalized by joint entropy) is shown for each column in the multiple sequence alignment. The normalization reduces the correlation with both the product of the mean column mutual information (scatter plot, upper inset) and the product of the standard deviation of column mutual information (scatter plot, lower inset), but does not remove it entirely. C. ZNMI approximates the column normalized MI distributions (solid red line and solid blue line) as Gaussian distributions (dashed red line and dashed blue line), calculates a closed-form expression for the product of the two distributions (solid green line: kernel density estimate of product), and then z-scores the normalized mutual information (black solid vertical line) based on the Gaussian approximation of the product (dashed green line).
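
The scoring steps in panel C can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes that "the product of the two distributions" means the product of the two Gaussian densities (which is itself proportional to a Gaussian with a closed-form mean and variance), that each column's mean and standard deviation are taken over its normalized MI with every other column, and that the population standard deviation is used. The function names are hypothetical.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def entropy(symbols):
    """Shannon entropy (bits) of a sequence of hashable symbols."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())


def normalized_mi(col_i, col_j):
    """Mutual information of two columns divided by their joint entropy."""
    h_joint = entropy(list(zip(col_i, col_j)))
    if h_joint == 0.0:  # both columns fully conserved
        return 0.0
    return (entropy(col_i) + entropy(col_j) - h_joint) / h_joint


def znmi_matrix(alignment):
    """Z-score each pair's NMI against the product of its two column Gaussians."""
    columns = list(zip(*alignment))
    n = len(columns)
    nmi = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            nmi[i][j] = nmi[j][i] = normalized_mi(columns[i], columns[j])

    # Assumption: per-column Gaussian fitted to the NMI with all other columns.
    mu = [mean(nmi[i][k] for k in range(n) if k != i) for i in range(n)]
    sd = [pstdev([nmi[i][k] for k in range(n) if k != i]) for i in range(n)]

    z = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            vi, vj = sd[i] ** 2, sd[j] ** 2
            if vi == 0.0 or vj == 0.0:
                continue  # degenerate column distribution: leave score at 0
            # Product of two Gaussian densities is proportional to a Gaussian.
            mu_ij = (mu[i] * vj + mu[j] * vi) / (vi + vj)
            var_ij = vi * vj / (vi + vj)
            z[i][j] = z[j][i] = (nmi[i][j] - mu_ij) / math.sqrt(var_ij)
    return z


# Toy alignment (made up): columns 0, 1 and 3 co-vary, column 2 is independent.
toy = ["ACDA", "ACEA", "GTDC", "GTEC"]
z = znmi_matrix(toy)
```

On a real alignment, gap handling and sequence weighting would also matter; both are omitted from this sketch.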
Figure 2. Overview of the statistical pipeline.
Determining intra/inter-protein coevolving residues can be thought of as a complex, multi-step optimization process. Initial sequences, as many as possible, are collected for a protein of interest (Sequence Retrieval). The sequences are pruned by similarity and length to remove sequence fragments and sequences that heavily bias the phylogeny from the starting dataset (Preprocessing). The sequences are then aligned by available methods, and many independent disjoint splits of the dataset are made so that half of the aligned sequences are in one split and the other half are in the other split (Alignment & Partition). From this point on the two splits of the data are processed equivalently. A coevolving residue algorithm is then used to convert a split of the data (sub-alignment) into a correlation matrix that can be analyzed as an undirected weighted graph (Network). The resulting graph can then be pruned to remove insignificant edges or highly gapped columns (Pruning & Cutoffs). Finally, the independent splits are compared and result in measures of accuracy and reproducibility (Reproducibility & Accuracy).
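
The Alignment & Partition step can be sketched as repeated random halving of the aligned sequences. This is a minimal Python illustration, assuming uniform random shuffling; the function name `half_splits`, the fixed seed, and the choice to leave one sequence out when the count is odd are assumptions made here, not details from the caption.

```python
import random


def half_splits(sequences, n_splits, seed=0):
    """Yield (first_half, second_half) pairs of disjoint random sub-alignments."""
    rng = random.Random(seed)  # seeded so the ensemble of splits is repeatable
    indices = list(range(len(sequences)))
    half = len(indices) // 2
    for _ in range(n_splits):
        rng.shuffle(indices)
        # Sort the chosen indices so each half keeps the original sequence order.
        first = [sequences[k] for k in sorted(indices[:half])]
        second = [sequences[k] for k in sorted(indices[half:2 * half])]
        yield first, second


# Toy usage with placeholder sequence labels.
seqs = [f"seq{i}" for i in range(11)]
splits = list(half_splits(seqs, n_splits=5))
```

Each pair of halves would then be passed through the same scoring and pruning steps, so that differences between the two resulting networks reflect sampling of the alignment rather than differences in processing.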
Figure 3. Reproducibility and accuracy for published algorithms on three different families.
Scatterplots and histograms of reproducibility and accuracy for the three protein families considered in the text (PDZ, 1256 sequences; CS, 765 sequences; GPCR, 2476 sequences). The methods are Random (red), MI (green), old SCA (yellow), new SCA (black), OMES (cyan), ELSC (magenta), and ZNMI (blue). The top row shows the results when we construct the consensus network using MST, and the bottom row shows the results with TNm1. The y axes on the reproducibility histograms have been rescaled to allow better visualization of the shapes of the distributions. The line colors shown in the GPCR MST panel are used consistently throughout. The old version of SCA often produces accuracies below that of random (near zero, left side of each plot); see the text for further discussion on this point.
Figure 4. Weights in the consensus networks decay dramatically.
A. The largest connected component in the MST consensus network, for the full CS dataset, is shown as a function of edge weight cutoff. For all of the edge scoring methods considered, but particularly ZNMI and oSCA, use of MSTs to construct the consensus network results in small, disconnected clusters when the consensus network is relatively mildly pruned. Directly above the plot, heatmaps are displayed for the Jaccard index (all methods vs. all methods, excluding Rand) for three points along the curve (0.25, 0.5, and 0.75). As the network is pruned, the Jaccard indices generally remain the same with only slight increases in overlap between methods (note: ZNMI and ELSC at a cutoff of 0.75). Note that the colorscale is given not in terms of the actual Jaccard index but the percent similarity between the two sets of edges (see “Methods”). B. Cutting the graph with increasing edge weight results in edges that are in fact closer in tertiary structure, as measured by their mean formula image distance. Directly above the plot, the consensus graph is shown at three different edge frequency cutoffs. Note the dramatic transition in the consensus graph between a weight of 0.25 and 0.5; simply removing edges which co-occur less than 50% of the time results in a network consisting primarily of small, disjoint clusters. Notice also that even at a cutoff of 0.75, many nontrivial clusters (beyond simple pairs) remain in the network.
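
The curve in panel A can be reproduced in outline by pruning a weighted consensus graph at a cutoff and measuring its largest connected component. The Python sketch below uses a small union-find structure; it assumes edge weights are co-occurrence frequencies in [0, 1] and that edges with weight greater than or equal to the cutoff are kept. The function name and the toy edge weights are hypothetical.

```python
def largest_component_size(edge_weights, cutoff):
    """Size of the largest connected component using edges with weight >= cutoff."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (u, v), weight in edge_weights.items():
        if weight >= cutoff:
            parent[find(u)] = find(v)  # merge the two components

    sizes = {}
    for node in list(parent):
        root = find(node)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values(), default=0)


# Toy consensus network (made up): keys are residue pairs, values are the
# fraction of splits in which the edge appeared.
consensus = {(1, 2): 0.9, (2, 3): 0.8, (3, 4): 0.3, (5, 6): 0.6}
curve = [largest_component_size(consensus, c) for c in (0.25, 0.5, 0.75, 0.95)]
```

Nodes whose edges all fall below the cutoff are not counted at all here, so an empty pruned graph has size 0 rather than 1; that is a convention chosen for this sketch.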
Figure 5. Consensus communities at 90% reproducibility mapped to the PDZ tertiary structure.
The upper left panel shows the consensus network for the PDZ dataset at a reproducibility cutoff of 90%. The remaining three panels give three views of the consensus networks mapped to our chosen canonical PDZ structure (PDB Identifier: 1IU0). The color coding in the upper left panel is identical when considering the structures. While some of the consensus co-fluctuating groups are quite close in sequence (orange and dark blue), others (cyan) are quite far away. A closer look at the red and dark purple clusters is given in Figure 6. For this figure and Figures 6–10, ZNMI is the pair scoring method and MSTs were used to construct the consensus networks.
Figure 6. Two disjoint but intertwined communities mapped to the PDZ tertiary structure.
Shown here is a closeup of the red and purple clusters from Figure 5. These two communities are disjoint at this cutoff (90%) and on opposite sides of the pictured helix. Also of note is that they have a periodicity of three in sequence, not four residues as would be the case with residues interacting through the turns of an α-helix.
Figure 7. Consensus communities at 90% reproducibility mapped to the CS tertiary structure.
The upper left panel shows the consensus network for the CS dataset, again at a reproducibility cutoff of 90%. Several of these communities have been colored in and mapped to the canonical structure (PDB Identifier: 1R52); the color code is consistent between the networks and the structural views. The magenta community is considered more closely in Figure 8.
Figure 8. Small community in the CS consensus network highlights a dimerization interface.
Here we show a closeup view of the CS structure from Figure 7 and the network colored in magenta. Viewed on a single copy of the CS structure, the magenta community seems to be meaningless. However, when CS dimerization is considered, the magenta community shows its role as a key set of residues mediating an inter-subunit coupling. Also of note is that the topology of these residues in the consensus network exactly mimics their minimum distance topology in the tertiary structure.
Figure 9. Consensus communities at 90% reproducibility mapped to the GPCR tertiary structure.
The upper left panel shows the consensus network for the GPCR dataset at 90% reproducibility, and the remaining panels show selected communities mapped onto the canonical structure (PDB Identifier: 2VT4). The consensus network here was computed from a 1000-sequence GPCR dataset because it was more accurate than the full 2476 sequences (an average of 10 angstroms vs 15 angstroms).
Figure 10. One large community from the consensus network at 90% reproducibility mapped to the GPCR tertiary structure.
We show here an enlargement of the magenta community from Figure 9. The inset shows the cluster along with two boxes highlighting two portions of the community. Note that this cluster shows significant coupling at large physical distances; the four residues outlined in the inset are at the top of the figure and the other two outlined residues are at bottom.
