PLoS One. 2010 Dec 13;5(12):e14286.
doi: 10.1371/journal.pone.0014286.

Accurate protein structure annotation through competitive diffusion of enzymatic functions over a network of local evolutionary similarities

Eric Venner et al. PLoS One. 2010.

Abstract

High-throughput Structural Genomics yields many new protein structures without known molecular function. This study aims to uncover these missing annotations by globally comparing select functional residues across the structural proteome. First, Evolutionary Trace Annotation, or ETA, identifies which proteins have local evolutionary and structural features in common; next, these proteins are linked together into a proteomic network of ETA similarities; then, starting from proteins with known functions, competing functional labels diffuse link-by-link over the entire network. Every node is thus assigned a likelihood z-score for every function, and the most significant one at each node wins and defines its annotation. In high-throughput controls, this competitive diffusion process recovered enzyme activity annotations with 99% and 97% accuracy at half-coverage for the third and fourth Enzyme Commission (EC) levels, respectively. This corresponds to false positive rates 4-fold lower than nearest-neighbor and 5-fold lower than sequence-based annotations. In practice, experimental validation of the predicted carboxylesterase activity in a protein from Staphylococcus aureus illustrated the effectiveness of this approach in the context of an increasingly drug-resistant microbe. This study further links molecular function to a small number of evolutionarily important residues recognizable by Evolutionary Tracing and it points to the specificity and sensitivity of functional annotation by competitive global network diffusion. A web server is at http://mammoth.bcm.tmc.edu/networks.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Overview of ETA Network Diffusion.
1A. We detect similarities between proteins using Evolutionary Trace Annotation (ETA), which consists of three steps. First, the Evolutionary Trace (ET) algorithm ranks positions in aligned sequences by the correlation of their variations with evolutionary divergence. These ranks of evolutionary importance are mapped onto the protein structure. Second, six amino acids are selected heuristically based on their evolutionary importance, proximity and surface exposure, forming a structural template (red spheres). Third, the template is matched against proteins with known function. These steps are repeated for the matched proteins in order to verify that the match is reciprocal. Significant matches are selected by an SVM (not depicted). 1B. We construct a graph using ETA matches so that nodes represent protein chains and edges represent evolutionary and structural similarity. We select an enzymatic function and apply one of three labels to every node in the network: blue if the node is known to have that function, white if it is known to not have that function, or “?” if it is unknown whether or not the node has that function. We then allow these labels to “diffuse” to all other nodes in the network based on the strength and number of connections. This results in a weight assigned to every node for all enzymatic functions present in our network. In a final step (not depicted) we normalize the weights assigned to a particular node with respect to all other un-annotated nodes in the network. The normalized weights (called z-scores) are compared. The functional label with the highest z-score is taken as the prediction, and the magnitude of the z-score is used as a measure of confidence.
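The diffusion and z-score steps described in panel 1B can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the update rule (each unannotated node takes the weighted average of its neighbors while annotated nodes stay fixed at +1 or -1) is a common label-propagation scheme chosen here for simplicity, edge weights are assumed positive, and the names `diffuse`, `annotate`, `F1` and `F2` are hypothetical.

```python
from statistics import mean, pstdev


def diffuse(edges, labels, n_iter=200):
    """Spread one function's labels over the network.

    edges:  {(u, v): weight} for undirected ETA matches (weights assumed > 0)
    labels: {node: +1.0 or -1.0} for annotated nodes; all others start at 0
    """
    nbrs = {}
    for (u, v), w in edges.items():
        nbrs.setdefault(u, {})[v] = w
        nbrs.setdefault(v, {})[u] = w
    score = {n: labels.get(n, 0.0) for n in nbrs}
    for _ in range(n_iter):
        new = dict(score)
        for n in nbrs:
            if n in labels:
                continue  # annotated nodes keep their label
            total = sum(nbrs[n].values())
            new[n] = sum(w * score[m] for m, w in nbrs[n].items()) / total
        score = new
    return score


def annotate(edges, known, functions):
    """Return {unannotated node: (winning function, z-score)}.

    known: {node: function}. For each function, weights are normalized
    against all unannotated nodes, then the highest z-score wins.
    """
    nodes = {n for e in edges for n in e}
    unknown = sorted(nodes - set(known))
    z = {n: {} for n in unknown}
    for f in functions:
        labels = {n: (1.0 if g == f else -1.0) for n, g in known.items()}
        score = diffuse(edges, labels)
        vals = [score[n] for n in unknown]
        mu, sd = mean(vals), pstdev(vals)
        for n in unknown:
            z[n][f] = (score[n] - mu) / sd if sd > 0 else 0.0
    return {n: max(z[n].items(), key=lambda kv: kv[1]) for n in unknown}


# Toy chain a - x - y - b with two annotated ends (hypothetical data).
toy_edges = {("a", "x"): 1.0, ("x", "y"): 1.0, ("y", "b"): 1.0}
toy_known = {"a": "F1", "b": "F2"}
toy_result = annotate(toy_edges, toy_known, ["F1", "F2"])
```

In this toy chain each unannotated node ends up with the label of the annotated node it sits closest to, and the z-score magnitude plays the role of the confidence measure mentioned in the caption.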
Figure 2. Performance on the FLORA test set.
The diffusion method shows a clear improvement at higher sensitivities.
Figure 3. 4 EC Performance on Structural Genomics test set.
3A. Accuracy/coverage tradeoffs of ETA network diffusion and nearest neighbors are shown in red and blue circles, respectively. Coverage (percentage of the entire test set) increases as confidence decreases, so at 10% coverage we show the accuracy (# of true predictions/# of predictions made) of our 10% most confident predictions. The blue triangle shows the performance of ETA. Diffusion gives clear accuracy advantages at most coverage cutoffs. 3B. Performance compared to the top match from a BLAST search of Swiss-Prot. Diffusion on an ETA network clearly outperforms BLAST (black circles) at most coverages on this dataset, demonstrating the need for complementary structure-based methods. 3C. Accuracies when the z-score cutoff is varied. For each z-score, we plot the accuracy of all predictions with that score or higher. Accuracy and z-score show a positive correlation. Accuracy shows a steep decline after z = 0.4. 3D. Magnified view of the beginning of the steep decline.
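The two evaluation curves in this caption (accuracy at a given coverage, and accuracy at a given z-score cutoff) can be computed as follows. This is a sketch based only on the definitions given in the caption; the function names, the `(z_score, is_correct)` tuple layout and the rounding of the coverage cutoff are assumptions.

```python
def accuracy_at_coverage(predictions, test_set_size, coverage):
    """Accuracy of the most confident predictions covering a fraction of the test set.

    predictions:   list of (z_score, is_correct) pairs, one per prediction made
    test_set_size: number of proteins in the whole test set
    coverage:      fraction of the test set to cover, e.g. 0.10 for 10%
    Returns None when fewer predictions exist than the coverage requires.
    """
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    k = int(round(coverage * test_set_size))
    if k == 0 or k > len(ranked):
        return None
    return sum(1 for _, ok in ranked[:k] if ok) / k


def accuracy_at_z_cutoff(predictions, z_cutoff):
    """Accuracy of all predictions with a z-score at or above the cutoff."""
    kept = [ok for z, ok in predictions if z >= z_cutoff]
    return sum(kept) / len(kept) if kept else None


# Hypothetical predictions: (z-score, whether the predicted function was right).
toy_predictions = [(3.0, True), (2.0, True), (1.0, False), (0.5, True)]
```

With a test set of 8 proteins, a coverage of 0.25 keeps the two most confident predictions, and a coverage of 1.0 cannot be reached because only four predictions were made.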
Figure 4. In vitro biochemical assay confirms the ETA network diffusion prediction of 3h04 as a carboxylesterase.
A) The prediction of carboxylesterase function for this unknown protein is based on ETA template matches to three chains, all of which have identical function and fold, and low sequence identity with the query protein. B) 10 µg of purified 3h04 was run on an SDS-12% polyacrylamide gel and stained with Coomassie brilliant blue. The single band shown at 35 kDa corresponds to his-tagged 3h04. C) Plot of absorbance at 405 nm vs time for 3h04 (blue), esterase from porcine liver (Sigma, red), and BSA (Sigma, green). D) The specific activity of 3h04, 193±8 (blue), is similar to that of the esterase from porcine liver, 166±51 (Sigma, red). Specific activity is represented in Units (U) per mg of protein. All error bars depict standard deviation.
Figure 5. Performance penalty as edges are removed from a graph according to the sequence similarity of the nodes they connect for 4 EC predictions.
Accuracy/coverage tradeoffs of ETA network diffusion, nearest neighbor, and the top match from a BLAST search against Swiss-Prot are shown in red, blue and black circles, respectively. Coverage increases as confidence decreases, meaning at 10% coverage we show the accuracy of our 10% most confident predictions. Maximum allowed sequence identity is 80% in 5A, 60% in 5B, 40% in 5C and 20% in 5D. Accuracies decline with each removal, but ETA network diffusion maintains higher accuracy at high confidences/low coverage.
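The edge-removal experiment can be illustrated with a small filter that drops every edge whose two nodes exceed a maximum sequence identity. This is a sketch under assumptions: the function name and the dictionary layout are hypothetical, and whether the cutoff itself is inclusive is not stated in the caption (it is treated as inclusive here).

```python
def filter_edges_by_identity(edges, identity, max_identity):
    """Keep only edges whose endpoints share at most `max_identity` sequence identity.

    edges:    {(u, v): weight} for ETA matches
    identity: {(u, v): fraction of identical residues, 0.0 to 1.0}
    """
    return {e: w for e, w in edges.items() if identity[e] <= max_identity}


# Hypothetical toy network and pairwise sequence identities.
toy_net = {("p1", "p2"): 1.0, ("p2", "p3"): 0.7, ("p3", "p4"): 0.4}
toy_identity = {("p1", "p2"): 0.90, ("p2", "p3"): 0.55, ("p3", "p4"): 0.15}

# One progressively sparser network per cutoff used in the figure.
toy_sweep = {
    cutoff: filter_edges_by_identity(toy_net, toy_identity, cutoff)
    for cutoff in (0.80, 0.60, 0.40, 0.20)
}
```

Each filtered network would then be fed back into the prediction and evaluation steps to obtain the accuracy/coverage curve for that cutoff.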
Figure 6. Network neighborhood of PDB structure 2dz9A.
Depicts the network neighborhood within 2 steps from structure 2dz9A. Structures in red are annotated as biotin—acetyl-CoA-carboxylase ligases (6.3.4.15). White structures have no function or are part of the test set. The nearest neighbor method leads to no prediction for 2dz9A because all matches are only to proteins without known function, but diffusion leads to a correct prediction because of the proximity to that functional label and high connectivity.
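The contrast drawn in this caption can be reproduced on a toy graph: a nearest-neighbor rule that only inspects direct matches returns nothing when those matches are all unannotated, while a two-step neighborhood already reaches annotated structures. This is a sketch with hypothetical node names and a simplified nearest-neighbor rule (the strongest annotated direct match wins), which is an assumption rather than the paper's exact definition.

```python
from collections import deque


def neighborhood(adj, start, max_steps=2):
    """Breadth-first search: {node: number of steps from start}, up to max_steps."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        n = queue.popleft()
        if dist[n] == max_steps:
            continue  # do not expand beyond the step limit
        for m in adj.get(n, {}):
            if m not in dist:
                dist[m] = dist[n] + 1
                queue.append(m)
    return dist


def nearest_neighbor_prediction(adj, known, node):
    """Function of the strongest directly matched annotated node, or None."""
    annotated = [(w, m) for m, w in adj.get(node, {}).items() if m in known]
    if not annotated:
        return None
    return known[max(annotated)[1]]


# Hypothetical neighborhood: the query matches only unannotated structures,
# which in turn match annotated ligases (EC 6.3.4.15).
toy_adj = {
    "query": {"u1": 1.0, "u2": 1.0},
    "u1": {"query": 1.0, "lig1": 1.0},
    "u2": {"query": 1.0, "lig1": 1.0, "lig2": 1.0},
    "lig1": {"u1": 1.0, "u2": 1.0},
    "lig2": {"u2": 1.0},
}
toy_annotations = {"lig1": "6.3.4.15", "lig2": "6.3.4.15"}

toy_nn = nearest_neighbor_prediction(toy_adj, toy_annotations, "query")
toy_hood = neighborhood(toy_adj, "query", max_steps=2)
```

Here `toy_nn` is `None` because no direct match is annotated, whereas `toy_hood` contains both annotated nodes at distance 2, which is the situation in which diffusion can still carry a label to the query.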
