High precision multi-genome scale reannotation of enzyme function by EFICAz

Adrian K Arakaki¹, Weidong Tian, Jeffrey Skolnick

Affiliation

¹ Center for the Study of Systems Biology, School of Biology, Georgia Institute of Technology, Atlanta, Georgia 30318, USA. adrian.arakaki@gatech.edu

PMID: 17166279
PMCID: PMC1764738
DOI: 10.1186/1471-2164-7-315

Adrian K Arakaki et al. BMC Genomics. 2006 Dec 13;7:315.

Abstract

Background: The functional annotation of most genes in newly sequenced genomes is inferred from similarity to previously characterized sequences, an annotation strategy that often leads to erroneous assignments. We have performed a reannotation of 245 genomes using an updated version of EFICAz, a highly precise method for enzyme function prediction.

Results: Based on our three-field EC number predictions, we have obtained lower-bound estimates for the average enzyme content in Archaea (29%), Bacteria (30%) and Eukarya (18%). Most annotations added in KEGG from 2005 to 2006 agree with EFICAz predictions made in 2005. The coverage of EFICAz predictions is significantly higher than that of KEGG, especially for eukaryotes. Thousands of our novel predictions correspond to hypothetical proteins. We have identified a subset of 64 hypothetical proteins with low sequence identity to EFICAz training enzymes whose biochemical functions have recently been characterized, and we find that we correctly identified the three-field EC number in 96% of the cases and the four-field EC number in 84%. For two of these 64 proteins, PA1167 from Pseudomonas aeruginosa, an alginate lyase (EC 4.2.2.3), and Rv1700 from Mycobacterium tuberculosis H37Rv, an ADP-ribose diphosphatase (EC 3.6.1.13), we have detected an annotation lag of more than two years in databases. We present two examples in which EFICAz predictions act as hypothesis generators for understanding the functional roles of hypothetical proteins: FLJ11151, a human protein overexpressed in cancer that EFICAz identifies as an endopolyphosphatase (EC 3.6.1.10), and MW0119, a protein of Staphylococcus aureus strain MW2 that we propose as a candidate virulence factor based on its EFICAz-predicted activity, sphingomyelin phosphodiesterase (EC 3.1.4.12).
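
The distinction between three-field and four-field EC numbers drawn above can be illustrated with a minimal Python sketch. It is not part of EFICAz; the function names `parse_ec`, `matching_fields` and `agrees_at` are hypothetical, and the sketch assumes that unspecified trailing fields are written as `-` (for example `3.6.1.-`).

```python
def parse_ec(ec: str) -> list[str]:
    """Return the specified fields of an EC number, e.g. '3.6.1.13' -> ['3', '6', '1', '13'].

    Parsing stops at the first empty or '-' field, so '3.6.1.-' yields three fields.
    """
    text = ec.strip().removeprefix("EC").strip()
    fields = []
    for field in text.split("."):
        if field in ("", "-"):
            break
        fields.append(field)
    return fields


def matching_fields(predicted: str, reference: str) -> int:
    """Count the leading EC fields on which the two numbers agree."""
    count = 0
    for p, r in zip(parse_ec(predicted), parse_ec(reference)):
        if p != r:
            break
        count += 1
    return count


def agrees_at(predicted: str, reference: str, level: int) -> bool:
    """True if the two EC numbers share at least `level` leading fields."""
    return matching_fields(predicted, reference) >= level
```

Under these assumptions, a prediction of `3.6.1.10` for a protein whose characterized function is `3.6.1.13` would count as correct at the three-field level but not at the four-field level.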

Conclusion: Our results suggest that we have generated enzyme function annotations of high precision and recall. These predictions can be mined and correlated with other information sources to generate biologically significant hypotheses and can be useful for comparative genome analysis and automated metabolic pathway reconstruction.

Figures

**Figure 1**
**Enzyme content in organisms from the three domains of life**. Number of enzymes as a function of the proteome size for archaeal (A), bacterial (B) and eukaryotic (C) genomes. The gray, magenta and green lines represent the regression line and the 95% and 99% prediction intervals, respectively. (D) Distribution of the fraction of enzymes in archaeal, bacterial and eukaryotic genomes. The statistics represented in the box-and-whisker plots are: outliers below the 10th percentile (circles, bottom), 10th percentile (whisker, bottom), 25th percentile (box, bottom), median (thick line), 75th percentile (box, top), 90th percentile (whisker, top) and outliers above the 90th percentile (circles, top).

**Figure 2**
**Comparison of EFICAz predictions with KEGG annotations**. Comparison of EFICAz predictions with KEGG annotations from the *Genes* database of March 5, 2005, Release 33.0+/03–5 (**A-B**) and of March 7, 2006, Release 37.0+/03–07 (**C-D**). We analyze two levels of enzyme function description: four-field EC numbers (**A, C**) and three-field EC numbers (**B, D**). For all genomes together and for archaeal, bacterial and eukaryotic genomes separately, we plot the average percentage of enzymatic proteins per genome whose EFICAz-inferred and KEGG-provided annotations at the specified level of detail agree (green columns) or disagree (red columns), and whose enzyme function annotation at the specified level of detail is provided only by EFICAz (blue columns) or only by KEGG (yellow columns). The numeric values inserted in each stacked column are the corresponding average percentage of enzymatic proteins per genome ± the standard deviation.
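
The four categories in this comparison (agree, disagree, EFICAz only, KEGG only) can be sketched for a single genome as follows. This is an illustrative sketch, not the authors' procedure: the names `truncate_ec` and `compare_genome` are hypothetical, each protein is assumed to carry at most one EC number per source, and the denominator is assumed to be the proteins annotated at the chosen level by at least one source.

```python
from collections import Counter
from typing import Optional

CATEGORIES = ("agree", "disagree", "eficaz_only", "kegg_only")


def truncate_ec(ec: Optional[str], level: int) -> Optional[str]:
    """Return the first `level` fields of an EC number, or None if they are not all specified."""
    if ec is None:
        return None
    head = ec.split(".")[:level]
    if len(head) < level or any(field in ("", "-") for field in head):
        return None
    return ".".join(head)


def compare_genome(eficaz: dict, kegg: dict, level: int) -> dict:
    """Percentage of enzymatic proteins in each category at the given EC level.

    `eficaz` and `kegg` map protein identifiers to EC number strings.
    """
    counts = Counter()
    for protein in set(eficaz) | set(kegg):
        e = truncate_ec(eficaz.get(protein), level)
        k = truncate_ec(kegg.get(protein), level)
        if e is None and k is None:
            continue  # not annotated at this level by either source
        if e is not None and k is not None:
            counts["agree" if e == k else "disagree"] += 1
        elif e is not None:
            counts["eficaz_only"] += 1
        else:
            counts["kegg_only"] += 1
    total = sum(counts.values())
    return {c: (100.0 * counts[c] / total if total else 0.0) for c in CATEGORIES}
```

Averaging these per-genome percentages over a set of genomes, together with their standard deviation, would give values of the kind shown in the stacked columns.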

**Figure 3**
**Similarity of 64 previously hypothetical proteins to EFICAz training enzymes**. Number of previously hypothetical proteins predicted to be enzymes by EFICAz at different intervals of maximal sequence identity to enzymes included in the EFICAz version 5.0 training set. The true enzyme function of these 64 previously hypothetical proteins has recently been determined; therefore, we could assess the precision of our predictions. Dark green, light green and red bars represent four-field EC number predictions with four, three or fewer than three correct EC fields, respectively. Yellow and orange bars represent three-field EC number predictions with three or fewer than three correct EC fields, respectively. The median of the distribution (24.8%) is indicated by the broken line.

**Figure 4**
**Benchmark test of updated versions of EFICAz**. Precision (**A-C**), recall (**D-F**) and number of enzyme types described by four-field EC numbers (**G-I**) for different versions of EFICAz, at different levels of maximal testing to training sequence identity, averaged per enzyme type. Curves in red correspond to enzyme types for which at least 10 training sequences were available; curves in blue correspond to all enzyme types. The training of versions 2.0, 3.0 and 4.0 of EFICAz is based on the Releases 2.0, 3.0 and 4.0 of UniProt, respectively. The new Swiss-Prot sequences added to UniProt 5.0 since the release of UniProt 2.0, 3.0 and 4.0 constitute the test sequences for versions 2.0, 3.0 and 4.0 of EFICAz. See Methods for a full description of the benchmark procedure.
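
Precision and recall averaged per enzyme type can be sketched as below. The sketch uses the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN), computed separately for each EC number and then averaged. The function name `per_type_precision_recall` is hypothetical, and skipping enzyme types for which a metric is undefined is an assumption; the benchmark procedure itself is described in the paper's Methods.

```python
from collections import defaultdict


def per_type_precision_recall(predicted: dict, actual: dict):
    """Average precision and recall over enzyme types (EC numbers).

    `predicted` and `actual` map protein identifiers to sets of EC numbers.
    Returns (average_precision, average_recall); a value is None when no
    enzyme type has a defined score for that metric.
    """
    tp = defaultdict(int)  # correctly predicted EC numbers
    fp = defaultdict(int)  # predicted but not true
    fn = defaultdict(int)  # true but not predicted
    for protein in set(predicted) | set(actual):
        pred = predicted.get(protein, set())
        true = actual.get(protein, set())
        for ec in pred & true:
            tp[ec] += 1
        for ec in pred - true:
            fp[ec] += 1
        for ec in true - pred:
            fn[ec] += 1

    precisions, recalls = [], []
    for ec in set(tp) | set(fp) | set(fn):
        if tp[ec] + fp[ec] > 0:  # precision is defined only if the type was predicted
            precisions.append(tp[ec] / (tp[ec] + fp[ec]))
        if tp[ec] + fn[ec] > 0:  # recall is defined only if the type truly occurs
            recalls.append(tp[ec] / (tp[ec] + fn[ec]))

    avg_precision = sum(precisions) / len(precisions) if precisions else None
    avg_recall = sum(recalls) / len(recalls) if recalls else None
    return avg_precision, avg_recall
```

Averaging per enzyme type gives every EC number the same weight, so a few abundant enzyme families do not dominate the result.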
