Annotation error in public databases: misannotation of molecular function in enzyme superfamilies
- PMID: 20011109
- PMCID: PMC2781113
- DOI: 10.1371/journal.pcbi.1000605
Abstract
Due to the rapid release of new data from genome sequencing projects, the majority of protein sequences in public databases have not been experimentally characterized; rather, sequences are annotated using computational analysis. The level of misannotation and the types of misannotation in large public databases are currently unknown and have not been analyzed in depth. We have investigated the misannotation levels for molecular function in four public protein sequence databases (UniProtKB/Swiss-Prot, GenBank NR, UniProtKB/TrEMBL, and KEGG) for a model set of 37 enzyme families for which extensive experimental information is available. The manually curated database Swiss-Prot shows the lowest annotation error levels (close to 0% for most families); the two other protein sequence databases (GenBank NR and TrEMBL) and the protein sequences in the KEGG pathways database exhibit similar and surprisingly high levels of misannotation that average 5%-63% across the six superfamilies studied. For 10 of the 37 families examined, the level of misannotation in one or more of these databases is >80%. Examination of the NR database over time shows that misannotation has increased from 1993 to 2005. The types of misannotation that were found fall into several categories, most associated with "overprediction" of molecular function. These results suggest that misannotation in enzyme superfamilies containing multiple families that catalyze different reactions is a larger problem than has been recognized. Strategies are suggested for addressing some of the systematic problems contributing to these high levels of misannotation.
Conflict of interest statement
The authors have declared that no competing interests exist.