Database (Oxford). 2014 Jun 12;2014:bau058.
doi: 10.1093/database/bau058. Print 2014.

Curation accuracy of model organism databases

Ingrid M Keseler et al. Database (Oxford).

Abstract

Manual extraction of information from the biomedical literature, or biocuration, is the central methodology used to construct many biological databases. For example, the UniProt protein database, the EcoCyc Escherichia coli database and the Candida Genome Database (CGD) are all based on biocuration. Biological databases are used extensively by life science researchers as online encyclopedias, as aids in the interpretation of new experimental data and as gold standards for the development of new bioinformatics algorithms. Although manual curation has been assumed to be highly accurate, we are aware of only one previous study of biocuration accuracy. We assessed the accuracy of EcoCyc and CGD by manually selecting curated assertions within randomly chosen EcoCyc and CGD gene pages and then checking whether the data in the referenced publications supported those assertions. A database assertion was considered to be in error if it could not be found in the publication cited for it. We identified 10 errors in the 633 facts that we validated across the two databases, for an overall error rate of 1.58%, with individual error rates of 1.82% for CGD and 1.40% for EcoCyc. These data suggest that manual curation of the experimental literature by PhD-level scientists is highly accurate. Database URL: http://ecocyc.org/, http://www.candidagenome.org/
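The overall error rate quoted in the abstract follows directly from the two reported counts (10 errors in 633 validated facts). The sketch below reproduces that arithmetic and, as an illustration only, attaches an approximate 95% Wilson score interval. The interval is an assumption added here and is not reported in the abstract; the function names `error_rate` and `wilson_interval` are hypothetical. The per-database rates are not recomputed because the abstract does not give the per-database counts.

```python
import math


def error_rate(errors: int, total: int) -> float:
    """Fraction of validated assertions that were found to be in error."""
    if total <= 0:
        raise ValueError("total must be positive")
    return errors / total


def wilson_interval(errors: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate Wilson score interval for a binomial proportion.

    Assumption: z = 1.96 corresponds to a ~95% interval. This interval
    is illustrative and does not come from the abstract.
    """
    p = error_rate(errors, total)
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Counts reported in the abstract: 10 errors among 633 validated facts.
overall = error_rate(10, 633)
low, high = wilson_interval(10, 633)

print(f"overall error rate: {overall:.2%}")  # prints "overall error rate: 1.58%"
print(f"approximate 95% interval: {low:.2%} to {high:.2%}")
```

With only 10 observed errors, such an interval is fairly wide relative to the point estimate, which is worth keeping in mind when comparing the two databases' individual rates.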
