BMC Bioinformatics. 2009 Jul 21;10:228.
doi: 10.1186/1471-2105-10-228.

Semi-automated curation of protein subcellular localization: a text mining-based approach to Gene Ontology (GO) Cellular Component curation

Kimberly Van Auken et al. BMC Bioinformatics.

Abstract

Background: Manual curation of experimental data from the biomedical literature is an expensive and time-consuming endeavor. Nevertheless, most biological knowledge bases still rely heavily on manual curation for data extraction and entry. Text mining software that can semi- or fully automate information retrieval from the literature would thus provide a significant boost to manual curation efforts.

Results: We employ the Textpresso category-based information retrieval and extraction system (http://www.textpresso.org), developed by WormBase, to explore how Textpresso might improve the efficiency with which we manually curate C. elegans proteins to the Gene Ontology's Cellular Component Ontology. Using a training set of sentences that describe results of localization experiments in the published literature, we generated three new curation task-specific categories (Cellular Components, Assay Terms, and Verbs) containing words and phrases associated with reports of experimentally determined subcellular localization. We compared the results of manual curation to those of Textpresso queries that searched the full text of articles for sentences containing terms from each of the three new categories plus the name of a previously uncurated C. elegans protein. Compared to manual curation, Textpresso searches identified curatable papers with recall and precision rates of 79.1% and 61.8%, respectively (F-score of 69.5%). Within those documents, Textpresso identified relevant sentences with recall and precision rates of 30.3% and 80.1%, respectively (F-score of 44.0%). From returned sentences, curators were able to make 66.2% of all possible experimentally supported GO Cellular Component annotations with 97.3% precision (F-score of 78.8%). Measuring the relative efficiencies of Textpresso-based versus manual curation, we find that Textpresso has the potential to increase curation efficiency by at least 8-fold, and perhaps as much as 15-fold, given differences in individual curatorial speed.
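
The F-scores quoted above can be checked against the reported recall and precision rates. The sketch below assumes the F-score is the balanced harmonic mean of precision and recall (the abstract does not state the formula); the function name `f_score` and the task labels are illustrative only.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall, precision) in percent, as reported in the abstract.
# The task labels are shorthand introduced here for illustration.
reported = {
    "paper identification": (79.1, 61.8),
    "sentence retrieval": (30.3, 80.1),
    "annotation": (66.2, 97.3),
}

for task, (recall, precision) in reported.items():
    print(f"{task}: F = {f_score(precision, recall):.1f}%")
```

Under this assumption the sentence-retrieval and annotation values reproduce the reported 44.0% and 78.8%. The paper-identification value comes out at 69.4% rather than 69.5%, which may simply reflect that the published figure was computed from unrounded counts.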

Conclusion: Textpresso is an effective tool for improving the efficiency of manual, experimentally based curation. Incorporating a Textpresso-based Cellular Component curation pipeline at WormBase has allowed us to transition from strictly manual curation of this data type to a more efficient pipeline of computer-assisted validation. Continued development of curation task-specific Textpresso categories will provide an invaluable resource for genomics databases that rely heavily on manual curation.

Figures

Figure 1
Textpresso category development for Cellular Component curation. Curators identified true positive sentences from a training set and used word frequency analysis and manual inspection to identify words and phrases that were most indicative of experimentally determined subcellular localization. Three new categories, Cellular Components, Assay Terms, and Verbs, were created.
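
As a rough illustration of how such categories can drive a sentence-level query, the sketch below keeps a sentence only if it names the protein of interest and contains at least one term from each of the three categories. This is not Textpresso's implementation: the term lists are tiny hypothetical stand-ins for the real categories, `XYZ-1` is a made-up protein name, and `is_candidate` is a helper invented for this example.

```python
import re

# Hypothetical, heavily abbreviated term lists (the actual category
# contents are not given in this record).
CELLULAR_COMPONENTS = {"nucleus", "nuclei", "cytoplasm", "plasma membrane"}
ASSAY_TERMS = {"gfp", "antibody", "antibodies", "immunostaining"}
VERBS = {"localizes", "localized", "expressed", "detected"}


def contains_term(sentence: str, terms: set) -> bool:
    """True if any term occurs in the sentence as a whole word or phrase."""
    lowered = sentence.lower()
    return any(
        re.search(r"\b" + re.escape(term) + r"\b", lowered) for term in terms
    )


def is_candidate(sentence: str, protein: str) -> bool:
    """Require the protein name plus one term from every category."""
    has_protein = re.search(r"\b" + re.escape(protein) + r"\b", sentence)
    if has_protein is None:
        return False
    categories = (CELLULAR_COMPONENTS, ASSAY_TERMS, VERBS)
    return all(contains_term(sentence, terms) for terms in categories)


sentences = [
    "XYZ-1::GFP is localized to the nucleus of intestinal cells.",
    "XYZ-1 mutants are sterile.",
    "GFP was detected in the cytoplasm.",
]
hits = [s for s in sentences if is_candidate(s, "XYZ-1")]
print(hits)
```

Only the first sentence passes: the second lacks category terms and the third lacks the protein name. Requiring all four elements in one sentence favors precision over recall, which is consistent with the sentence-level rates reported in the abstract.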
Figure 2
Sample true positive sentences from the training set. Three different sentences from the training set are shown [19-21], illustrating the types of sentences selected by curators and the individual terms selected for each of the categories. C. elegans proteins are shown in upper-case bold type, Cellular Components in blue, Assay Terms in red, and Verbs in green.
Figure 3
Textpresso-based curation is more efficient than manual curation. Three different curators recorded the amount of time it took to identify cellular component information from a set of 20 papers either read manually or searched via Textpresso using the three new cellular component categories. Textpresso-based curation results in an 8–15-fold improvement in curation efficiency depending upon the individual curator and the paper set.

References

    1. Rogers A, Antoshechkin I, Bieri T, Blasiar D, Bastiani C, Canaran P, Chan J, Chen WJ, Davis P, Fernandes J, et al. WormBase 2007. Nucleic Acids Research. 2008:D612–617.
    2. Mulder N, Apweiler R. InterPro and InterProScan: tools for protein sequence classification and comparison. Methods in Molecular Biology (Clifton, NJ). 2007;396:59–70.
    3. Huang CQ, Gasser RB, Cantacessi C, Nisbet AJ, Zhong W, Sternberg PW, Loukas A, Mulvenna J, Lin RQ, Chen N, et al. Genomic-Bioinformatic Analysis of Transcripts Enriched in the Third-Stage Larva of the Parasitic Nematode Ascaris suum. PLoS Neglected Tropical Diseases. 2008;2:e246.
    4. Meng S, Brown DE, Ebbole DJ, Torto-Alalibo T, Oh YY, Deng J, Mitchell TK, Dean RA. Gene Ontology annotation of the rice blast fungus, Magnaporthe oryzae. BMC Microbiology. 2009;9:S8.
    5. Meyer E, Aglyamova GV, Wang S, Buchanan-Carter J, Abrego D, Colbourne JK, Willis BL, Matz MV. Sequencing and de novo analysis of a coral larval transcriptome using 454 GSFlx. BMC Genomics. 2009;10:219.
