Database (Oxford). 2012 Nov 22:2012:bas037.

doi: 10.1093/database/bas037. Print 2012.

Collaborative biocuration--text-mining development task for document prioritization for curation

Thomas C Wiegers¹, Allan Peter Davis, Carolyn J Mattingly

PMID: 23180769
PMCID: PMC3504477
DOI: 10.1093/database/bas037

Thomas C Wiegers et al. Database (Oxford). 2012.

Affiliation

¹ Department of Biology, North Carolina State University, Raleigh, NC 27695-7617, USA. tcwieger@ncsu.edu


Abstract

The Critical Assessment of Information Extraction systems in Biology (BioCreAtIvE) challenge evaluation is a community-wide effort for evaluating text mining and information extraction systems for the biological domain. The 'BioCreative Workshop 2012' subcommittee identified three areas, or tracks, that comprised independent but complementary aspects of data curation in which they sought community input: literature triage (Track I), curation workflow (Track II) and text mining/natural language processing (NLP) systems (Track III). Track I participants were invited to develop tools or systems that would effectively triage and prioritize articles for curation and present results in a prototype web interface. Training and test datasets were derived from the Comparative Toxicogenomics Database (CTD; http://ctdbase.org) and consisted of manuscripts from which chemical-gene-disease data were manually curated. A total of seven groups participated in Track I. For the triage component, the effectiveness of participant systems was measured by aggregate recall of gene, disease and chemical 'named-entity recognition' (NER) across articles; the effectiveness of 'information retrieval' (IR) was also measured based on 'mean average precision' (MAP). Top recall scores for gene, disease and chemical NER were 49, 65 and 82%, respectively; the top MAP score was 80%. Each participating group also developed a prototype web interface; these interfaces were evaluated for functionality and ease of use by CTD's biocuration project manager. In this article, we present a detailed description of the challenge and a summary of the results.

Figures

**Figure 1**
The BioCreative Track I File Upload Facility. A web interface was developed to allow participants to upload their results (back panel). Following successful uploads, a report was generated and returned to each participant that contained summary or detailed information for each dataset; a summary report is shown.

**Figure 2**
MAP (9) score results for each participating group. For MAP score calculations, an article was counted as relevant if it had one or more associated curated interactions. Across the groups, MAP scores were fairly high and consistent, ranging from 71% to 80%.
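As a rough illustration of the MAP calculation described in this caption, the sketch below computes average precision for a ranked list of articles and then averages it over several lists. It assumes one ranked list per query (the grouping is not stated here), and the names `average_precision`, `mean_average_precision`, `rankings` and `interaction_counts` are hypothetical; this is not the organizers' scoring code. The only rule taken from the caption is that an article counts as relevant if it has one or more curated interactions.

```python
from typing import Dict, List, Set


def average_precision(ranked_ids: List[str], relevant_ids: Set[str]) -> float:
    """Average precision of one ranked list against a set of relevant article IDs."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, article_id in enumerate(ranked_ids, start=1):
        if article_id in relevant_ids:
            hits += 1
            # Precision at this rank, accumulated only at relevant positions.
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)


def mean_average_precision(
    rankings: Dict[str, List[str]],
    interaction_counts: Dict[str, Dict[str, int]],
) -> float:
    """Mean of per-query average precision.

    rankings: query -> ranked article IDs (best first).
    interaction_counts: query -> {article ID: number of curated interactions}.
    """
    scores = []
    for query, ranked_ids in rankings.items():
        counts = interaction_counts.get(query, {})
        # Relevance rule from the caption: one or more curated interactions.
        relevant = {article_id for article_id, n in counts.items() if n >= 1}
        scores.append(average_precision(ranked_ids, relevant))
    return sum(scores) / len(scores) if scores else 0.0


# Toy data, invented purely for illustration.
rankings = {"q1": ["a", "b", "c", "d"], "q2": ["x", "y"]}
interaction_counts = {
    "q1": {"a": 2, "b": 0, "c": 1, "d": 0},
    "q2": {"x": 0, "y": 3},
}
map_score = mean_average_precision(rankings, interaction_counts)
```

In the toy data, the first list places relevant articles at ranks 1 and 3 and the second at rank 2, so the per-query values are 5/6 and 1/2.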

**Figure 3**
Gene recall results for each participating group. The ability of text-mining tools to recognize curated genes was measured; terms and their synonyms were counted as matches. Gene recall ranged from 2% to 49%.

**Figure 4**
Chemical recall results for each participating group. The ability of text-mining tools to recognize curated chemicals was measured; terms and their synonyms were counted as matches. Chemical recall ranged from 5% to 82%.

**Figure 5**
Disease recall results for each participating group. The ability of text-mining tools to recognize curated diseases was measured; terms and their synonyms were counted as matches. Disease recall ranged from <1% to 65%.
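The recall measure used in the gene, chemical and disease captions can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation code used in the challenge: the function names and data layout are hypothetical, and the case-insensitive, whitespace-normalized matching is an assumption added here. The only rule taken from the captions is that a curated term counts as recognized if the tool reports either the term itself or one of its synonyms.

```python
from typing import Dict, Iterable, Set


def normalize(term: str) -> str:
    """Lowercase and collapse whitespace (an assumed matching policy)."""
    return " ".join(term.lower().split())


def aggregate_recall(
    curated: Dict[str, Set[str]],
    predicted: Dict[str, Iterable[str]],
    synonyms: Dict[str, Set[str]],
) -> float:
    """Fraction of curated (article, term) pairs recognized by a tool.

    curated: article ID -> curated terms for that article.
    predicted: article ID -> terms reported by the text-mining tool.
    synonyms: curated term -> accepted synonyms of that term.
    """
    total = 0
    found = 0
    for article_id, gold_terms in curated.items():
        mentions = {normalize(m) for m in predicted.get(article_id, ())}
        for term in gold_terms:
            total += 1
            # A match on the term itself or on any synonym counts once.
            accepted = {normalize(term)}
            accepted |= {normalize(s) for s in synonyms.get(term, ())}
            if accepted & mentions:
                found += 1
    return found / total if total else 0.0


# Toy data, invented purely for illustration.
curated = {"pmid1": {"TP53", "NFKB1"}, "pmid2": {"TNF"}}
predicted = {"pmid1": ["p53"], "pmid2": ["tnf", "IL6"]}
synonyms = {"TP53": {"p53"}}
recall = aggregate_recall(curated, predicted, synonyms)
```

In the toy data, TP53 is matched through its synonym and TNF through case-insensitive comparison, while NFKB1 is missed, so two of the three curated terms are recovered. Extra predictions such as IL6 do not affect recall.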

**Figure 6**
Aggregate metrics for each participating group. The results of MAP (9) scores and chemical, gene, disease and action term recall scores are aggregated onto a single bar graph for each participating group. Two of the groups clearly distinguished themselves with respect to aggregate benchmarking results. Group 121 held the highest MAP score (80%) while also delivering strong recall scores in the three major recall categories (chemicals, genes and diseases). Group 116 delivered the highest recall scores in two of the three major data categories (i.e. gene and disease recall). Three other groups (120, 139 and 130) had respectable recall scores in most, if not all, of the major data categories.

**Figure 7**
(a) Group 121 web interface. A screenshot of Group 121’s ranked list of chemicals for curation in their web interface. (b) A screenshot of Group 121’s curation detail page in their web interface. (c) Screenshots of two of Group 121’s data management-related pages in their web interface.

**Figure 8**
(a) Group 116 web interface. A screenshot of Group 116’s Concepts tab in their web interface. (b) A screenshot of Group 116’s Interactions tab in their web interface. (c) A screenshot of Group 116’s Terms tab in their web interface.

**Figure 9**
Group 133 web interface. A screenshot of Group 133’s web interface.

**Figure 10**
Group 120 web interface. A screenshot of Group 120’s web interface.

**Figure 11**
Group 139 web interface. A screenshot of Group 139’s web interface.

**Figure 12**
Group 130 web interface. A screenshot of Group 130’s web interface.

**Figure 13**
Group 141 web interface. A screenshot of Group 141’s web interface.
