Semi-automatic semantic annotation of PubMed queries: a study on quality, efficiency, satisfaction

Aurélie Névéol¹, Rezarta Islamaj Doğan, Zhiyong Lu

PMID: 21094696
PMCID: PMC3063330
DOI: 10.1016/j.jbi.2010.11.001

J Biomed Inform. 2011 Apr;44(2):310-8. Epub 2010 Nov 20.

Affiliation

¹ National Center for Biotechnology Information, US National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894, USA.

Abstract

Information processing algorithms require significant amounts of annotated data for training and testing. The availability of such data is often hindered by the complexity and high cost of production. In this paper, we investigate the benefits of a state-of-the-art tool to help with the semantic annotation of a large set of biomedical queries. Seven annotators were recruited to annotate a set of 10,000 PubMed® queries with 16 biomedical and bibliographic categories. About half of the queries were annotated from scratch, while the other half were automatically pre-annotated and manually corrected. The impact of the automatic pre-annotations was assessed on several aspects of the task: time, number of actions, annotator satisfaction, inter-annotator agreement, and the quality and number of the resulting annotations. The analysis showed that 28.9% fewer hand annotations were required when pre-annotations from automatic tools were used. As a result, overall annotation time was substantially lower with pre-annotations, while inter-annotator agreement was significantly higher. In addition, there was no statistically significant difference in the semantic distribution or number of annotations produced when pre-annotations were used. The annotated query corpus is freely available to the research community. This study shows that most annotators found automatic pre-annotations helpful. Our experience suggests using an automatic tool to assist large-scale manual annotation projects: doing so helps speed up annotation and improve annotation consistency while maintaining the high quality of the final annotations.

Published by Elsevier Inc.

Figures

**Figure 1**
The sixteen-category annotation scheme used to annotate PubMed queries is displayed on the left side of the Knowtator screenshot. A sample annotated query from the corpus is shown in the middle. Note that the categories Devices and Living Beings correspond to the eponymous UMLS Semantic Groups, while the categories Body Part, Cell or Cell Component, and Tissue correspond to UMLS Semantic Types. Specific definitions and examples for all categories are given in Table 1 and in the supplementary data.

**Figure 2**
Comparison of annotation time in minutes (a) overall and (b) for the seven annotators as the annotation task progresses. Full circles represent batches with pre-annotations while hollow circles represent batches without pre-annotations. Note that all the batches with pre-annotation appear between the vertical dashed lines.

**Figure 3**
Comparison of the number of actions (a) overall and (b) for the seven annotators as a function of annotation time. Full circles represent batches with pre-annotations while hollow circles represent batches without pre-annotations.

**Figure 4**
Comparison of (a) final number of annotations and (b) inter-annotator agreement for batches with and without pre-annotations. In (a) each circle represents a batch of queries; in (b) each circle represents an annotator pair.

**Figure 5**
Comparison of the distribution of annotations over categories for batches with and without pre-annotations (stars indicate statistical differences in the distribution at the category level).
