J Biomed Inform. 2011 Apr;44(2):310-8.
doi: 10.1016/j.jbi.2010.11.001. Epub 2010 Nov 20.

Semi-automatic semantic annotation of PubMed queries: a study on quality, efficiency, satisfaction

Aurélie Névéol et al. J Biomed Inform. 2011 Apr.

Abstract

Information processing algorithms require significant amounts of annotated data for training and testing. The availability of such data is often hindered by the complexity and high cost of production. In this paper, we investigate the benefits of a state-of-the-art tool to help with the semantic annotation of a large set of biomedical queries. Seven annotators were recruited to annotate a set of 10,000 PubMed® queries with 16 biomedical and bibliographic categories. About half of the queries were annotated from scratch, while the other half were automatically pre-annotated and manually corrected. The impact of the automatic pre-annotations was assessed on several aspects of the task: time, number of actions, annotator satisfaction, inter-annotator agreement, and quality and number of the resulting annotations. The analysis of annotation results showed that the number of required hand annotations was 28.9% lower when pre-annotated results from automatic tools were used. As a result, the overall annotation time was substantially lower when pre-annotations were used, while inter-annotator agreement was significantly higher. In addition, there was no statistically significant difference in the semantic distribution or number of annotations produced when pre-annotations were used. The annotated query corpus is freely available to the research community. This study shows that most annotators found automatic pre-annotations helpful. Our experience suggests that an automatic tool should be used to assist large-scale manual annotation projects: doing so speeds up annotation and improves annotation consistency while maintaining the high quality of the final annotations.
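
As an illustration of how inter-annotator agreement can be summarized over annotator pairs, the following is a minimal Python sketch. The abstract does not state which agreement measure was used, so the Dice/F1-style overlap here is an assumption, as are the function names, the tuple layout `(query_id, start, end, category)`, and the toy data. With seven annotators, this pairing scheme yields 21 annotator pairs.

```python
from itertools import combinations


def pairwise_agreement(a, b):
    """Dice/F1-style overlap between two annotators' annotation sets.

    Assumption: an annotation counts as matching only if query id,
    span boundaries and category are all identical.
    """
    if not a and not b:
        return 1.0  # two empty sets agree trivially
    return 2 * len(a & b) / (len(a) + len(b))


def agreement_over_pairs(annotations_by_annotator):
    """Return per-pair scores and their mean over all annotator pairs."""
    names = sorted(annotations_by_annotator)
    scores = {
        (x, y): pairwise_agreement(
            annotations_by_annotator[x], annotations_by_annotator[y]
        )
        for x, y in combinations(names, 2)
    }
    mean_score = sum(scores.values()) / len(scores)
    return scores, mean_score


# Hypothetical toy data: three annotators, one query, two spans each.
# Category names are taken from the Figure 1 caption; spans are made up.
toy = {
    "ann1": {("q1", 0, 5, "Body Part"), ("q1", 6, 12, "Devices")},
    "ann2": {("q1", 0, 5, "Body Part"), ("q1", 6, 12, "Living Beings")},
    "ann3": {("q1", 0, 5, "Body Part"), ("q1", 6, 12, "Devices")},
}

scores, mean_score = agreement_over_pairs(toy)
for pair, score in scores.items():
    print(pair, round(score, 3))
print("mean pairwise agreement:", round(mean_score, 3))
```

Exact-match comparison is the strictest choice; a relaxed variant that accepts overlapping spans or ignores the category would give higher scores on the same data.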

Figures

Figure 1
The sixteen-category annotation scheme used to annotate PubMed queries is displayed on the left side of the Knowtator screenshot. A sample annotated query from the corpus is shown in the middle. Note that the categories Devices and Living Beings correspond to the eponymous UMLS Semantic Groups, while the categories Body Part, Cell or Cell Component, and Tissue correspond to UMLS Semantic Types. Specific definitions and examples for all categories are given in Table 1 and in the supplementary data.
Figure 2
Comparison of annotation time in minutes (a) overall and (b) for the seven annotators as the annotation task progresses. Full circles represent batches with pre-annotations while hollow circles represent batches without pre-annotations. Note that all the batches with pre-annotation appear between the vertical dashed lines.
Figure 3
Comparison of the number of actions (a) overall and (b) for the seven annotators as a function of annotation time. Full circles represent batches with pre-annotations while hollow circles represent batches without pre-annotations.
Figure 4
Comparison of (a) final number of annotations and (b) inter-annotator agreement for batches with and without pre-annotations. In (a) each circle represents a batch of queries; in (b) each circle represents an annotator pair.
Figure 5
Comparison of the distribution of annotations over categories for batches with and without pre-annotations (stars indicate statistical differences in the distribution at the category level).
