New directions in biomedical text annotation: definitions, guidelines and corpus construction

W John Wilbur¹, Andrey Rzhetsky, Hagit Shatkay

PMID: 16867190
PMCID: PMC1559725
DOI: 10.1186/1471-2105-7-356

BMC Bioinformatics. 2006 Jul 25;7:356.

Affiliation

¹ National Center for Biotechnology Information NLM, NIH, Bethesda, MD, USA. wilbur@ncbi.nlm.nih.gov

Abstract

Background: While biomedical text mining is emerging as an important research area, practical results have proven difficult to achieve. We believe that an important first step towards more accurate text-mining lies in the ability to identify and characterize text that satisfies various types of information needs. We report here the results of our inquiry into properties of scientific text that have sufficient generality to transcend the confines of a narrow subject area, while supporting practical mining of text for factual information. Our ultimate goal is to annotate a significant corpus of biomedical text and train machine learning methods to automatically categorize such text along certain dimensions that we have defined.

Results: We have identified five qualitative dimensions that we believe characterize a broad range of scientific sentences, and are therefore useful for supporting a general approach to text-mining: focus, polarity, certainty, evidence, and directionality. We define these dimensions and describe the guidelines we have developed for annotating text with regard to them. To examine the effectiveness of the guidelines, twelve annotators independently annotated the same set of 101 sentences that were randomly selected from current biomedical periodicals. Analysis of these annotations shows 70-80% inter-annotator agreement, suggesting that our guidelines indeed present a well-defined, executable and reproducible task.
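The agreement figure above can be illustrated with a small sketch. The abstract does not say which agreement statistic was used, so the code below assumes plain average pairwise percent agreement, computed separately per dimension. The function name `pairwise_agreement`, the label values, and the toy data are all hypothetical and are not taken from the paper or its guidelines.

```python
from itertools import combinations


def pairwise_agreement(labels_by_annotator):
    """Average, over all annotator pairs, of the fraction of sentences
    on which the two annotators assigned the same label.

    labels_by_annotator: list of equal-length label lists, one per annotator.
    """
    pairs = list(combinations(labels_by_annotator, 2))
    if not pairs:
        raise ValueError("need at least two annotators")
    total = 0.0
    for a, b in pairs:
        if len(a) != len(b) or not a:
            raise ValueError("annotators must label the same non-empty set")
        matches = sum(x == y for x, y in zip(a, b))
        total += matches / len(a)
    return total / len(pairs)


# Toy data (hypothetical labels): three annotators, four sentences,
# two of the five dimensions named in the abstract.
toy = {
    "polarity": [
        ["P", "P", "N", "P"],
        ["P", "N", "N", "P"],
        ["P", "P", "N", "N"],
    ],
    "certainty": [
        [3, 3, 2, 0],
        [3, 3, 2, 1],
        [3, 3, 2, 0],
    ],
}

# One agreement score per dimension.
agreement = {dim: pairwise_agreement(labels) for dim, labels in toy.items()}
```

Percent agreement does not correct for chance; a chance-corrected statistic such as Cohen's or Fleiss' kappa is a common alternative when label distributions are skewed.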

Conclusion: We present our guidelines defining a text annotation task, along with annotation results from multiple independently produced annotations, demonstrating the feasibility of the task. The annotation of a very large corpus of documents along these guidelines is currently ongoing. These annotations form the basis for the categorization of text along multiple dimensions, to support viable text mining for experimental results, methodology statements, and other forms of information. We are currently developing machine learning methods, to be trained and tested on the annotated corpus, that would allow for the automatic categorization of biomedical text along the general dimensions that we have presented. The guidelines in full detail, along with annotated examples, are publicly available.

Cited by

Extracting semantically enriched events from biomedical literature.
Miwa M, Thompson P, McNaught J, Kell DB, Ananiadou S. BMC Bioinformatics. 2012 May 23;13:108. doi: 10.1186/1471-2105-13-108. PMID: 22621266.
Elaboration of a new framework for fine-grained epidemiological annotation.
Valentin S, Arsevska E, Vilain A, De Waele V, Lancelot R, Roche M. Sci Data. 2022 Oct 26;9(1):655. doi: 10.1038/s41597-022-01743-2. PMID: 36289243.
Automatic categorization of diverse experimental information in the bioscience literature.
Fang R, Schindelman G, Van Auken K, Fernandes J, Chen W, Wang X, Davis P, Tuli MA, Marygold SJ, Millburn G, Matthews B, Zhang H, Brown N, Gelbart WM, Sternberg PW. BMC Bioinformatics. 2012 Jan 26;13:16. doi: 10.1186/1471-2105-13-16. PMID: 22280404.
Biomedical text mining and its applications.
Rodriguez-Esteban R. PLoS Comput Biol. 2009 Dec;5(12):e1000597. doi: 10.1371/journal.pcbi.1000597. Epub 2009 Dec 24. PMID: 20041219.
The BioLexicon: a large-scale terminological resource for biomedical text mining.
Thompson P, McNaught J, Montemagni S, Calzolari N, del Gratta R, Lee V, Marchi S, Monachini M, Pezik P, Quochi V, Rupp CJ, Sasaki Y, Venturi G, Rebholz-Schuhmann D, Ananiadou S. BMC Bioinformatics. 2011 Oct 12;12:397. doi: 10.1186/1471-2105-12-397. PMID: 21992002.


Grants and funding

Intramural NIH HHS/United States
