New directions in biomedical text annotation: definitions, guidelines and corpus construction
- PMID: 16867190
- PMCID: PMC1559725
- DOI: 10.1186/1471-2105-7-356
Abstract
Background: While biomedical text mining is emerging as an important research area, practical results have proven difficult to achieve. We believe that an important first step towards more accurate text-mining lies in the ability to identify and characterize text that satisfies various types of information needs. We report here the results of our inquiry into properties of scientific text that have sufficient generality to transcend the confines of a narrow subject area, while supporting practical mining of text for factual information. Our ultimate goal is to annotate a significant corpus of biomedical text and train machine learning methods to automatically categorize such text along certain dimensions that we have defined.
Results: We have identified five qualitative dimensions that we believe characterize a broad range of scientific sentences, and are therefore useful for supporting a general approach to text-mining: focus, polarity, certainty, evidence, and directionality. We define these dimensions and describe the guidelines we have developed for annotating text with regard to them. To examine the effectiveness of the guidelines, twelve annotators independently annotated the same set of 101 sentences that were randomly selected from current biomedical periodicals. Analysis of these annotations shows 70-80% inter-annotator agreement, suggesting that our guidelines indeed present a well-defined, executable and reproducible task.
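The agreement figure above can be illustrated with a small sketch. The abstract does not state how agreement was computed, so the code assumes simple average pairwise percentage agreement on a single dimension. The function name `pairwise_agreement`, the labels "P"/"N", and the toy data are hypothetical and are not taken from the study.

```python
from itertools import combinations

# The five dimensions named in the abstract.
DIMENSIONS = ("focus", "polarity", "certainty", "evidence", "directionality")


def pairwise_agreement(labels_by_annotator):
    """Average, over all annotator pairs, of the fraction of sentences
    on which the two annotators assigned the same label.

    labels_by_annotator: one list of labels per annotator, all of the
    same non-zero length (one label per sentence, for one dimension).
    """
    pairs = list(combinations(labels_by_annotator, 2))
    if not pairs:
        raise ValueError("need at least two annotators")
    total = 0.0
    for a, b in pairs:
        if len(a) != len(b) or not a:
            raise ValueError("annotators must label the same non-empty sentence set")
        matches = sum(x == y for x, y in zip(a, b))
        total += matches / len(a)
    return total / len(pairs)


# Hypothetical toy data: three annotators, four sentences, one dimension.
toy_polarity = [
    ["P", "P", "N", "P"],
    ["P", "N", "N", "P"],
    ["P", "P", "N", "N"],
]
agreement = pairwise_agreement(toy_polarity)
print(round(agreement, 3))  # 0.667
```

Raw percentage agreement does not correct for agreement expected by chance; a chance-corrected statistic such as a kappa coefficient would be a stricter alternative, at the cost of a more involved calculation.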
Conclusion: We present our guidelines defining a text annotation task, along with annotation results from multiple independently produced annotations, demonstrating the feasibility of the task. Annotation of a very large corpus of documents according to these guidelines is ongoing. These annotations form the basis for the categorization of text along multiple dimensions, to support viable text mining for experimental results, methodology statements, and other forms of information. We are currently developing machine learning methods, to be trained and tested on the annotated corpus, that would allow for the automatic categorization of biomedical text along the general dimensions that we have presented. The guidelines in full detail, along with annotated examples, are publicly available.