Exploiting MeSH indexing in MEDLINE to generate a data set for word sense disambiguation

Antonio J Jimeno-Yepes¹, Bridget T McInnes, Alan R Aronson

PMID: 21635749
PMCID: PMC3123611
DOI: 10.1186/1471-2105-12-223

Antonio J Jimeno-Yepes et al. BMC Bioinformatics. 2011 Jun 2;12:223.

Affiliation

¹ National Library of Medicine, Bethesda, MD 20894, USA. antonio.jimeno@gmail.com

Abstract

Background: Evaluation of Word Sense Disambiguation (WSD) methods in the biomedical domain is difficult because the available resources are either too small or too focused on specific types of entities (e.g. diseases or genes). We present a method that can be used to automatically develop a WSD test collection using the Unified Medical Language System (UMLS) Metathesaurus and the manual MeSH indexing of MEDLINE. We demonstrate the use of this method by developing such a data set, called MSH WSD.

Methods: In our method, the Metathesaurus is first screened to identify ambiguous terms whose possible senses consist of two or more MeSH headings. We then use each ambiguous term and its corresponding MeSH headings to extract MEDLINE citations in which the term and only one of the MeSH headings co-occur. Each occurrence of the term in such a citation is automatically assigned the UMLS Concept Unique Identifier (CUI) linked to that MeSH heading. We compare the characteristics of the MSH WSD data set to the previously existing NLM WSD data set.
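
The selection rule described in the Methods can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `label_citation`, the simple substring match, and the placeholder CUI strings are assumptions made for the example. The heading names are taken from the *lens* example in Figure 1.

```python
from typing import Dict, Optional, Set

# Placeholder identifiers, NOT real UMLS CUIs (assumption for illustration).
SENSES: Dict[str, str] = {
    "Lens Diseases": "CUI_LENS_DISEASES",
    "Lens, Crystalline": "CUI_LENS_CRYSTALLINE",
    "Lenses": "CUI_LENSES",
}


def label_citation(
    term: str,
    text: str,
    citation_headings: Set[str],
    sense_to_cui: Dict[str, str],
) -> Optional[str]:
    """Return the CUI for `term` in this citation, or None if it is unusable.

    A citation is usable only if the term occurs in its text and exactly one
    of the candidate MeSH headings was assigned to it by the indexers.
    """
    # Crude case-insensitive substring match (assumption; real term matching
    # would need tokenization).
    if term.lower() not in text.lower():
        return None
    matched = [h for h in sense_to_cui if h in citation_headings]
    if len(matched) != 1:
        return None  # zero or several candidate senses: ambiguous, skip
    return sense_to_cui[matched[0]]
```

A citation indexed with both *Lens Diseases* and *Lenses* returns `None`, which is what keeps the automatically assigned labels unambiguous.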

Results: The resulting MSH WSD data set consists of 106 ambiguous abbreviations, 88 ambiguous terms and 9 which are a combination of both, for a total of 203 ambiguous entities. For each ambiguous term/abbreviation, the data set contains a maximum of 100 instances per sense obtained from MEDLINE. We evaluated the reliability of the MSH WSD data set using existing knowledge-based methods and compared their performance with the results these algorithms previously obtained on the pre-existing data set, NLM WSD. We show that the knowledge-based methods achieve different results but keep their relative performance except for the Journal Descriptor Indexing (JDI) method, whose performance is below the other methods.
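
The cap of 100 instances per sense can be illustrated as follows. The abstract does not say how instances were chosen when more than 100 were available, so this sketch simply keeps the first ones encountered; that choice, the function name `cap_per_sense`, and the `(pmid, cui)` tuple layout are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def cap_per_sense(
    instances: List[Tuple[int, str]], limit: int = 100
) -> List[Tuple[int, str]]:
    """Keep at most `limit` (pmid, cui) instances for each sense (CUI)."""
    counts: Dict[str, int] = defaultdict(int)
    kept: List[Tuple[int, str]] = []
    for pmid, cui in instances:
        if counts[cui] < limit:  # still room for this sense
            counts[cui] += 1
            kept.append((pmid, cui))
    return kept
```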

Conclusions: The MSH WSD data set allows the evaluation of WSD algorithms in the biomedical domain. Compared to previously existing data sets, MSH WSD contains a larger number of biomedical terms/abbreviations and covers the largest set of UMLS Semantic Types. Furthermore, the MSH WSD data set has been generated automatically reusing already existing annotations and, therefore, can be regenerated from subsequent UMLS versions.

Figures

**Figure 1**
**Example query for one of the senses of term lens**. PubMed query used to retrieve citations which contain the term *lens* when it is related to *lens diseases*. The retrieved citations should have been indexed with the MeSH Heading *lens diseases* and should not be indexed with *Lens, Crystalline* or *Lenses*.
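
A query of the kind this caption describes could be assembled as sketched below. The exact query text from the figure is not reproduced on this page, so the field tags chosen here (`[Title/Abstract]` and `[MeSH Terms]`) and the function name `build_sense_query` are assumptions. The function only builds the query string and does not contact PubMed.

```python
from typing import List


def build_sense_query(
    term: str, target_heading: str, other_headings: List[str]
) -> str:
    """Build a PubMed query for one sense of an ambiguous term.

    The citation must mention the term, be indexed with the target MeSH
    heading, and not be indexed with any competing heading.
    """
    query = f'"{term}"[Title/Abstract] AND "{target_heading}"[MeSH Terms]'
    for heading in other_headings:
        # Exclude citations indexed with a competing sense.
        query += f' NOT "{heading}"[MeSH Terms]'
    return query


print(build_sense_query("lens", "Lens Diseases", ["Lens, Crystalline", "Lenses"]))
```

Note that `[MeSH Terms]` searches in PubMed also match narrower headings by default, so a stricter variant might be needed if the competing headings are related in the MeSH hierarchy.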

**Figure 2**
**WSD example for the term cold in ARFF format**. The *@RELATION* line contains the list of concepts from the Metathesaurus. Each data line has the PMID of the citation, the text where the ambiguous term appears and the sense number.
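
The layout described in this caption can be sketched with a small writer. Only the three data fields and the concept list on the `@RELATION` line come from the caption; the attribute names, the underscore separator, the numeric sense labels, and the placeholder CUIs and PMID in the test data are assumptions.

```python
from typing import List, Tuple


def to_arff(cuis: List[str], rows: List[Tuple[int, str, int]]) -> str:
    """Serialize (pmid, text, sense_number) rows for one ambiguous term."""
    senses = ",".join(str(i) for i in range(1, len(cuis) + 1))
    header = [
        "@RELATION " + "_".join(cuis),  # candidate concepts for the term
        "@ATTRIBUTE PMID integer",
        "@ATTRIBUTE citation string",
        "@ATTRIBUTE class {" + senses + "}",
        "@DATA",
    ]
    data = []
    for pmid, text, sense in rows:
        # Escape backslashes and double quotes inside the quoted string.
        escaped = text.replace("\\", "\\\\").replace('"', '\\"')
        data.append(f'{pmid},"{escaped}",{sense}')
    return "\n".join(header + data)
```

Each sense number indexes into the concept list on the `@RELATION` line, so sense 1 refers to the first concept listed.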

Grants and funding

Intramural NIH HHS/United States