Comparing a Rule Based vs. Statistical System for Automatic Categorization of MEDLINE Documents According to Biomedical Specialty
- PMID: 19956557
- PMCID: PMC2782854
- DOI: 10.1002/asi.21170
Abstract
Automatic document categorization is an important research problem in Information Science and Natural Language Processing. Many applications, including Word Sense Disambiguation and Information Retrieval in large collections, can benefit from such categorization. This paper focuses on automatic categorization of documents from the biomedical literature into broad discipline-based categories. Two different systems are described and contrasted: CISMeF, which assigns metaterms (MTs) using rules based on human indexing of the documents with the Medical Subject Headings® (MeSH®) controlled vocabulary, and Journal Descriptor Indexing (JDI), which is based on human categorization of about 4,000 journals and on statistical associations between journal descriptors (JDs) and textwords in the documents. We evaluate and compare the performance of these systems against a gold standard of human-assigned categories for one hundred MEDLINE documents, using six measures selected from trec_eval. The results show that performance is comparable for five of the measures and that JDI is superior for one. We conclude that these results favor JDI, given the significantly greater intellectual overhead involved in human indexing and in maintaining a rule base for mapping MeSH terms to MTs. We also note a JDI method that associates JDs with MeSH indexing rather than textwords; it may be worthwhile to investigate whether this statistical JDI method and the rule-based CISMeF could be combined, and to evaluate whether the two are complementary.
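The statistical approach summarized in the abstract can be illustrated with a toy sketch. This is not the JDI implementation: the abstract does not give the scoring formula or name the six trec_eval measures, so the word-to-descriptor score used here (the fraction of training documents containing a word that carry a given descriptor), the averaging over textwords, and the choice of average precision as the evaluation measure are all assumptions, and the function names and sample data are hypothetical.

```python
from collections import defaultdict


def train_word_jd_scores(training_docs):
    """Estimate word-to-descriptor association scores (assumed formula).

    training_docs: list of (words, journal_descriptors) pairs, where the
    descriptors are those of the journal the document appeared in.
    """
    word_count = defaultdict(int)
    co_count = defaultdict(lambda: defaultdict(int))
    for words, jds in training_docs:
        for w in set(words):
            word_count[w] += 1
            for jd in jds:
                co_count[w][jd] += 1
    # Score = share of documents containing the word that carry the descriptor.
    return {
        w: {jd: c / word_count[w] for jd, c in jd_counts.items()}
        for w, jd_counts in co_count.items()
    }


def rank_jds(text_words, scores):
    """Rank descriptors for a new text by averaging scores over known words."""
    known = [w for w in text_words if w in scores]
    totals = defaultdict(float)
    for w in known:
        for jd, s in scores[w].items():
            totals[jd] += s
    n = len(known) or 1
    ranked = [(jd, t / n) for jd, t in totals.items()]
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))


def average_precision(ranked_labels, gold):
    """Average precision of a ranked label list against a gold-standard set."""
    hits = 0
    total = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label in gold:
            hits += 1
            total += hits / rank
    return total / len(gold) if gold else 0.0


# Hypothetical toy data, for illustration only.
training = [
    (["heart", "valve"], ["Cardiology"]),
    (["heart", "tumor"], ["Cardiology", "Neoplasms"]),
    (["tumor", "biopsy"], ["Neoplasms"]),
]
scores = train_word_jd_scores(training)
ranking = rank_jds(["heart", "valve", "unknown"], scores)
labels = [jd for jd, _ in ranking]
print(ranking)
print(average_precision(labels, {"Cardiology"}))
```

A rule-based system such as CISMeF would instead map each human-assigned MeSH term to its metaterms through a maintained lookup table, which is the intellectual overhead the abstract refers to.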
Similar articles
- Journal descriptor indexing tool for categorizing text according to discipline or semantic type. AMIA Annu Symp Proc. 2006;2006:960. PMID: 17238579. Free PMC article.
- Word Sense Disambiguation by Selecting the Best Semantic Type Based on Journal Descriptor Indexing: Preliminary Experiment. J Am Soc Inf Sci Technol. 2006 Jan 1;57(1):96-113. doi: 10.1002/asi.20257. PMID: 19890434. Free PMC article.
- Automatic Indexing of Documents from Journal Descriptors: A Preliminary Investigation. J Am Soc Inf Sci. 1999;50(8):661-674. doi: 10.1002/(SICI)1097-4571(1999)50:8<661::AID-ASI4>3.0.CO;2-R. PMID: 21712970. Free PMC article.
- Folic acid supplementation and malaria susceptibility and severity among people taking antifolate antimalarial drugs in endemic areas. Cochrane Database Syst Rev. 2022 Feb 1;2(2022):CD014217. doi: 10.1002/14651858.CD014217. PMID: 36321557. Free PMC article.
- Orthopaedic literature and MeSH. Clin Orthop Relat Res. 2010 Oct;468(10):2621-6. doi: 10.1007/s11999-010-1387-4. PMID: 20623263. Free PMC article. Review.
Cited by
- Extracting laboratory test information from biomedical text. J Pathol Inform. 2013 Aug 31;4:23. doi: 10.4103/2153-3539.117450. eCollection 2013. PMID: 24083058. Free PMC article.
- How are the different specialties represented in the major journals in general medicine? BMC Med Inform Decis Mak. 2011 Jan 21;11:3. doi: 10.1186/1472-6947-11-3. PMID: 21255439. Free PMC article.