Text mining and natural language processing approaches for automatic categorization of lay requests to web-based expert forums
- PMID: 19632978
- PMCID: PMC2762848
- DOI: 10.2196/jmir.1123
Abstract
Background: Both healthy and sick people increasingly use electronic media to obtain medical information and advice. For example, Internet users may send requests to Web-based expert forums, or so-called "ask the doctor" services.
Objective: To automatically classify lay requests to an Internet medical expert forum using a combination of different text-mining strategies.
Methods: We first manually classified a sample of 988 requests directed to an involuntary childlessness forum on the German website "Rund ums Baby" ("Everything about Babies") into one or more of 38 categories belonging to two dimensions ("subject matter" and "expectations"). After creating start and synonym lists, we calculated the average Cramer's V statistic for the association of each word with each category. We also used principal component analysis and singular value decomposition as further text-mining strategies. With these measures we trained regression models and, on the basis of the best regression models, determined for each request the probability of belonging to each of the 38 categories, with a cutoff of 50%. Recall and precision in a test sample were calculated as measures of the quality of the automatic classification.
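As an illustration of the word-category association step, the following is a minimal sketch of Cramer's V for a single word and a single category, computed from a 2x2 table of word presence against category membership. It is not the authors' implementation: the function name `cramers_v` and the boolean-list inputs are assumptions made for this example, and the averaging across categories described above is not shown.

```python
import math


def cramers_v(word_present, in_category):
    """Cramer's V for two equal-length boolean sequences (a 2x2 table).

    word_present[i] is True if the word occurs in request i;
    in_category[i] is True if request i was manually assigned the category.
    Both inputs are hypothetical representations chosen for this sketch.
    """
    n = len(word_present)
    if n == 0 or n != len(in_category):
        raise ValueError("inputs must be non-empty and of equal length")

    # Build the 2x2 contingency table: rows = word absent/present,
    # columns = not in category / in category.
    counts = [[0, 0], [0, 0]]
    for w, c in zip(word_present, in_category):
        counts[int(bool(w))][int(bool(c))] += 1

    row_totals = [sum(row) for row in counts]
    col_totals = [counts[0][j] + counts[1][j] for j in range(2)]

    # If a row or column is empty, the association is undefined;
    # this sketch reports it as 0.0.
    if 0 in row_totals or 0 in col_totals:
        return 0.0

    # Pearson chi-square statistic without continuity correction.
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (counts[i][j] - expected) ** 2 / expected

    # V = sqrt(chi2 / (n * (min(rows, cols) - 1))); for a 2x2 table
    # the denominator reduces to n.
    return math.sqrt(chi2 / n)
```

A value near 1 indicates that the word occurs almost exclusively in requests of that category (or almost exclusively outside it), while a value near 0 indicates no association.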
Results: According to the manual classification of 988 documents, 102 (10%) documents fell into the category "in vitro fertilization (IVF)," 81 (8%) into the category "ovulation," 79 (8%) into "cycle," and 57 (6%) into "semen analysis." These were the four most frequent categories in the subject matter dimension (consisting of 32 categories). The expectation dimension comprised six categories; we classified 533 documents (54%) as "general information" and 351 (36%) as a wish for "treatment recommendations." The generation of indicator variables based on the chi-square analysis and Cramer's V proved to be the best approach for automatic classification in about half of the categories. In combination with the two other approaches, 100% precision and 100% recall were realized in 18 (47%) out of the 38 categories in the test sample. For 35 (92%) categories, precision and recall were better than 80%. For some categories, the input variables (ie, "words") also included variables from other categories, most often with a negative sign. For example, absence of words predictive for "menstruation" was a strong indicator for the category "pregnancy test."
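The quality measures reported above can be sketched as follows for one category: predicted probabilities are turned into yes/no assignments at the 50% cutoff, and precision and recall are computed against the manual classification. This is a sketch under stated assumptions, not the authors' code: the function name `precision_recall`, the list-based inputs, and the choice to treat a probability exactly equal to the cutoff as a positive assignment are all hypothetical.

```python
def precision_recall(probabilities, true_labels, cutoff=0.5):
    """Precision and recall for one category at a probability cutoff.

    probabilities[i] is the model's predicted probability that request i
    belongs to the category; true_labels[i] is the manual classification.
    Assumption: a probability equal to the cutoff counts as a positive.
    """
    if len(probabilities) != len(true_labels):
        raise ValueError("inputs must have equal length")

    predicted = [p >= cutoff for p in probabilities]

    true_pos = sum(1 for p, t in zip(predicted, true_labels) if p and t)
    false_pos = sum(1 for p, t in zip(predicted, true_labels) if p and not t)
    false_neg = sum(1 for p, t in zip(predicted, true_labels) if not p and t)

    # Precision: share of automatic assignments that are correct.
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    # Recall: share of manually assigned requests that were found.
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall
```

Under this definition, a category reaches 100% precision and 100% recall only when the automatic assignments in the test sample match the manual ones exactly.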
Conclusions: Our approach suggests a way of automatically classifying and analyzing unstructured information in Internet expert forums. The technique can perform a preliminary categorization of new requests and help Internet medical experts to better handle the mass of information and to give professional feedback.
Conflict of interest statement
HWM is one of the experts who work for the Rund ums Baby forum on an honorary basis. UR is an employee of SAS Institute Germany and works in the Enterprise Intelligence Competence Centre.