Mining of Textual Health Information from Reddit: Analysis of Chronic Diseases With Extracted Entities and Their Relations
- PMID: 31199327
- PMCID: PMC6595941
- DOI: 10.2196/12876
Abstract
Background: Social media platforms constitute a rich data source for natural language processing tasks such as named entity recognition, relation extraction, and sentiment analysis. In particular, health-related social media platforms provide insight into patients' experiences with diseases and treatment that differs from what is found in the scientific literature.
Objective: This paper aimed to report a study of entities related to chronic diseases and their relations in user-generated text posts. The major focus of our research is the study of biomedical entities found in health social media platforms, the relations between them, and the way people suffering from chronic diseases express themselves.
Methods: We collected a corpus of 17,624 text posts from disease-specific subreddits of the social news and discussion website Reddit. For entity and relation extraction from this corpus, we employed the PKDE4J tool developed by Song et al (2015). PKDE4J is a text mining system that integrates dictionary-based entity extraction and rule-based relation extraction in a highly flexible and extensible framework.
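The combination of dictionary-based entity extraction and rule-based relation extraction described in the Methods can be illustrated with a minimal Python sketch. This is not PKDE4J and does not use its API; the dictionary entries, the function names, and the single co-occurrence rule are all assumptions chosen for illustration only.

```python
import re
from itertools import combinations

# Hypothetical toy dictionary (surface form -> entity type); not PKDE4J's dictionaries.
DICTIONARY = {
    "cancer": "disease",
    "asthma": "disease",
    "lymph": "anatomy",
    "lung": "anatomy",
}


def extract_entities(sentence):
    """Return (term, type, start offset) for each dictionary term matched as a whole word."""
    text = sentence.lower()
    found = []
    for term, etype in DICTIONARY.items():
        for match in re.finditer(r"\b" + re.escape(term) + r"\b", text):
            found.append((term, etype, match.start()))
    # Order the hits by their position in the sentence.
    return sorted(found, key=lambda entity: entity[2])


def extract_relations(sentence):
    """Assumed rule: two entities of different types in one sentence form a relation pair."""
    pairs = []
    for (t1, ty1, _), (t2, ty2, _) in combinations(extract_entities(sentence), 2):
        if ty1 != ty2:
            # Sort by type name so anatomy-disease and disease-anatomy collapse to one label.
            (type_a, term_a), (type_b, term_b) = sorted([(ty1, t1), (ty2, t2)])
            pairs.append((f"{type_a}-{type_b}", term_a, term_b))
    return pairs


post = "The cancer had spread to my lymph nodes."
print(extract_entities(post))
print(extract_relations(post))
```

A real system would add tokenization, normalization of spelling variants, and far richer rules; the sketch only shows how a dictionary lookup feeds a relation rule.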
Results: Using PKDE4J, we extracted 2 types of entities and relations: biomedical entities and relations and subject-predicate-object entity relations. In total, 82,138 entities and 30,341 relation pairs were extracted from the Reddit dataset. The most frequently mentioned entities were those related to oncological disease (2884 occurrences of cancer) and asthma (2180 occurrences). The anatomy-disease relation pair was the most frequent (5550 occurrences), with cancer and lymph being the most frequent entities in this pair. The manual validation of the extracted entities showed a very good performance of the system at the entity extraction task (3682/5151, 71.48% of the extracted entities were correctly labeled).
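The validation figure reported in the Results can be checked with a few lines of Python. The variable names are illustrative, and reading the ratio as a precision-style measure (correct labels over all validated extractions) is an assumption based on the wording of the abstract.

```python
# Counts reported in the Results section.
correctly_labeled = 3682
validated_entities = 5151

# Share of manually validated extractions that carried the correct label.
share_correct = correctly_labeled / validated_entities
print(f"{share_correct:.2%}")  # 71.48%
```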
Conclusions: This study showed that people are eager to share their personal experience with chronic diseases on social media platforms despite possible privacy and security issues. The results reported in this paper are promising and demonstrate the need for more in-depth studies on the way patients with chronic diseases express themselves on social media platforms.
Keywords: chronic disease; data mining; social media.
©Vasiliki Foufi, Tatsawan Timakum, Christophe Gaudet-Blavignac, Christian Lovis, Min Song. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 13.06.2019.
Conflict of interest statement
Conflicts of Interest: CL is editor-in-chief for JMIR Medical Informatics.