J Med Internet Res. 2024 Feb 22:26:e48324.
doi: 10.2196/48324.

Identifying the Risk Factors of Allergic Rhinitis Based on Zhihu Comment Data Using a Topic-Enhanced Word-Embedding Model: Mixed Method Study and Cluster Analysis

Dongxiao Gu et al. J Med Internet Res.

Abstract

Background: Allergic rhinitis (AR) is a chronic disease, and several risk factors predispose individuals to the condition in their daily lives, including exposure to allergens and inhalation irritants. Analyzing the potential risk factors that can trigger AR can provide reference material for individuals to use to reduce its occurrence in their daily lives. Nowadays, social media is a part of daily life, with an increasing number of people using at least 1 platform regularly. Social media enables users to share experiences among large groups of people who share the same interests and experience the same afflictions. Notably, these channels promote the ability to share health information.

Objective: This study aims to construct an intelligent method (TopicS-ClusterREV) for identifying the risk factors of AR based on these social media comments. The main questions were as follows: How many comments contained AR risk factor information? How many categories can these risk factors be summarized into? How do these risk factors trigger AR?

Methods: We crawled all content posted under the allergic rhinitis topic on Zhihu from May 2012 to May 2022, obtaining a total of 9628 posts and 33,747 comments. We improved the Skip-gram model to train topic-enhanced word vector representations (TopicS) and then vectorized annotated text items for training the risk factor classifier. Furthermore, cluster analysis enabled a closer look at the opinions expressed in each category, giving insight into how risk factors trigger AR.

Results: Our classifier identified more comments containing risk factors than the other classification models, with an accuracy rate of 96.1% and a recall rate of 96.3%. In general, we clustered texts containing risk factors into 28 categories, with season, region, and mites being the most common risk factors. We gained insight into the risk factors expressed in each category; for example, seasonal changes and increased temperature differences between day and night can disrupt the body's immune system and lead to the development of allergies.

Conclusions: Our approach can handle this volume of data and extract risk factors effectively. Moreover, the summary of risk factors can serve as a reference for individuals to reduce AR in their daily lives. The experimental data also suggest a potential pathway by which AR is triggered. This finding can guide the development of management plans and interventions for AR.

Keywords: chronic disease management; disease risk factor identification; social media platforms; text mining; topic-enhanced word embedding.
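
The two-task setup described in the Methods and in the Figure 3 caption (a one-hot center word as input, context words as the first output, and the center word's topic as the second output) can be illustrated with a small sketch. This is not the authors' implementation: the function name `build_targets`, the `word_topic` lookup, the window size, and the toy sentence are all assumptions made for illustration, and the sketch only builds training targets rather than training a model.

```python
from typing import Dict, List, Tuple

Example = Tuple[List[int], List[int], List[int]]


def build_targets(
    tokens: List[str],
    vocab: List[str],
    word_topic: Dict[str, int],  # hypothetical word -> topic id lookup
    n_topics: int,
    window: int = 2,
) -> List[Example]:
    """Return (center one-hot, context multi-hot, topic one-hot) per token."""
    index = {word: i for i, word in enumerate(vocab)}
    examples: List[Example] = []
    for pos, word in enumerate(tokens):
        # Input: one-hot vector for the center word.
        center = [0] * len(vocab)
        center[index[word]] = 1

        # Task 1 target: 1 for every word inside the context window.
        context = [0] * len(vocab)
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        for j in range(lo, hi):
            if j != pos:
                context[index[tokens[j]]] = 1

        # Task 2 target: one-hot topic of the center word.
        # Topic 0 is assumed to mean "other" when the word has no topic.
        topic = [0] * n_topics
        topic[word_topic.get(word, 0)] = 1

        examples.append((center, context, topic))
    return examples


# Toy example; topic 1 is assumed to be the "risk factor" topic.
tokens = ["spring", "pollen", "triggers", "sneezing"]
vocab = sorted(set(tokens))  # ['pollen', 'sneezing', 'spring', 'triggers']
word_topic = {"spring": 1, "pollen": 1}
examples = build_targets(tokens, vocab, word_topic, n_topics=2, window=1)
center, context, topic = examples[0]  # targets for "spring"
```

In a full model, both targets would share the same embedding layer so that the topic signal shapes the word vectors; how the two losses are weighted is not stated in the abstract.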

Conflict of interest statement

Conflicts of Interest: None declared.

Figures

Figure 1
Allergic rhinitis (AR) risk factor identification method based on the topic-enhanced word-embedding model (TopicS-ClusterREV). The figure shows the research framework of our study. The framework consists of 3 parts. The first part is data collection and processing aimed at obtaining a clean data set. The second part is risk factor identification, which includes the proposed TopicS method and training of a risk factor classifier. The third part is text clustering and keyword extraction, which uses the ClusterREV method to cluster identified risk factors and extract keywords from every category.
Figure 2
Examples of short text annotation. The figure shows examples of data labeled as 1 including the text and label. In the figure, phrases with a blue background indicate those that were specifically noted during manual annotation, and the presence of these marker phrases often suggests potential risk factors in the sentence. The yellow background highlights the risk factors in the text.
Figure 3
Topic-enhanced word-embedding model (TopicS). The figure illustrates the vector changes within the TopicS model. The rectangular boxes in both the input and output represent one-hot vectors. Within the input, the dark blue circles signify the center words, representing a value of 1, whereas the light blue circles denote other words in the training text, with a value of 0. For the output’s task 1, the dark blue circles depict context words surrounding the center word, signifying a value of 1, whereas the light blue circles represent noncontext words with a value of 0. The various colored circles in the output’s task 2 indicate the topics to which the center word belongs. If it pertains to the risk factor topic, it is marked by a dark blue circle, symbolizing a value of 1, whereas circles of other colors represent a value of 0.
Figure 4
Framework of the classification model with different word embeddings. This figure illustrates the TextCNN modeling process for text vectorization using both the skip-gram and TopicS techniques. In the example sentence, “spring” and “pollen” are highlighted as risk factors. These words are represented by blue squares in TopicS, indicating that TopicS incorporates topic information, unlike the skip-gram method. This topic information is subsequently carried through the convolution, max-pooling, and softmax steps to enhance the model’s classification capabilities.
Figure 5
Cluster method with review mechanisms (ClusterREV). This figure depicts the process of the ClusterREV algorithm. The rectangular boxes represent category state transitions. The circles below the rectangles indicate the texts awaiting clustering. The algorithm assesses the distance between the current text and existing categories, classifying the text based on the minimal distance and a set threshold. Once all texts have been clustered, texts within a solitary category undergo automatic review. Finally, we manually reviewed the clustering results.
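
The clustering flow in the Figure 5 caption (assign each text to the nearest existing category if it is within a threshold, otherwise open a new category, then automatically review single-text categories) can be sketched as below. This is a minimal sketch under stated assumptions, not the published ClusterREV algorithm: the Euclidean distance to a category centroid, the `review_threshold` parameter, and the rule that a singleton is merged into its nearest multi-text category during review are all assumptions, since the caption does not specify them. The final manual review step is not modeled.

```python
import math
from typing import List


def dist(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(vectors: List[List[float]]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cluster_with_review(
    vectors: List[List[float]],
    threshold: float,
    review_threshold: float,  # assumed, looser cutoff used only in review
) -> List[List[int]]:
    """Single-pass threshold clustering followed by a singleton review."""
    clusters: List[List[int]] = []
    for i, v in enumerate(vectors):
        best, best_d = None, float("inf")
        for c in clusters:
            d = dist(v, centroid([vectors[j] for j in c]))
            if d < best_d:
                best, best_d = c, d
        if best is not None and best_d <= threshold:
            best.append(i)        # join the nearest existing category
        else:
            clusters.append([i])  # open a new category

    # Review pass (assumed rule): try to merge each singleton into the
    # nearest multi-text category, using centroids from before the review.
    multi = [c for c in clusters if len(c) > 1]
    reviewed = [list(c) for c in multi]
    for c in clusters:
        if len(c) != 1:
            continue
        i = c[0]
        best, best_d = None, float("inf")
        for target, source in zip(reviewed, multi):
            d = dist(vectors[i], centroid([vectors[j] for j in source]))
            if d < best_d:
                best, best_d = target, d
        if best is not None and best_d <= review_threshold:
            best.append(i)
        else:
            reviewed.append([i])  # keep as its own category
    return reviewed


# Toy 2-D "text vectors"; real inputs would be sentence embeddings.
points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [0.6, 0.0]]
result = cluster_with_review(points, threshold=0.5, review_threshold=1.0)
```

In this toy run the last point first lands in its own category because it sits just outside the threshold, and the review pass then folds it into the nearby category. Single-pass clustering of this kind depends on input order, which is one reason a review step is useful.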
