2022 Sep 3;11(3):41.
doi: 10.3390/biotech11030041.

Investigating Topic Modeling Techniques to Extract Meaningful Insights in Italian Long COVID Narration

Ileana Scarpino et al. BioTech (Basel).

Abstract

Through an adequate survey of the history of the disease, Narrative Medicine (NM) aims to allow the definition and implementation of an effective, appropriate, and shared treatment path. The present study compares different topic modeling techniques, such as Latent Dirichlet Allocation (LDA) and topic modeling based on the BERT transformer, to extract meaningful insights from Italian narratives of the COVID-19 pandemic. In particular, the main focus was the characterization of writings on Post-acute Sequelae of COVID-19 (PASC) as opposed to writings by health professionals and general reflections on COVID-19 (non-PASC), modeled as a semi-supervised task. The results show that the BERTopic-based approach outperforms the LDA-based approach, grouping 97.26% of the analyzed documents in the same cluster and reaching an overall accuracy of 91.97%.
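As an illustration of how an accuracy figure can be derived from topic assignments when class labels (PASC or non-PASC) are known, the following minimal sketch maps each topic to its most frequent class and scores that mapping. This is an assumption about one common evaluation scheme, not necessarily the procedure used in the study; the function name `majority_class_accuracy` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def majority_class_accuracy(topics, labels):
    """Map each topic to its most frequent class, then score that mapping.

    topics: topic id assigned to each document by a topic model
    labels: known class of each document (e.g. "PASC" / "non-PASC")
    Returns the topic-to-class mapping and the resulting accuracy.
    """
    # Count how often each class occurs inside each topic.
    counts_by_topic = defaultdict(Counter)
    for topic, label in zip(topics, labels):
        counts_by_topic[topic][label] += 1

    # Each topic "predicts" its majority class.
    topic_to_class = {
        topic: counts.most_common(1)[0][0]
        for topic, counts in counts_by_topic.items()
    }

    # A document counts as correct when its topic's class matches its label.
    correct = sum(topic_to_class[t] == y for t, y in zip(topics, labels))
    return topic_to_class, correct / len(labels)


# Toy data, purely hypothetical: eight documents spread over three topics.
topics = [0, 0, 0, 1, 1, 1, 1, 2]
labels = ["PASC", "PASC", "non-PASC",
          "non-PASC", "non-PASC", "non-PASC", "PASC",
          "PASC"]

mapping, acc = majority_class_accuracy(topics, labels)
print(mapping)
print(f"accuracy = {acc:.2%}")
```

In this toy example six of the eight documents fall in a topic whose majority class matches their own label, so the accuracy is 75%. The topic ids could come from any topic model, whether LDA or BERTopic.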

Keywords: BERTopic; LDA; narrative medicine; text mining; topic modeling.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
Histogram of word count per document for the entire dataset.
Figure 2
Histogram of word count per document and class.
Figure 3
Word cloud showing the most frequent tokens in the documents after preprocessing. Tokens with the largest font size are the most frequent.
Figure 4
Bar plot of keywords and weights for each topic modeled through the LDA approach.
Figure 5
Bar chart describing the distribution of topics modeled through the LDA approach with respect to the PASC and non-PASC classes.
Figure 6
Bar plot of keywords and weights for each topic modeled through the BERTopic approach. Topic -1 represents a virtual topic for documents not assigned to any other topic.
Figure 7
Bar chart describing the distribution of topics modeled through the BERTopic approach with respect to the PASC and non-PASC classes.

