Identifying COVID-19 cases and extracting patient reported symptoms from Reddit using natural language processing

Sci Rep. 2023 Aug 22;13(1):13721.

Muzhe Guo¹, Yong Ma², Efe Eworuke³, Melissa Khashei⁴, Jaejoon Song², Yueqin Zhao², Fang Jin⁵

Affiliations

¹ Department of Statistics, George Washington University, 2121 I St NW, Washington, DC, 20052, USA.
² Office of Biostatistics, Office of Translational Sciences, Center for Drug Evaluation and Research, Food and Drug Administration (FDA), 10903 New Hampshire Avenue, Silver Spring, MD, 20993, USA.
³ Epidemiology and Drug Safety, IQVIA Real World Solutions, Durham, USA.
⁴ Division of Epidemiology II, Office of Pharmacovigilance and Epidemiology, Office of Surveillance and Epidemiology, Center for Drug Evaluation and Research, Food and Drug Administration (FDA), 10903 New Hampshire Avenue, Silver Spring, MD, 20993, USA.
⁵ Department of Statistics, George Washington University, 2121 I St NW, Washington, DC, 20052, USA. fangjin@gwu.edu.

PMID: 37607963
PMCID: PMC10444846
DOI: 10.1038/s41598-023-39986-7

Muzhe Guo et al. Sci Rep. 2023.

Abstract

We used social media data from the "covid19positive" subreddit, posted between 03/2020 and 03/2022, to identify COVID-19 cases and automatically extract their reported symptoms using natural language processing (NLP). We trained a Bidirectional Encoder Representations from Transformers (BERT) classification model with chunking to identify COVID-19 cases, and we developed a novel QuadArm model, which incorporates Question-answering, dual-corpus expansion, Adaptive rotation clustering, and mapping, to extract symptoms. Our classification model achieved 91.2% accuracy for the early period (03/2020-05/2020) and was then applied to the Delta (07/2021-09/2021) and Omicron (12/2021-03/2022) periods for case identification. We identified 310, 8,794, and 12,094 COVID-positive authors in the three periods, respectively. The five most common symptoms extracted in the early period were coughing (57%), fever (55%), loss of sense of smell (41%), headache (40%), and sore throat (40%). During the Delta period, these remained the top five symptoms, but the percentage of authors reporting each fell to half or less of its early-period value. During the Omicron period, loss of sense of smell was reported less often and sore throat more often. Our study demonstrates that NLP can be used to identify COVID-19 cases accurately and to extract symptoms efficiently.

Conflict of interest statement

The authors declare no competing interests.

Figures

**Figure 1**
Classification model overview. The first row shows that we convert the original posts into chunks and feed them into the BERT-Large for Sequence Classification model to get chunk scores. The second row shows that the weights of chunk scores are trained with deep neural networks, and the final prediction is obtained by weighting chunk scores.
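
The chunk-then-weight idea in this caption can be illustrated with a minimal sketch. It is not the authors' implementation: the overlapping stride, the softmax normalization, and the names `chunk_tokens` and `weighted_post_score` are assumptions made for illustration. In the paper the chunk scores come from BERT-Large and the weights are learned by a neural network; here both are passed in as plain numbers.

```python
import math


def chunk_tokens(tokens, chunk_size=6, stride=4):
    """Split a token list into chunks of at most chunk_size tokens.

    Consecutive chunks start `stride` tokens apart, so they overlap when
    stride < chunk_size (the overlap is an assumption of this sketch).
    """
    if chunk_size <= 0 or stride <= 0:
        raise ValueError("chunk_size and stride must be positive")
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
        start += stride
    return chunks


def weighted_post_score(chunk_scores, weight_logits):
    """Combine per-chunk scores into one post-level score.

    The weight logits stand in for learned weights; they are normalized
    with a softmax (an assumption) and used for a weighted average.
    """
    if not chunk_scores or len(chunk_scores) != len(weight_logits):
        raise ValueError("need one weight logit per chunk score")
    peak = max(weight_logits)
    exps = [math.exp(w - peak) for w in weight_logits]
    total = sum(exps)
    return sum(s * e / total for s, e in zip(chunk_scores, exps))


# Hypothetical usage: ten token ids, two made-up chunk scores.
example_chunks = chunk_tokens(list(range(10)), chunk_size=6, stride=4)
example_score = weighted_post_score([0.2, 0.8], [0.0, 0.0])
```

With equal weight logits the post score reduces to the plain mean of the chunk scores; unequal logits let some chunks dominate the final prediction.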

**Figure 2**
QuadArm model overview. Model inputs are positive posts and outputs are symptoms named by the UMLS. The model consists of four steps: BERT/BioBERT for question answering, dual-corpus expansion, adaptive rotation clustering, and mapping.

**Figure 3**
Daily trends in the number of COVID-19 cases reported to the CDC and the number of cases we extracted, for the three corresponding periods.

**Figure 4**
Symptom clustering and trending based on our model QuadArm with BioBERT. Panel (a) shows the clustering results of COVID-19 symptoms through t-SNE visualization for the SARS-CoV-2 early period, Delta period, and Omicron period, respectively. Panel (b) is a ThemeRiver plot showing the change of ten common symptom frequencies over time in the three COVID-19 periods.

**Figure 5**
Comparison of symptoms extracted by our model (QuadArm with BioBERT) in the two virus variant periods. Panel (a) shows the top 15 commonly reported symptoms for each period. Panel (b) includes two chord diagrams showing the co-occurrence of symptoms in each period. The width of the connection between two symptoms represents the number of authors reporting both symptoms.
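
The connection widths described for panel (b) amount to counting, for each pair of symptoms, the authors who report both. A minimal sketch of that count follows; the function name `symptom_cooccurrence` and the toy author data are hypothetical and are not taken from the study.

```python
from collections import Counter
from itertools import combinations


def symptom_cooccurrence(author_symptoms):
    """Count, for each unordered symptom pair, the authors reporting both.

    author_symptoms maps an author id to an iterable of symptom names.
    Each author contributes at most once per pair, even if a symptom is
    listed more than once.
    """
    pair_counts = Counter()
    for symptoms in author_symptoms.values():
        # Sort so that each pair always appears in the same order.
        for first, second in combinations(sorted(set(symptoms)), 2):
            pair_counts[(first, second)] += 1
    return pair_counts


# Toy data, invented for illustration only.
toy_authors = {
    "author_a": ["fever", "cough"],
    "author_b": ["cough", "fever", "headache"],
    "author_c": ["cough"],
}
toy_counts = symptom_cooccurrence(toy_authors)
```

In a chord diagram, each count in `toy_counts` would set the width of the band joining the two symptoms of that pair.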

**Figure 6**
A closer look at the COVID-19 symptom corpus system (Omicron variant period). The left column is the COVID-19 keyword corpus obtained from our dual-corpus expansion method. The middle column shows the refined symptoms after applying the keyword corpus to the marked answers. The right column shows the final standardized medical symptom names obtained by mapping to the UMLS. The width of each connection line represents the number of corresponding authors.
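
The three-column flow in this caption (keywords, refined symptoms, standardized names) can be sketched as a pair of dictionary lookups. This is a toy illustration under stated assumptions: plain substring matching stands in for the paper's keyword corpus, the two dictionaries are hand-written, and the standardized labels are placeholders that were not looked up in the UMLS. The name `refine_and_map` is hypothetical.

```python
def refine_and_map(answers, keyword_to_symptom, symptom_to_standard):
    """Return {standardized name: number of authors reporting it}.

    answers maps an author id to the free-text answer extracted for that
    author. A keyword matches when it occurs as a substring of the
    lower-cased answer (an assumption of this sketch).
    """
    counts = {}
    for answer in answers.values():
        text = answer.lower()
        found = set()
        for keyword, symptom in keyword_to_symptom.items():
            if keyword in text:
                # Fall back to the refined symptom if no mapping exists.
                found.add(symptom_to_standard.get(symptom, symptom))
        for name in found:
            counts[name] = counts.get(name, 0) + 1
    return counts


# Hypothetical keyword corpus and mapping, invented for illustration.
toy_keywords = {
    "cough": "coughing",
    "can't smell": "loss of smell",
    "throat": "sore throat",
}
toy_standard = {
    "coughing": "Coughing",
    "loss of smell": "Anosmia",
    "sore throat": "Sore Throat",
}
toy_answers = {
    "u1": "Dry cough and my throat hurts",
    "u2": "I can't smell anything, plus a cough",
}
toy_result = refine_and_map(toy_answers, toy_keywords, toy_standard)
```

The per-name author counts in `toy_result` correspond to the connection-line widths in the figure.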

Cited by

BERT-based language model for accurate drug adverse event extraction from social media: implementation, evaluation, and contributions to pharmacovigilance practices.
Dong F, Guo W, Liu J, Patterson TA, Hong H. Front Public Health. 2024 Apr 23;12:1392180. doi: 10.3389/fpubh.2024.1392180. eCollection 2024. PMID: 38716250.

Pharmacovigilance in the digital age: gaining insight from social media data.
Dong F, Guo W, Liu J, Patterson TA, Hong H. Exp Biol Med (Maywood). 2025 May 27;250:10555. doi: 10.3389/ebm.2025.10555. eCollection 2025. PMID: 40495881. Review.

LLM enabled classification of patient self-reported symptoms and needs in health systems across the USA.
Naved BA, Ravishankar S, Colbert GE, Johnston A, Slott QM, Luo Y. NPJ Digit Med. 2025 Jul 1;8(1):390. doi: 10.1038/s41746-025-01779-9. PMID: 40595018.
