Natural Language Processing for Improved Characterization of COVID-19 Symptoms: Observational Study of 350,000 Patients in a Large Integrated Health Care System
- PMID: 36446133
- PMCID: PMC9822566
- DOI: 10.2196/41529
Abstract
Background: Natural language processing (NLP) of unstructured text from electronic medical records (EMR) can improve the characterization of COVID-19 signs and symptoms, but large-scale studies demonstrating the real-world application and validation of NLP for this purpose are limited.
Objective: The aim of this paper is to assess the contribution of NLP when identifying COVID-19 signs and symptoms from EMR.
Methods: This study was conducted in Kaiser Permanente Southern California, a large integrated health care system, using data from all patients with positive SARS-CoV-2 laboratory tests from March 2020 to May 2021. An NLP algorithm was developed to extract mentions of 12 established signs and symptoms of COVID-19 from free text in the EMR: fever, cough, headache, fatigue, dyspnea, chills, sore throat, myalgia, anosmia, diarrhea, vomiting or nausea, and abdominal pain. The proportion of patients reporting each symptom and the corresponding onset dates were described before and after supplementing structured EMR data with NLP-extracted signs and symptoms. A random sample of 100 chart-reviewed and adjudicated SARS-CoV-2-positive cases was used to validate the algorithm performance.
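The abstract does not describe how the NLP algorithm works internally, so the following is only a minimal illustrative sketch of one common approach to this kind of task: keyword matching with a crude negation check. The lexicon `SYMPTOM_TERMS`, the cue list `NEGATION_CUES`, and the function `extract_symptoms` are hypothetical names introduced here and are not taken from the study.

```python
import re

# Hypothetical lexicon covering 3 of the 12 symptoms; a real lexicon
# would be far larger and clinically curated.
SYMPTOM_TERMS = {
    "fever": ["fever", "febrile"],
    "cough": ["cough", "coughing"],
    "dyspnea": ["dyspnea", "shortness of breath"],
}

# Hypothetical negation cues. Any cue appearing earlier in the same
# sentence suppresses the mention, which is deliberately crude.
NEGATION_CUES = ["no", "denies", "without", "negative for"]


def extract_symptoms(note):
    """Return the set of symptoms mentioned without a preceding negation cue."""
    found = set()
    # Split the note into rough sentences on periods, semicolons, and newlines.
    for sentence in re.split(r"[.;\n]", note.lower()):
        for symptom, terms in SYMPTOM_TERMS.items():
            for term in terms:
                match = re.search(r"\b" + re.escape(term) + r"\b", sentence)
                if match is None:
                    continue
                preceding = sentence[: match.start()]
                negated = any(
                    re.search(r"\b" + re.escape(cue) + r"\b", preceding)
                    for cue in NEGATION_CUES
                )
                if not negated:
                    found.add(symptom)
    return found


# Supplementing structured data with NLP output amounts to a set union.
structured = {"fever"}  # hypothetical coded symptoms for one patient
nlp = extract_symptoms("Patient reports fever and cough. Denies shortness of breath.")
combined = structured | nlp
```

In this example, `combined` contains fever and cough, while the negated mention of shortness of breath is excluded. A sentence-level negation rule like this one would misclassify text such as "no fever but has cough", which is one reason validation against chart review matters.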
Results: A total of 359,938 patients (mean age 40.4 [SD 19.2] years; 191,630/359,938, 53% female) with confirmed SARS-CoV-2 infection were identified over the study period. The most common signs and symptoms identified through NLP-supplemented analyses were cough (220,631/359,938, 61%), fever (185,618/359,938, 52%), myalgia (153,042/359,938, 43%), and headache (144,705/359,938, 40%). The NLP algorithm identified an additional 55,568 (15%) symptomatic cases that were previously defined as asymptomatic using structured data alone. The proportion of additional cases with each selected symptom identified in NLP-supplemented analysis varied across the selected symptoms, from 29% (63,742/220,631) of all records for cough to 64% (38,884/60,865) of all records with nausea or vomiting. Of the 295,305 symptomatic patients, the median time from symptom onset to testing was 3 days using structured data alone, whereas the NLP algorithm identified signs or symptoms approximately 1 day earlier. When validated against chart-reviewed cases, the NLP algorithm successfully identified signs and symptoms with consistently high sensitivity (ranging from 87% to 100%) and specificity (94% to 100%).
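The percentages reported above follow directly from the stated counts, and the validation metrics follow the standard definitions of sensitivity and specificity. The sketch below recomputes the proportions from the abstract's own numbers; the helper names `pct`, `sensitivity`, and `specificity` are introduced here for illustration, and the confusion-matrix counts in the final lines are hypothetical, since the abstract reports only the resulting ranges.

```python
def pct(numerator, denominator):
    """Percentage rounded to the nearest whole number."""
    return round(100 * numerator / denominator)


TOTAL = 359_938  # patients with confirmed SARS-CoV-2 infection

# Counts taken from the Results paragraph (NLP-supplemented analyses).
symptom_counts = {
    "cough": 220_631,
    "fever": 185_618,
    "myalgia": 153_042,
    "headache": 144_705,
}
symptom_pct = {name: pct(n, TOTAL) for name, n in symptom_counts.items()}

# Share of the cohort reclassified from asymptomatic to symptomatic by NLP.
reclassified_pct = pct(55_568, TOTAL)

# Share of each symptom's records that came only from NLP.
cough_nlp_only_pct = pct(63_742, 220_631)
nausea_vomiting_nlp_only_pct = pct(38_884, 60_865)


def sensitivity(true_pos, false_neg):
    """Fraction of chart-confirmed symptom cases that the algorithm flagged."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of chart-confirmed non-cases that the algorithm left unflagged."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts for one symptom in a 100-chart validation sample.
example_sensitivity = sensitivity(true_pos=45, false_neg=5)
example_specificity = specificity(true_neg=47, false_pos=3)
```

With these inputs, the recomputed proportions match the rounded figures in the text (for example, 61% for cough and 15% reclassified).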
Conclusions: These findings demonstrate that NLP can identify and characterize a broad set of COVID-19 signs and symptoms from unstructured EMR data with enhanced detail and timeliness compared with structured data alone.
Keywords: COVID-19; NLP; application; artificial intelligence; cough; data; disease characterization; fever; headache; natural language processing; surveillance; symptoms.
©Deborah E Malden, Sara Y Tartof, Bradley K Ackerson, Vennis Hong, Jacek Skarbinski, Vincent Yau, Lei Qian, Heidi Fischer, Sally F Shaw, Susan Caparosa, Fagen Xie. Originally published in JMIR Public Health and Surveillance (https://publichealth.jmir.org), 30.12.2022.
Conflict of interest statement
Conflicts of Interest: SYT received a grant from Roche/Genentech, Inc. to support this work. SYT, BKA, VH, JS, VY, LQ, HF, SFS, SC, and FX received support for research time with this funding. VY works for Roche/Genentech. The funder had no role in the design, conduct, or analysis of this study, or in manuscript development.