Observational Study
JMIR Public Health Surveill. 2022 Dec 30;8(12):e41529.
doi: 10.2196/41529.

Natural Language Processing for Improved Characterization of COVID-19 Symptoms: Observational Study of 350,000 Patients in a Large Integrated Health Care System

Deborah E Malden et al. JMIR Public Health Surveill.

Abstract

Background: Natural language processing (NLP) of unstructured text from electronic medical records (EMR) can improve the characterization of COVID-19 signs and symptoms, but large-scale studies demonstrating the real-world application and validation of NLP for this purpose are limited.

Objective: The aim of this paper is to assess the contribution of NLP when identifying COVID-19 signs and symptoms from EMR.

Methods: This study was conducted in Kaiser Permanente Southern California, a large integrated health care system, using data from all patients with positive SARS-CoV-2 laboratory tests from March 2020 to May 2021. An NLP algorithm was developed to extract free-text mentions from EMR of 12 established signs and symptoms of COVID-19: fever, cough, headache, fatigue, dyspnea, chills, sore throat, myalgia, anosmia, diarrhea, vomiting or nausea, and abdominal pain. The proportion of patients reporting each symptom and the corresponding onset dates were described before and after supplementing structured EMR data with NLP-extracted signs and symptoms. A random sample of 100 chart-reviewed and adjudicated SARS-CoV-2-positive cases was used to validate the algorithm's performance.
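To make the idea of extracting symptoms from free text concrete, the following is a minimal sketch of a rule-based matcher with naive negation handling. It is not the study's algorithm, which the abstract does not describe; the keyword patterns, the negation cues, and the function name `extract_symptoms` are all assumptions chosen for illustration.

```python
import re

# Hypothetical keyword patterns; the study's actual lexicon is not given in the abstract.
SYMPTOM_PATTERNS = {
    "fever": r"\b(fever|febrile)\b",
    "cough": r"\bcough(ing)?\b",
    "dyspnea": r"\b(dyspnea|shortness of breath)\b",
    "anosmia": r"\b(anosmia|loss of smell)\b",
}

# Hypothetical negation cues, checked only in the text that precedes a match
# within the same sentence.
NEGATION = re.compile(r"\b(no|denies|denied|without|negative for)\b", re.IGNORECASE)


def extract_symptoms(note: str) -> set[str]:
    """Return the symptoms mentioned in a note without a preceding negation cue."""
    found = set()
    # Split on sentence-like boundaries so a negation does not leak across sentences.
    for sentence in re.split(r"[.;\n]", note):
        for symptom, pattern in SYMPTOM_PATTERNS.items():
            match = re.search(pattern, sentence, re.IGNORECASE)
            if match and not NEGATION.search(sentence[: match.start()]):
                found.add(symptom)
    return found


note = "Patient reports fever and cough. Denies shortness of breath."
print(sorted(extract_symptoms(note)))
```

A matcher this simple treats every symptom that follows a negation cue in the same sentence as negated, so "no fever but has cough" would miss the cough. Handling such cases is one reason a production algorithm needs validation against chart review, as described in the Methods.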

Results: A total of 359,938 patients (mean age 40.4 [SD 19.2] years; 191,630/359,938, 53% female) with confirmed SARS-CoV-2 infection were identified over the study period. The most common signs and symptoms identified through NLP-supplemented analyses were cough (220,631/359,938, 61%), fever (185,618/359,938, 52%), myalgia (153,042/359,938, 43%), and headache (144,705/359,938, 40%). The NLP algorithm identified an additional 55,568 (15%) symptomatic cases that were previously defined as asymptomatic using structured data alone. The proportion of additional cases identified in NLP-supplemented analysis varied across the selected symptoms, from 29% (63,742/220,631) of all records with cough to 64% (38,884/60,865) of all records with nausea or vomiting. Among the 295,305 symptomatic patients, the median time from symptom onset to testing was 3 days using structured data alone, whereas the NLP algorithm identified signs or symptoms approximately 1 day earlier. When validated against chart-reviewed cases, the NLP algorithm successfully identified signs and symptoms with consistently high sensitivity (ranging from 87% to 100%) and specificity (94% to 100%).
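The reported percentages can be recomputed directly from the counts given in the Results, as in the short sketch below. The counts come from the abstract; the helper names `pct`, `sensitivity`, and `specificity` are illustrative, and the confusion-matrix counts in the final lines are hypothetical, since the abstract reports only the resulting ranges.

```python
TOTAL = 359_938  # patients with confirmed SARS-CoV-2 infection (from the abstract)

# Counts reported in the abstract, each out of TOTAL.
counts = {
    "female": 191_630,
    "cough": 220_631,
    "fever": 185_618,
    "myalgia": 153_042,
    "headache": 144_705,
    "newly symptomatic via NLP": 55_568,
}


def pct(numerator: int, denominator: int) -> int:
    """Percentage rounded to the nearest whole number."""
    return round(100 * numerator / denominator)


percentages = {name: pct(n, TOTAL) for name, n in counts.items()}

# Share of each symptom's records that were found only after adding NLP.
nlp_only_share = {
    "cough": pct(63_742, 220_631),
    "nausea or vomiting": pct(38_884, 60_865),
}


def sensitivity(tp: int, fn: int) -> float:
    """True positives over all chart-confirmed positives."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """True negatives over all chart-confirmed negatives."""
    return tn / (tn + fp)


print(percentages)
print(nlp_only_share)
# Hypothetical counts for one symptom in a 100-case validation sample.
print(sensitivity(tp=45, fn=5), specificity(tn=47, fp=3))
```

Each recomputed value matches the rounded percentage stated in the Results.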

Conclusions: These findings demonstrate that NLP can identify and characterize a broad set of COVID-19 signs and symptoms from unstructured EMR data with enhanced detail and timeliness compared with structured data alone.

Keywords: COVID-19; NLP; application; artificial intelligence; cough; data; disease characterization; fever; headache; natural language processing; surveillance; symptoms.

Conflict of interest statement

Conflicts of Interest: SYT received a grant from Roche/Genentech, Inc. to support this work. SYT, BKA, VH, JS, VY, LQ, HF, SFS, SC, and FX received support for research time with this funding. VY works for Roche-Genentech. The funder had no role in the design, conduct, or analysis of this study, or in manuscript development.

Figures

Figure 1
Flow diagram describing the natural language processing algorithm for detecting signs and symptoms of COVID-19. EMR: electronic medical records.
Figure 2
A comparison between structured and unstructured data. (A) Proportion of patients with SARS-CoV-2 with identified selected symptoms reported through structured and unstructured electronic medical records (EMR) data, by sign or symptom. (B) Days between testing and reported symptom onset before and after supplementing structured data with unstructured data (this includes ICD-10 codes, COVID-19 test-related questionnaires, and symptoms collected via keywords or phrases). ICD: International Classification of Diseases.


Publication types