Am J Epidemiol. 2025 Nov 4;194(11):3348-3354.
doi: 10.1093/aje/kwaf162.

Natural language processing improves reliable identification of COVID-19 compared to diagnostic codes alone

Nathaniel Hendrix et al. Am J Epidemiol.

Abstract

Observational COVID-19 studies often rely on diagnostic codes, but their accuracy and potential for differential misclassification across patient subgroups are unclear. In this proof-of-concept study, we examined age, race, and ethnicity as predictors of differential misclassification by comparing the classification accuracy of diagnostic codes with that of classifiers based on natural language processing (NLP) of clinical notes. We assessed differential misclassification in two primary care-based samples from the American Family Cohort: first, a cohort of 5000 patients with COVID-19 status assessed by physicians based on notes; and second, 21 659 patients (of 1 560 564) who received COVID-specific antivirals. Using annotated note data, we trained and tested three NLP classifiers (tree-based, recurrent neural network, and transformer-based). Approximately 63% of likely COVID-19 patients in the two samples had a documented ICD-10 code for COVID-19. Sensitivity was highest among younger patients (68.6% for <18 years versus 60.6% for those 75+), and for Hispanic patients (68.0% versus 58.5% for Black/African American patients). The tree-based classifier had the highest area under the ROC curve (0.92), although it was less accurate among older patients. NLP performance worsened drastically when predicting data collected after training. While NLP may improve cohort identification, frequent retraining is likely needed to capture changing documentation.
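
The subgroup comparison described in the abstract can be illustrated with a minimal sketch of how sensitivity per subgroup might be computed. This is not the authors' code: the function name `sensitivity_by_group`, the record layout, and the toy records are all hypothetical, and the numbers are invented for illustration rather than taken from the study.

```python
from collections import defaultdict


def sensitivity_by_group(records):
    """Return sensitivity of a diagnostic code within each subgroup.

    records: iterable of (group, has_covid, has_code) tuples, where
    has_covid is the reference-standard status and has_code says whether
    a diagnostic code was documented. (Hypothetical layout.)
    """
    true_positives = defaultdict(int)
    positives = defaultdict(int)
    for group, has_covid, has_code in records:
        if has_covid:  # sensitivity only considers reference-positive patients
            positives[group] += 1
            if has_code:
                true_positives[group] += 1
    return {g: true_positives[g] / positives[g] for g in positives}


# Toy records, invented for illustration (not study data).
records = [
    ("<18", True, True), ("<18", True, True), ("<18", True, False),
    ("<18", False, False),
    ("75+", True, True), ("75+", True, True),
    ("75+", True, False), ("75+", True, False),
]
result = sensitivity_by_group(records)
```

A gap between subgroups in this quantity is what differential misclassification means here: the code misses true cases at different rates depending on who the patient is.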

Keywords: COVID-19; cohort identification; natural language processing; sample sizes.

Figures

Figure 1: Receiver operating characteristic curve for the three NLP classifiers. AUC = Area Under Curve, RNN = Recurrent Neural Network.
Figure 2: ROC curves of the XGBoost classifier based on training datasets of different sizes. Training data of 3000 samples or fewer uses 100 bootstrap samples to determine the mean ROC curve (confidence intervals omitted for clarity, but available in text).
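
As a rough illustration of averaging ROC performance over bootstrap samples, the sketch below computes the area under the ROC curve from ranked scores and averages it over 100 resamples. It is a simplification and not the procedure in the paper: it resamples already-scored examples instead of retraining a classifier on resampled training data, it summarizes each resample by AUC instead of a full curve, and the names `auc` and `bootstrap_mean_auc` are hypothetical.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_mean_auc(labels, scores, n_boot=100, seed=0):
    """Mean AUC over n_boot resamples drawn with replacement."""
    if len(set(labels)) < 2:
        raise ValueError("both classes must be present")
    rng = random.Random(seed)
    indices = range(len(labels))
    aucs = []
    while len(aucs) < n_boot:
        sample = [rng.choice(indices) for _ in indices]
        ys = [labels[i] for i in sample]
        if len(set(ys)) < 2:
            continue  # skip resamples that lost one of the classes
        aucs.append(auc(ys, [scores[i] for i in sample]))
    return sum(aucs) / len(aucs)


# Toy labels and scores, invented for illustration.
toy_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
separable_mean = bootstrap_mean_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
```

Fixing the seed keeps the resampling reproducible, which matters when comparing mean curves across training-set sizes.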
