Natural language processing improves reliable identification of COVID-19 compared to diagnostic codes alone
- PMID: 40731247
- PMCID: PMC12335755
- DOI: 10.1093/aje/kwaf162
Natural language processing improves reliable identification of COVID-19 compared to diagnostic codes alone
Abstract
Observational COVID-19 studies often rely on diagnostic codes, but their accuracy and potential for differential misclassification across patient subgroups are unclear. In this proof of concept study, we examined age, race, and ethnicity as predictors of differential misclassification by comparing the classification accuracy of diagnostic codes to classifiers based on natural language processing (NLP) of clinical notes. We assessed differential misclassification in two primary care-based samples from the American Family Cohort: first, a cohort of 5000 patients with COVID-19 status assessed by physicians based on notes; and second, 21 659 patients (of 1 560 564) who received COVID-specific antivirals. Using annotated note data, we trained and tested three NLP classifiers (tree-based, recurrent neural network, and transformer-based). Approximately 63% of likely COVID-19 patients in the two samples had a documented ICD-10 code for COVID-19. Sensitivity was highest among younger patients (68.6% for <18 years versus 60.6% for those 75+), and for Hispanic patients (68.0% vs 58.5% for Black/African American patients). The tree-based classifier had the highest area under the ROC curve (0.92), although it was less accurate among older patients. NLP performance drastically worsened predicting data collected post-training. While NLP may improve cohort identification, frequent retraining is likely needed to capture changing documentation.
Keywords: COVID-19; cohort identification; natural language processing; sample sizes.
© The Author(s) 2025. Published by Oxford University Press on behalf of the Johns Hopkins Bloomberg School of Public Health. All rights reserved. For commercial re-use, please contact reprints@oup.com for reprints and translation rights for reprints. All other permissions can be obtained through our RightsLink service via the Permissions link on the article page on our site-for further information please contact journals.permissions@oup.com.
Figures
References
MeSH terms
Grants and funding
LinkOut - more resources
Full Text Sources
Medical
