Lancet Digit Health. 2022 Jul;4(7):e532-e541.

doi: 10.1016/S2589-7500(22)00048-6. Epub 2022 May 16.

Identifying who has long COVID in the USA: a machine learning approach using N3C data

Collaborators

N3C Consortium:
Carolyn Bramante, David Dorr, Michele Morris, Ann M Parker, Hythem Sidky, Ken Gersing, Stephanie Hong, Emily Niehaus

Affiliations

¹ Department of Medicine, UNC Chapel Hill School of Medicine, Chapel Hill, NC, USA. Electronic address: epfaff@email.unc.edu.
² Palantir Technologies, Denver, CO, USA.
³ Section of Informatics and Data Science, Department of Pediatrics, University of Colorado Anschutz Medical Campus, Aurora, CO, USA; Section of Critical Care Medicine, Department of Pediatrics, University of Colorado Anschutz Medical Campus, Aurora, CO, USA.
⁴ Carolina Health Informatics Program, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.
⁵ Colorado Center for Personalised Medicine, Division of Biomedical Informatics & Personalized Medicine, Department of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA.
⁶ Department of Nutrition, Metabolism, and Rehabilitation Sciences, University of Texas Medical Branch, Galveston, TX, USA.
⁷ Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Anschutz Medical Campus, Aurora, CO, USA.
⁸ Division of Pulmonary and Critical Care Medicine, Department of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA.
⁹ Section of Informatics and Data Science, Department of Pediatrics, University of Colorado Anschutz Medical Campus, Aurora, CO, USA.
¹⁰ The OHDSI Center at the Roux Institute, Northeastern University, Portland, ME, USA.
¹¹ Center for Health AI, University of Colorado Anschutz Medical Campus, Aurora, CO, USA.
¹² Department of Biomedical Informatics, Stony Brook Cancer Center, Stony Brook University, Stony Brook, NY, USA.
¹³ Section of Biomedical Informatics and Data Science, Johns Hopkins University, Baltimore, MD, USA.

PMID: 35589549
PMCID: PMC9110014
DOI: 10.1016/S2589-7500(22)00048-6

Emily R Pfaff et al. Lancet Digit Health. 2022 Jul.

Abstract

Background: Post-acute sequelae of SARS-CoV-2 infection, known as long COVID, have severely affected recovery from the COVID-19 pandemic for patients and society alike. Long COVID is characterised by evolving, heterogeneous symptoms, making it challenging to derive an unambiguous definition. Studies of electronic health records are a crucial element of the US National Institutes of Health's RECOVER Initiative, which is addressing the urgent need to understand long COVID, identify treatments, and accurately identify who has it; the latter is the aim of this study.

Methods: Using the National COVID Cohort Collaborative's (N3C) electronic health record repository, we developed XGBoost machine learning models to identify potential patients with long COVID. We defined our base population (n=1 793 604) as any non-deceased adult patient (age ≥18 years) with either an International Classification of Diseases-10-Clinical Modification COVID-19 diagnosis code (U07.1) from an inpatient or emergency visit, or a positive SARS-CoV-2 PCR or antigen test, and for whom at least 90 days have passed since COVID-19 index date. We examined demographics, health-care utilisation, diagnoses, and medications for 97 995 adults with COVID-19. We used data on these features and 597 patients from a long COVID clinic to train three machine learning models to identify potential long COVID among all patients with COVID-19, patients hospitalised with COVID-19, and patients who had COVID-19 but were not hospitalised. Feature importance was determined via Shapley values. We further validated the models on data from a fourth site.
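The base-population criteria in the Methods can be sketched as a simple eligibility check. This is a minimal illustration, not the authors' N3C code: the `Patient` record, its field names, and the choice of the earliest qualifying event as the COVID-19 index date are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Patient:
    """Hypothetical patient record; field names are illustrative only."""
    age_years: int
    deceased: bool
    # Date of a U07.1 diagnosis recorded at an inpatient or emergency visit, if any.
    u071_acute_visit_date: Optional[date] = None
    # Date of a positive SARS-CoV-2 PCR or antigen test, if any.
    positive_test_date: Optional[date] = None


def covid_index_date(patient: Patient) -> Optional[date]:
    """Assumption: the index date is the earliest qualifying event."""
    candidates = [
        d for d in (patient.u071_acute_visit_date, patient.positive_test_date)
        if d is not None
    ]
    return min(candidates) if candidates else None


def in_base_population(patient: Patient, as_of: date, min_days: int = 90) -> bool:
    """Non-deceased adult with a qualifying event at least `min_days` before `as_of`."""
    index = covid_index_date(patient)
    if index is None or patient.deceased or patient.age_years < 18:
        return False
    return (as_of - index).days >= min_days
```

A patient with only a recent positive test (fewer than 90 days before `as_of`) is excluded until enough follow-up time has accrued.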

Findings: Our models identified, with high accuracy, patients who potentially have long COVID, achieving areas under the receiver operating characteristic curve of 0·92 (all patients), 0·90 (hospitalised), and 0·85 (non-hospitalised). Important features, as defined by Shapley values, include rate of health-care utilisation, patient age, dyspnoea, and other diagnosis and medication information available within the electronic health record.
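For readers unfamiliar with the metric reported above, the area under the ROC curve equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it directly from that definition; the function name is hypothetical and the pairwise loop is chosen for clarity, not speed.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")

    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))
```

A value of 0·5 corresponds to random ranking and 1·0 to perfect separation of the two classes.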

Interpretation: Patients identified by our models as potentially having long COVID can be interpreted as patients warranting care at a specialty clinic for long COVID, which is an essential proxy for long COVID diagnosis as its definition continues to evolve. We also achieve the urgent goal of identifying potential long COVID in patients for clinical trials. As more data sources are identified, our models can be retrained and tuned based on the needs of individual studies.

Funding: US National Institutes of Health and National Center for Advancing Translational Sciences through the RECOVER Initiative.

Conflict of interest statement

Declaration of interests: ATG is an employee of Palantir Technologies. ERP, JPD, SEJ, RRD, CGC, TDB, JAM, RM, AW, and MAH report research funding from the NIH. ERP and MGK report research funding from PCORI. MAH and JAM are co-founders of Pryzm Health. All other authors declare no competing interests.

Figures

**Figure 1**
Temporal windows for machine learning model inclusion. We searched for health-care visits, medical conditions, and prescription medication orders before and after each patient's COVID-19 index date, up to a maximum of 365 days post-index. We ignored all data occurring in a buffer period of 45 days before and after the COVID-19 index date to differentiate pre-COVID-19 and post-COVID-19 from acute COVID-19. For patients who attended a long COVID clinic, we ignored all data occurring on or after their first visit to such a clinic to avoid influencing the model with clinical observations occurring as a result of the patient's long COVID assessment.
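The windowing rules in the Figure 1 caption can be expressed as a small classifier over event dates. This is a sketch under stated assumptions: the function name and window labels are hypothetical, the buffer boundaries are treated as inclusive, and no limit is placed on the pre-index look-back because the caption does not give one.

```python
from datetime import date, timedelta
from typing import Optional


def classify_event(
    event_date: date,
    index_date: date,
    first_clinic_visit: Optional[date] = None,
    buffer_days: int = 45,
    max_post_days: int = 365,
) -> Optional[str]:
    """Return "pre", "post", or None if the event is ignored."""
    offset = (event_date - index_date).days

    # Acute-phase buffer around the index date (inclusive bounds are an assumption).
    if abs(offset) <= buffer_days:
        return None
    # Nothing beyond the maximum post-index look-forward is used.
    if offset > max_post_days:
        return None
    # Drop data recorded on or after the first long COVID clinic visit.
    if first_clinic_visit is not None and event_date >= first_clinic_visit:
        return None
    return "pre" if offset < 0 else "post"


# Usage: an outpatient visit 100 days after the index date falls in the post window.
index = date(2021, 1, 1)
window = classify_event(index + timedelta(days=100), index)
```

Applying the clinic-visit cut-off only to clinic patients mirrors the caption's stated goal of keeping assessment-driven observations out of the features.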

**Figure 2**
Machine learning model performance in identifying potential long COVID in patients. ROC curves, with 5-fold cross-validation and five repeats, identifying the ability of each of the three models (non-hospitalised, hospitalised, and all patients) to classify patients with long COVID as the discrimination threshold is varied. To emphasise recall of patients with potential long COVID, all models use a predicted probability threshold of 0·45 to generate the precision, recall, and F-score. The threshold can be adjusted to emphasise precision or recall, depending on the use case. AUROC=area under the receiver operating characteristic curve. ROC=receiver operating characteristic.
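The effect of the 0·45 probability threshold described in the Figure 2 caption can be illustrated with a short helper. This sketch assumes the F-score is F1 and that a probability equal to the threshold counts as positive; neither detail is stated in the caption, and the function name is hypothetical.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    labels: Sequence[int],
    probabilities: Sequence[float],
    threshold: float = 0.45,
) -> Tuple[float, float, float]:
    """Binarise predicted probabilities at `threshold` and score them."""
    predicted = [p >= threshold for p in probabilities]

    tp = sum(1 for y, hit in zip(labels, predicted) if y == 1 and hit)
    fp = sum(1 for y, hit in zip(labels, predicted) if y == 0 and hit)
    fn = sum(1 for y, hit in zip(labels, predicted) if y == 1 and not hit)

    # Metrics are reported as 0.0 when their denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Lowering the threshold flags more patients, which tends to raise recall at the cost of precision; raising it does the opposite.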

**Figure 3**
Most important model features associated with visits to a long COVID clinic. The top 20 features for each model are shown. Each point on the plot is a Shapley (importance) value for a single patient. The colour of each point represents the magnitude and direction of the value of that feature for that patient. The point's position on the horizontal axis represents the importance and direction of that feature for the prediction for that patient. Some features are important predictors in all models (eg, outpatient utilisation, dyspnoea, and COVID-19 vaccine), whereas others are specific to one or two of the models (eg, dyssomnia or dexamethasone). Conditions labelled as chronic were diagnosed in patients before their COVID-19 index. Diabetes was not separated by type. dx=diagnosis. med=medication.
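To make the per-patient Shapley values in Figure 3 more concrete, the sketch below computes exact Shapley values by brute force for a toy two-feature scoring function. It is an illustration of the definition only: the source does not say which implementation was used, the scoring function and its coefficients are invented, and replacing "absent" features with a baseline value is one common convention, assumed here.

```python
from itertools import permutations
from typing import Callable, Dict


def shapley_values(
    predict: Callable[[Dict[str, float]], float],
    x: Dict[str, float],
    baseline: Dict[str, float],
) -> Dict[str, float]:
    """Average each feature's marginal contribution over all feature orderings."""
    names = list(x)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))

    for order in orderings:
        current = dict(baseline)      # start with every feature at its baseline
        previous = predict(current)
        for name in order:
            current[name] = x[name]   # switch this feature to the patient's value
            updated = predict(current)
            totals[name] += updated - previous
            previous = updated

    return {name: total / len(orderings) for name, total in totals.items()}


def toy_score(features: Dict[str, float]) -> float:
    """Hypothetical score with an interaction term; not the paper's model."""
    d = features["dyspnoea"]
    v = features["outpatient_visits"]
    return 0.1 + 0.3 * d + 0.02 * v + 0.02 * d * v


patient = {"dyspnoea": 1.0, "outpatient_visits": 5.0}
reference = {"dyspnoea": 0.0, "outpatient_visits": 0.0}
contributions = shapley_values(toy_score, patient, reference)
```

The contributions sum to the difference between the patient's score and the baseline score, which is the additivity property that lets each point on such a plot be read as a signed push on one patient's prediction. Enumerating all orderings is only feasible for a handful of features.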

**Figure 4**
Univariate odds ratios for important model features. Shown are the relative feature importance and univariate odds ratios for the top features (union of the 20 most important features) in each model. Regardless of importance, some features are significantly more prominent in the long COVID clinic population, while others are more prominent in the non-long COVID clinic population. ·· denotes that the feature was not in the top 20 features for the model in that column. Conditions labelled chronic were associated with patients before their COVID-19 index. Diabetes was not separated by type. dx=diagnosis. med=medication. *Odds ratios exclude age, which has a non-linear relationship with long COVID.
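A univariate odds ratio for a single binary feature, as summarised in Figure 4, comes from a 2×2 table of feature presence against long COVID clinic attendance. The sketch below uses a Wald 95% confidence interval on the log scale; the interval method is an assumption (the caption does not specify one) and the function name is hypothetical.

```python
import math
from typing import Tuple


def odds_ratio(
    present_clinic: int,
    present_other: int,
    absent_clinic: int,
    absent_other: int,
    z: float = 1.96,
) -> Tuple[float, float, float]:
    """Return (odds ratio, lower bound, upper bound) for a 2x2 table.

    Requires all four cell counts to be positive.
    """
    cells = (present_clinic, present_other, absent_clinic, absent_other)
    if min(cells) <= 0:
        raise ValueError("all four cell counts must be positive")

    ratio = (present_clinic * absent_other) / (present_other * absent_clinic)
    # Standard error of log(OR) under the Wald approximation.
    se = math.sqrt(sum(1.0 / c for c in cells))
    lower = math.exp(math.log(ratio) - z * se)
    upper = math.exp(math.log(ratio) + z * se)
    return ratio, lower, upper
```

An interval lying entirely above 1 suggests the feature is more common among clinic patients; one entirely below 1 suggests the opposite.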

**Figure 5**
Example paths taken by the machine learning models to classify patients with potential long COVID. Force plots showing the contribution of individual features to the final predicted probability of long COVID, as generated for individual patients by the all-patients model (A), hospitalised model (B), and non-hospitalised model (C). Features in red increase the predicted probability of long COVID classification by the model, whereas features in blue decrease that probability. The length of the bar for a given feature is proportional to the effect that feature has on the prediction for that patient. The final predicted probability is shown in bold. GERD=gastroesophageal reflux disease.
