EBioMedicine. 2023 Oct:96:104777.
doi: 10.1016/j.ebiom.2023.104777. Epub 2023 Sep 4.

Predictive models of long COVID

Blessy Antony et al. EBioMedicine. 2023 Oct.

Abstract

Background: The cause and symptoms of long COVID are poorly understood. It is challenging to predict whether a given COVID-19 patient will develop long COVID in the future.

Methods: We used electronic health record (EHR) data from the National COVID Cohort Collaborative (N3C) to predict the incidence of long COVID. We trained two machine learning (ML) models: logistic regression (LR) and random forest (RF). Features used to train the predictors included symptoms and drugs ordered during acute infection, measures of COVID-19 treatment, pre-COVID comorbidities, and demographic information. We assigned the 'long COVID' label to patients diagnosed with the ICD-10-CM code U09.9. The cohorts included patients with (a) EHRs reported from data partners using the U09.9 ICD-10-CM code and (b) at least one EHR in each feature category. We analysed three cohorts: all patients (n = 2,190,579; diagnosed with long COVID = 17,036), inpatients (n = 149,319; long COVID = 3,295), and outpatients (n = 2,041,260; long COVID = 13,741).
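The cohort counts above imply a strongly imbalanced label. As a minimal sketch (the dictionary layout and the `prevalence` helper are illustrative, not taken from the paper), the share of labelled patients in each cohort can be computed directly from the reported numbers:

```python
# Cohort sizes and long COVID (U09.9) counts as reported in the Methods.
# Each entry is (number of patients, number diagnosed with long COVID).
cohorts = {
    "all patients": (2_190_579, 17_036),
    "inpatients": (149_319, 3_295),
    "outpatients": (2_041_260, 13_741),
}


def prevalence(n_patients, n_long_covid):
    """Fraction of a cohort carrying the long COVID label."""
    return n_long_covid / n_patients


for name, (n, k) in cohorts.items():
    print(f"{name}: {prevalence(n, k):.2%} labelled long COVID")

# The inpatient and outpatient counts add up to the all-patient counts.
n_all, k_all = cohorts["all patients"]
n_in, k_in = cohorts["inpatients"]
n_out, k_out = cohorts["outpatients"]
assert n_in + n_out == n_all
assert k_in + k_out == k_all
```

The labelled share is roughly 0.8% of all patients, 2.2% of inpatients, and 0.7% of outpatients. With a positive class this rare, the precision-recall baseline of an uninformative predictor sits close to the prevalence itself, which is useful context when reading AUPRC scores.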

Findings: The LR and RF models yielded median AUROCs of 0.76 and 0.75, respectively. An ablation study revealed that drugs had the highest influence on the prediction task. The SHAP method identified age, gender, cough, fatigue, albuterol, obesity, diabetes, and chronic lung disease as explanatory features. Models trained on data from one N3C partner and tested on data from the other partners had an average AUROC of 0.75.
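For readers unfamiliar with the metric, AUROC can be read as the probability that a randomly chosen long COVID patient receives a higher predicted score than a randomly chosen patient without the label. The sketch below computes it by direct pairwise comparison; the `auroc` function and the toy labels and scores are illustrative assumptions, not the authors' code or data, and the quadratic loop is only practical for small samples.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: two negatives and two positives with made-up scores.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(toy_labels, toy_scores))  # 3 of 4 pairs are ordered correctly: 0.75
```

A score of 0.5 corresponds to random ranking and 1.0 to perfect separation, so values around 0.75 indicate that about three in four such pairs are ordered correctly.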

Interpretation: ML-based classification using EHR information from the acute infection period is effective in predicting long COVID. SHAP methods identified important features for prediction. Cross-site analysis demonstrated the generalizability of the proposed methodology.

Funding: NCATS U24 TR002306, NCATS UL1 TR003015, Axle Informatics Subcontract: NCATS-P00438-B, NIH/NIDDK/OD, PSR2015-1720GVALE_01, G43C22001320007, and Director, Office of Science, Office of Basic Energy Sciences of the U.S. Department of Energy Contract No. DE-AC02-05CH11231.

Keywords: COVID-19; Classification; Cross-site analysis; Explainability; Long COVID.

Conflict of interest statement

Declaration of interests: J Loomba received consulting fees from Axle Informatics as a subject matter expert for RadxUp Long COVID computational challenge (L3C). The other authors declare no competing interests.

Figures

Fig. 1
Definition of all patient, inpatient, and outpatient cohorts. The number of patients at each stage of the definition of the cohort of all patients. The dataset used for training and testing the prediction models consisted of 2,190,579 patients (data from 39 data partner sites) having at least one record in any of the five feature categories — comorbidities, drugs, symptoms, demographics, and measures of COVID-19 treatment. Of these COVID-19 positive patients, the number of long COVID patients, i.e., diagnosed with ICD-10-CM code U09.9, was 17,036.
Fig. 2
Long COVID prediction pipeline. Overview of the classification pipeline implemented for the prediction of long COVID.
Fig. 3
Evaluation of long COVID prediction models in all three patient cohorts. Distribution of (a) AUROC and (b) AUPRC scores from ten iterations of long COVID classification using logistic regression and random forest models for all patients, inpatients and outpatients. In each boxplot, the lower endpoint, the line in the middle, and the higher endpoint denote the first, second, and third quartiles of the distribution. The whiskers span 1.5 times the interquartile range. Diamonds denote values outside this range. The grey dotted line represents the expected score of a random predictor in the all-patient cohort.
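The boxplot conventions described in this caption can be reproduced from a list of per-iteration scores. The following is a sketch under stated assumptions: `box_stats` is a hypothetical helper, the quartiles use linear interpolation (`method="inclusive"`), and whiskers are drawn to the most extreme data points within 1.5 times the interquartile range of the box, which is a common plotting convention but is not confirmed by the source. The sample values are invented for illustration.

```python
import statistics


def box_stats(values):
    """Quartiles, whisker ends, and outliers for one boxplot."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": q2,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        # Points beyond the fences are the ones drawn as diamonds.
        "outliers": sorted(v for v in values if v < low_fence or v > high_fence),
    }


# Ten made-up values standing in for ten iterations of a score.
example = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
stats = box_stats(example)
print(stats)
```

In this toy input, the value 100 lies above the upper fence and would be plotted as a diamond, while the upper whisker stops at 9.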
Fig. 4
Importance of drug features in long COVID prediction in all three patient cohorts. The x-coordinate of each point is the AUROC score of a feature category combination and the y-coordinate is the score of the same combination but after including drug features. Each cohort is represented by a unique color and has 70 points (seven pairs of feature combinations and ten iterations each). The grey dotted line represents the x = y line.
Fig. 5
Importance of features in long COVID prediction models. Each row (along the y-axis) corresponds to a feature. The x-axis represents the mean absolute value of SHAP values of the given feature over all test set samples in one iteration. Each boxplot shows the distribution of these mean values for one feature across the iterations (maximum ten) in which it was selected by the Boruta method. The features are sorted in decreasing order of the median of the distribution of their mean absolute SHAP values. In each boxplot, the lower endpoint, the line in the middle, and the higher endpoint denote the first, second, and third quartiles of the distribution. The whiskers span 1.5 times the interquartile range. Diamonds denote values outside this range. The legend displays the mapping between feature category and colour.
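The aggregation described in this caption (mean absolute SHAP value per feature within an iteration, then ordering features by the median of those means across iterations) can be sketched as follows. The function names, the dictionary-based input layout, and the numbers are assumptions for illustration; the SHAP values themselves would come from an explainer applied to a trained model, which is not shown here.

```python
from statistics import mean, median


def mean_abs_shap(shap_values):
    """Mean |SHAP| per feature for one iteration.

    shap_values maps a feature name to its SHAP values over the test samples.
    """
    return {f: mean([abs(v) for v in vals]) for f, vals in shap_values.items()}


def rank_features(iterations):
    """Order features by the median of their per-iteration mean |SHAP|.

    A feature only contributes values for iterations in which it appears,
    mirroring features that were selected in some iterations but not others.
    """
    per_feature = {}
    for shap_values in iterations:
        for feature, m in mean_abs_shap(shap_values).items():
            per_feature.setdefault(feature, []).append(m)
    return sorted(per_feature, key=lambda f: median(per_feature[f]), reverse=True)


# Two made-up iterations; "cough" is absent from the second one.
toy_iterations = [
    {"age": [0.5, -0.5], "cough": [0.1, -0.1]},
    {"age": [1.0, -1.0]},
]
print(rank_features(toy_iterations))
```

Taking the absolute value before averaging matters: positive and negative contributions would otherwise cancel, and a feature that pushes predictions strongly in both directions would look unimportant.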
Fig. 6
Performance of long COVID prediction models in cross-site analysis. Results of cross-site analysis where we train a prediction model on data from only one data partner site and test on data from all other data partners. Distribution of AUROC values from ten iterations of prediction using logistic regression and random forest models when the training dataset comprises data from only (a) data partner 1 and (b) data partner 2. In each boxplot, the lower endpoint, the line in the middle, and the higher endpoint denote the first, second, and third quartiles of the distribution. The whiskers span 1.5 times the interquartile range. Diamonds denote values outside this range. The grey dotted line represents the expected score of a random predictor in the all-patient cohort.
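The cross-site design in this caption amounts to a split keyed on the data partner rather than a random split. A minimal sketch, assuming each patient record is a dictionary with a hypothetical `"site"` field (the field name, the `cross_site_split` helper, and the records are illustrative only):

```python
def cross_site_split(records, train_site):
    """Train on one data partner site, test on every other site."""
    train = [r for r in records if r["site"] == train_site]
    test = [r for r in records if r["site"] != train_site]
    if not train or not test:
        raise ValueError("both the training and the test split must be non-empty")
    return train, test


# Made-up records from three sites; "label" marks a long COVID diagnosis.
toy_records = [
    {"site": "partner_1", "label": 1},
    {"site": "partner_1", "label": 0},
    {"site": "partner_2", "label": 0},
    {"site": "partner_3", "label": 1},
]
train_set, test_set = cross_site_split(toy_records, "partner_1")
print(len(train_set), len(test_set))
```

Because no patient from the training site appears in the test set, a score obtained this way reflects how well a model transfers across sites rather than how well it fits one site's coding and prescribing habits.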

