[Preprint]. 2023 Dec 19:rs.3.rs-3749510.
doi: 10.21203/rs.3.rs-3749510/v1.

Genetic and Survey Data Improves Performance of Machine Learning Model for Long COVID

Wei-Qi Wei et al. Res Sq.

Abstract

Over 200 million SARS-CoV-2 patients have developed or will develop persistent symptoms (long COVID). Given this pressing research priority, the National COVID Cohort Collaborative (N3C) developed a machine learning model that uses only electronic health record data to identify potential patients with long COVID. We hypothesized that additional data from health surveys, mobile devices, and genotypes could improve prediction ability. In a cohort of SARS-CoV-2-infected individuals (n=17,755) in the All of Us program, we applied and expanded upon the N3C long COVID prediction model, testing machine learning infrastructures, assessing model performance, and identifying the factors that contributed most to the prediction models. Extreme gradient boosting delivered the best performance for predicting long COVID from the survey and mobile device information, and a convolutional neural network delivered the best performance from the genetic data. Combining survey, genetic, and mobile data increased specificity and the Area Under the Receiver Operating Characteristic Curve (AUROC) score relative to the original N3C model.
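
The two metrics named in the abstract, specificity and AUROC, can be illustrated with a minimal, dependency-free Python sketch. This is not the authors' evaluation code: the function names `specificity` and `auroc` and the toy labels and scores are assumptions made purely for illustration. AUROC is computed here through its pairwise interpretation, the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, with ties counted as one half.

```python
from typing import Sequence


def specificity(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """True negative rate: TN / (TN + FP), for binary labels 0/1."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tn + fp == 0:
        raise ValueError("specificity is undefined without negative cases")
    return tn / (tn + fp)


def auroc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    The pairwise loop is O(n_pos * n_neg): fine for a sketch, slow for
    large cohorts.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the study.
    labels = [0, 0, 1, 1]
    model_scores = [0.1, 0.4, 0.35, 0.8]
    hard_calls = [1 if s >= 0.4 else 0 for s in model_scores]
    print("specificity:", specificity(labels, hard_calls))
    print("AUROC:", auroc(labels, model_scores))
```

Specificity depends on the decision threshold applied to the scores, whereas AUROC summarizes ranking quality across all thresholds, which is why the two are reported separately.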

Conflict of interest statement

The authors declared no competing interests for this work.

Figures

Figure 1. Training, Testing, and Validation Set Parameters.
Flowchart describing how the AoU data was divided into sets based on data availability.
Figure 2. Model Output Sources and Predictive Method.
Flowchart showing how the predictive components of separate models are combined into a single predictive outcome.
Figure 3. Cohort Data Availability.
Venn diagram showing the data available for AoU participants.
Figure 4. Average ROC of Models.
Average ROC curve for each combination of models, computed from the cross-validation results. The shaded area represents the standard deviation of the ROC curve, and color indicates the models used.
Figure 5. Top Twenty Predictive Features in the Combined Model.
The relative contribution of the 20 most predictive features in our predictive method. The color of each bar indicates the data source of the feature.
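
Figure 2 describes combining the predictive components of separate models into a single outcome, and Figures 1 and 3 indicate that not every participant has every data type. This record does not state how the authors combined the components, so the sketch below shows only one common option under stated assumptions: a weighted average over whichever component probabilities are available for a participant. The function name `combine_scores`, the source names, and the weights are hypothetical.

```python
from typing import Mapping, Optional


def combine_scores(
    component_scores: Mapping[str, Optional[float]],
    weights: Optional[Mapping[str, float]] = None,
) -> float:
    """Weighted average of the component probabilities that are present.

    component_scores maps a data source name (for example "ehr") to that
    model's predicted probability, or None when the participant lacks that
    data type. weights is optional; if given, it must contain a weight for
    every available source. This is an illustrative fusion rule, not the
    method used in the study.
    """
    available = {k: v for k, v in component_scores.items() if v is not None}
    if not available:
        raise ValueError("no component scores available for this participant")
    if weights is None:
        weights = {name: 1.0 for name in available}
    total_weight = sum(weights[name] for name in available)
    if total_weight <= 0:
        raise ValueError("weights of available components must sum to > 0")
    weighted_sum = sum(weights[name] * score for name, score in available.items())
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Hypothetical participant with EHR and survey data but no genotype.
    participant = {"ehr": 0.6, "survey": 0.8, "genetic": None}
    print(combine_scores(participant))
```

Renormalizing over the available sources keeps the combined score on the same 0 to 1 scale regardless of how many data types a participant contributed.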
