Crowd-sourced machine learning prediction of long COVID using data from the National COVID Cohort Collaborative

Timothy Bergquist¹, Johanna Loomba², Emily Pfaff³, Fangfang Xia⁴, Zixuan Zhao⁴, Yitan Zhu⁴, Elliot Mitchell⁵, Biplab Bhattacharya⁵, Gaurav Shetty⁵, Tamanna Munia⁵, Grant Delong⁵, Adbul Tariq⁵, Zachary Butzin-Dozier⁶, Yunwen Ji⁶, Haodong Li⁶, Jeremy Coyle⁶, Seraphina Shi⁶, Rachael V Philips⁶, Andrew Mertens⁶, Romain Pirracchio⁷, Mark van der Laan⁶, John M Colford Jr⁶, Alan Hubbard⁶, Jifan Gao⁸, Guanhua Chen⁸, Neelay Velingker⁹, Ziyang Li⁹, Yinjun Wu⁹, Adam Stein⁹, Jiani Huang⁹, Zongyu Dai⁹, Qi Long⁹, Mayur Naik⁹, John Holmes⁹, Danielle Mowery⁹, Eric Wong⁹, Ravi Parekh⁹, Emily Getzen⁹, Jake Hightower¹⁰, Jennifer Blase¹⁰; Long COVID Computational Challenge Participants; N3C Consortium

PMID: 39321500
PMCID: PMC11462169
DOI: 10.1016/j.ebiom.2024.105333

Timothy Bergquist et al. EBioMedicine. 2024 Oct;108:105333. doi: 10.1016/j.ebiom.2024.105333. Epub 2024 Sep 24.

Abstract

Background: While many patients seem to recover from SARS-CoV-2 infections, many others report symptoms that persist for weeks or months after their acute COVID-19 ends, and some develop new symptoms weeks after infection. These long-term effects are called post-acute sequelae of SARS-CoV-2 (PASC) or, more commonly, Long COVID. The overall prevalence of Long COVID is currently unknown, and tools are needed to help identify patients at risk of developing Long COVID.

Methods: A working group of the Rapid Acceleration of Diagnostics-radical (RADx-rad) program, composed of individuals from various NIH institutes and centers, in collaboration with REsearching COVID to Enhance Recovery (RECOVER), developed and organized the Long COVID Computational Challenge (L3C), a community challenge aimed at incentivizing the broader scientific community to develop interpretable and accurate methods for identifying patients at risk of developing Long COVID. From August 2022 to December 2022, participants developed Long COVID risk prediction algorithms using the National COVID Cohort Collaborative (N3C) data enclave, a harmonized data repository with data from over 75 healthcare institutions across the United States (U.S.).

Findings: Over the course of the challenge, 74 teams designed and built 35 Long COVID prediction models using the N3C data enclave. The top 10 teams all achieved an area under the receiver operating characteristic curve (AUROC) above 0.80, and the highest-scoring model achieved a mean AUROC of 0.895. The top submission included a visualization dashboard that built a timeline for each patient and updated the patient's risk of developing Long COVID in response to clinical events.

Interpretation: As a result of L3C, federal reviewers identified multiple machine learning models that can be used to identify patients at risk of developing Long COVID. Many of the teams used approaches in their submissions that can be applied to future clinical prediction questions.

Funding: Research reported in this RADx® Rad publication was supported by the National Institutes of Health. Timothy Bergquist, Johanna Loomba, and Emily Pfaff were supported by Axle Subcontract: NCATS-STSS-P00438.

Keywords: COVID-19; Community challenge; Evaluation; Long COVID; Machine learning; PASC.

Conflict of interest statement

Declaration of interests: Danielle Mowery serves as an unpaid member of the Epic Cosmos Governing Council. Romain Pirracchio received funding from the FDA CERSI grant U01FD005978 and the PCORI grant P0562155 and received a consulting honorarium from Phillips. Mark van der Laan received funding from the NIAID grant 5R01AI074345. Johanna Loomba received contract funding from the NIH RECOVER program. Emily Pfaff received funding from the NIH and PCORI. Qi Long was supported by grants from the NIH. The views expressed in this manuscript are solely those of the authors and do not necessarily represent those of the National Institutes of Health, the U.S. Department of Health and Human Services or the U.S. government.

Figures

**Fig. 1**
**Censoring protocol of patient records.** All records in the N3C enclave dated prior to the COVID index date (clinical history) or within 4 weeks after it (4-week acute window) were available for use by the models. All records after the 4-week acute window were removed from the training and testing datasets. An ICD-10-CM code U09.9 recorded after the 4-week acute window indicated a patient with Long COVID.
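The censoring rule in this caption can be illustrated with a minimal sketch. The record layout (a list of `(event_date, code)` tuples), the function name `censor_patient`, and the choice to treat the last day of the window as still inside it are assumptions made for illustration; they are not the N3C schema or the challenge organizers' code.

```python
from datetime import date, timedelta

ACUTE_WINDOW = timedelta(weeks=4)   # 4-week acute window from the caption
LONG_COVID_CODE = "U09.9"           # ICD-10-CM code named in the caption


def censor_patient(records, index_date):
    """Split one patient's records into model inputs and a Long COVID label.

    `records` is a list of (event_date, code) tuples. This layout is a
    hypothetical simplification, not the N3C data model.
    """
    cutoff = index_date + ACUTE_WINDOW  # assumption: the cutoff day itself is inside the window
    # Keep clinical history and the acute window; drop everything later.
    features = [(d, c) for d, c in records if d <= cutoff]
    # A U09.9 code recorded after the acute window marks the patient as Long COVID.
    label = any(c == LONG_COVID_CODE and d > cutoff for d, c in records)
    return features, label


# Hypothetical patient: index date 2022-01-10, U09.9 recorded about nine weeks later.
example_records = [
    (date(2021, 12, 1), "E11.9"),
    (date(2022, 1, 10), "U07.1"),
    (date(2022, 1, 20), "R05"),
    (date(2022, 3, 15), "U09.9"),
]
example_features, example_label = censor_patient(example_records, date(2022, 1, 10))
```

In this sketch the label is derived from the full record before censoring, so the outcome code itself never reaches the feature set.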

**Fig. 2**
**Performance metrics for Convalesco’s highest scoring submission.** The calibration curves and receiver operating characteristic (ROC) curves from Convalesco’s highest scoring submission. Each sub-graph shows individual model performances from Convalesco’s submission. The “Main Model” is the model that was evaluated and scored for the L3C evaluation. Model 100 includes only 100 temporal features, Model 36 includes just the top 36 temporal features, and Model Z includes the same 100 temporal features but excludes racial information and data contributor identifiers. (a) The calibration curves from the model on the Hold Out Testing dataset. (b) The calibration curves from the model on the Two Site Testing dataset. (c) The calibration curves from the model on the level 3 post-challenge Limited Testing dataset. (d) The ROC curves from the model on the Hold Out Testing dataset. (e) The ROC curves from the model on the Two Site Testing dataset. (f) The ROC curves from the model on the level 3 post-challenge Limited Testing dataset. While the model was not well calibrated to the Hold Out Testing dataset, it generalized well to two out-of-sample datasets from separate data contributing partners and improved further after re-training and evaluation on the level 3 limited dataset.
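For readers unfamiliar with the two metrics plotted in this figure, the sketch below shows one common way to compute an AUROC and the points of a calibration curve from labels and predicted risks. It is a generic pure-Python illustration with hypothetical function names and equal-width bins chosen as an assumption; it is not the evaluation code used in L3C.

```python
def auroc(y_true, y_score):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def calibration_curve(y_true, y_score, n_bins=5):
    """Return (mean predicted risk, observed event rate) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for y, s in zip(y_true, y_score):
        idx = min(int(s * n_bins), n_bins - 1)  # a score of exactly 1.0 goes in the last bin
        bins[idx].append((s, y))
    curve = []
    for members in bins:
        if members:
            mean_pred = sum(s for s, _ in members) / len(members)
            observed = sum(y for _, y in members) / len(members)
            curve.append((mean_pred, observed))
    return curve


# Toy labels and predicted risks (made up for illustration).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(labels, scores))
print(calibration_curve(labels, scores, n_bins=2))
```

A well-calibrated model yields points close to the diagonal, where mean predicted risk equals the observed event rate. A model can rank patients well (high AUROC) and still be poorly calibrated, which is the pattern the caption describes for the Hold Out Testing dataset.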

**Fig. 3**
**Interpretability dashboard from Convalesco’s submission.** The chart represents a prototype patient risk timeline. The top graph shows the single-event contributions toward the predicted PASC risk at Day 28. The risk change for each event was calculated as the difference between the final prediction and the hypothetical risk predicted from all data except that event. Only a subsample of events is shown. The bottom chart shows the day-by-day predictions of cumulative risk based on events prior to each day.
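The two quantities described in this caption, leave-one-event-out risk changes and day-by-day cumulative risk, can be sketched as follows. The toy logistic model, its weights, the event names, and the `(day, name)` event layout are all hypothetical stand-ins chosen for illustration; Convalesco's actual model and dashboard code are not shown in this record.

```python
import math

# Hypothetical per-event weights for a toy logistic risk model (illustrative only).
WEIGHTS = {"dyspnea": 0.9, "hospitalization": 1.2, "vaccination": -0.7}
BIAS = -2.0


def predict_risk(events):
    """Toy stand-in for a trained model: logistic of the summed event weights."""
    z = BIAS + sum(WEIGHTS.get(name, 0.0) for _, name in events)
    return 1.0 / (1.0 + math.exp(-z))


def event_contributions(events):
    """Risk change per event: final prediction minus the prediction without that event."""
    full = predict_risk(events)
    return [
        (day, name, full - predict_risk(events[:i] + events[i + 1:]))
        for i, (day, name) in enumerate(events)
    ]


def daily_risk(events, last_day=28):
    """Predicted risk for each day 0..last_day, using only events strictly before that day."""
    return [
        (day, predict_risk([e for e in events if e[0] < day]))
        for day in range(last_day + 1)
    ]


# Events as (day relative to the COVID index date, event name); values are made up.
timeline = [(-30, "vaccination"), (2, "dyspnea"), (5, "hospitalization")]
for day, name, delta in event_contributions(timeline):
    print(f"day {day:>3}  {name:<16} risk change {delta:+.3f}")
```

Because each contribution is measured against the final prediction, the per-event changes in a nonlinear model need not sum to the total risk; they show how much the prediction would move if a single event were absent.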
