[Preprint]. 2024 Jun 11:2023.09.11.23295259.
doi: 10.1101/2023.09.11.23295259.

Finding Long-COVID: Temporal Topic Modeling of Electronic Health Records from the N3C and RECOVER Programs

Shawn T O'Neil et al. medRxiv.

Abstract

Post-Acute Sequelae of SARS-CoV-2 infection (PASC), also known as Long-COVID, encompasses a variety of complex outcomes following COVID-19 infection that are still poorly understood. We clustered over 600 million condition diagnoses from 14 million patients available through the National COVID Cohort Collaborative (N3C), generating hundreds of highly detailed clinical phenotypes. Assessing patient clinical trajectories using these clusters allowed us to identify individual conditions and phenotypes that increased strongly after acute infection. We found many conditions increased in COVID-19 patients compared to controls, and using a novel method to associate patients with clusters over time, we additionally found phenotypes specific to patient sex, age, wave of infection, and PASC diagnosis status. While many of these results reflect known PASC symptoms, the resolution provided by this unprecedented data scale suggests avenues for improved diagnostics and mechanistic understanding of this multifaceted disease.

Conflict of interest statement

Competing Interests: The authors declare no competing interests.

Figures

Figure 1: Experimental design summary.
(1) We trained an LDA topic model on a broad set of N3C patient data, tuning and evaluating the model with a held-out validation set using the UCI coherence metric. (2) Within a separate held-out assessment patient set, we defined three cohorts: PASC (patients with Long COVID), COVID (COVID-19 only), and Control (neither). For these patients we defined a 1-year pre-infection phase and a 6-month post-infection phase, using a mock infection date for Control patients. (3) For the top 20 conditions per topic, we assessed new onset rates for COVID and PASC patients compared to Controls in the post-infection phase. (4) Finally, we defined per-topic logistic models, with outcome variables as the topic model’s assigned probabilities to individual patient phase data. Model coefficients then relate patient demographics, cohort, infection phase, and combinations of these factors to topic assignment for further study.
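
Step (1) mentions the UCI coherence metric. As a rough illustration only, the sketch below computes a UCI-style score as the mean pairwise pointwise mutual information (PMI) of a topic's top terms. It assumes document-level co-occurrence over a toy corpus; the function name `uci_coherence`, the smoothing constant `eps`, and the condition names are hypothetical and are not taken from the paper.

```python
import math
from itertools import combinations

def uci_coherence(top_terms, documents, eps=1e-12):
    """Mean pairwise PMI of a topic's top terms (UCI-style coherence).

    Assumption: co-occurrence is counted per document (here, per patient),
    not over a sliding window in an external reference corpus.
    """
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*terms):
        # Fraction of documents that contain every one of the given terms.
        return sum(all(t in d for t in terms) for d in doc_sets) / n_docs

    scores = []
    for a, b in combinations(top_terms, 2):
        p_a, p_b = prob(a), prob(b)
        if p_a == 0 or p_b == 0:
            continue  # a term never seen in the corpus carries no information
        scores.append(math.log((prob(a, b) + eps) / (p_a * p_b)))
    return sum(scores) / len(scores) if scores else float("nan")

# Toy "patients", each a bag of hypothetical condition names.
patients = [
    ["fatigue", "dyspnea", "cough"],
    ["fatigue", "dyspnea"],
    ["fracture", "sprain"],
    ["fracture", "sprain", "cough"],
]
coherent = uci_coherence(["fatigue", "dyspnea"], patients)
mixed = uci_coherence(["fatigue", "fracture"], patients)
```

Terms that tend to appear in the same patients score above zero, while terms that never co-occur score far below it, which is why the metric can be used to compare candidate models on held-out data.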
Figure 2: Word clouds illustrating top-weighted conditions for selected topics.
Conditions are sized according to probability within each topic and colored according to relevance, with positive relevance indicating conditions more probable in the topic than overall. Each condition displays the numeric OMOP concept ID encoding the relevant medical code used for clustering, as well as the first few words of the condition name. Per-topic statistics in panel headers show usage of each topic across sites (U, rounded to nearest 0.1%), topic uniformity across sites (H, 0–1, higher values being more uniform), and relative topic quality as a normalized coherence score (C, z-score, higher values being more coherent).
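
The caption does not give formulas for relevance or for the uniformity statistic H. The sketch below shows one plausible reading, stated as an assumption: relevance as the lambda-weighted measure of Sievert and Shirley, which reduces to a log lift when `lam=0` and is then positive exactly when a condition is more probable in the topic than overall, and H as Shannon entropy of topic usage across sites divided by its maximum. The function names and the default `lam` are hypothetical.

```python
import math

def relevance(p_term_given_topic, p_term_overall, lam=0.0):
    """Lambda-weighted term relevance; with lam=0 this is the log lift."""
    return (lam * math.log(p_term_given_topic)
            + (1 - lam) * math.log(p_term_given_topic / p_term_overall))

def uniformity(usage_by_site):
    """Shannon entropy of topic usage across sites, scaled to [0, 1].

    Assumption: H in the caption is a normalized entropy; the caption
    itself only states the range and the direction of the scale.
    """
    if len(usage_by_site) < 2:
        raise ValueError("need at least two sites to measure uniformity")
    total = sum(usage_by_site)
    probs = [u / total for u in usage_by_site if u > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(usage_by_site))

# A condition four times more probable in the topic than overall.
enriched = relevance(0.20, 0.05)
# A condition five times less probable in the topic than overall.
depleted = relevance(0.01, 0.05)
# Usage spread evenly over four sites versus concentrated at one site.
even = uniformity([1, 1, 1, 1])
concentrated = uniformity([1, 0, 0, 0])
```

Under this reading, a topic used by only one site scores 0 and a topic used equally everywhere scores 1, matching the stated direction of H.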
Figure 3: Increased and decreased new-onset conditions in PASC and COVID patients compared to Controls post-infection.
The x-axis shows estimated odds ratios and the y-axis shows the adjusted p-values for new incidence of top-weighted, positive-relevance terms from all topics amongst COVID (left) and PASC (right) cohorts compared to Controls, in the six-month post-acute period compared to the previous year. Many known PASC-associated conditions increased in both cohorts, while some conditions are cohort-specific. Additionally, in the COVID cohort, incidence of many conditions associated with regular care or screening is reduced compared to Controls.
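
To make the two axes concrete, the sketch below computes an odds ratio from a simple 2x2 table and adjusts a list of p-values for multiple testing. This is a simplified illustration, not the authors' estimation procedure: the caption does not name the correction method, so Benjamini-Hochberg is an assumption here, and all counts and p-values are made up.

```python
def odds_ratio(cohort_events, cohort_total, control_events, control_total):
    """Odds ratio of new onset in a cohort relative to controls (2x2 table)."""
    a = cohort_events
    b = cohort_total - cohort_events
    c = control_events
    d = control_total - control_events
    return (a * d) / (b * c)

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is capped
    # by the adjusted values of all larger p-values (keeps monotonicity).
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical counts: 30 of 1000 cohort patients versus 10 of 1000 controls.
example_or = odds_ratio(30, 1000, 10, 1000)
# Hypothetical raw p-values for four conditions.
example_adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

An odds ratio above 1 places a condition on the increased side of the plot, and the adjusted p-value determines how high it sits once the number of conditions tested is taken into account.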
Figure 4: Topics with significant OR estimates >2 for at least two demographic groups.
The top row illustrates topics using the same color and size scales as Figure 2; OR estimates are shown for demographic-specific contrasts of PASC or COVID pre-vs-post odds ratios compared to similar Control odds ratios. For example, adult PASC patients increase odds of generating conditions from T-23 post-infection nearly 10 times more than Controls do over a similar timeframe (see Results). Lines show 95% confidence intervals for estimates; semi-transparent estimates are shown for context but were not significant after multiple-test correction.
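
The contrast described here is a ratio of two odds ratios: the cohort's pre-vs-post odds ratio divided by the Controls' pre-vs-post odds ratio. The sketch below illustrates that arithmetic with a Wald-style 95% interval on the log scale. It is a simplification under stated assumptions: it uses plain counts rather than the per-topic logistic models the paper describes, it treats the four groups as independent (which pre/post observations on the same patients would not be), and all names and counts are hypothetical.

```python
import math

def log_odds_ratio(post_yes, post_no, pre_yes, pre_no):
    """Log odds ratio (post vs pre) and its approximate variance."""
    log_or = math.log((post_yes * pre_no) / (post_no * pre_yes))
    variance = 1 / post_yes + 1 / post_no + 1 / pre_yes + 1 / pre_no
    return log_or, variance

def ratio_of_odds_ratios(cohort, control, z=1.96):
    """Cohort OR divided by Control OR, with a Wald confidence interval."""
    cohort_log_or, cohort_var = log_odds_ratio(*cohort)
    control_log_or, control_var = log_odds_ratio(*control)
    diff = cohort_log_or - control_log_or
    se = math.sqrt(cohort_var + control_var)
    return math.exp(diff), math.exp(diff - z * se), math.exp(diff + z * se)

# Hypothetical counts as (post_yes, post_no, pre_yes, pre_no).
estimate, lower, upper = ratio_of_odds_ratios(
    cohort=(200, 800, 50, 950),
    control=(60, 940, 50, 950),
)
```

In a logistic model the same quantity corresponds to the exponentiated coefficient of a cohort-by-phase interaction term, and an interval that excludes 1 indicates the cohort's pre-to-post change differs from that of Controls.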
Figure 5: Other select topics with demographic or cohort-specific trends.
T-8 is statistically significant only for COVID adults compared to Controls. T-72 and T-77 include diffuse sets of conditions, while T-36 is reduced for PASC pediatric and senior patients, despite representing known PASC outcomes (see Discussion).
