Extracting post-acute sequelae of SARS-CoV-2 infection symptoms from clinical notes via hybrid natural language processing

Affiliations

¹ Population Health Sciences, Weill Cornell Medicine, New York, USA.
² Nemours Children's Health, Wilmington, USA.
³ RECOVER Patient, Caregiver, or Community Advocate Representative, New York, USA.
⁴ Applied Clinical Research Center, Children's Hospital of Philadelphia, Philadelphia, USA.
⁵ Louisiana Public Health Institute, New Orleans, USA.

PMID: 40958972
PMCID: PMC12435580
DOI: 10.1038/s44401-025-00033-4

Zilong Bai et al. Npj Health Syst. 2025 Aug 21:2:10.1038/s44401-025-00033-4.

doi: 10.1038/s44401-025-00033-4. Online ahead of print.

Abstract

Accurately and efficiently diagnosing Post-Acute Sequelae of COVID-19 (PASC) remains challenging because its many symptoms evolve over long and variable time intervals. To address this issue, we developed a hybrid natural language processing pipeline that integrates rule-based named entity recognition with BERT-based assertion detection to extract PASC symptoms from clinical notes and determine their assertion status. We developed a comprehensive PASC lexicon with clinical specialists. From 11 health systems of the RECOVER initiative network across the U.S., we curated 160 intake progress notes for model development and evaluation, and collected 47,654 progress notes for a population-level prevalence study. We achieved an average F1 score of 0.82 in one-site internal validation and 0.76 in 10-site external validation for assertion detection. Our pipeline processed each note in 2.448 ± 0.812 seconds on average. Spearman correlation tests showed ρ > 0.83 for positive mentions and ρ > 0.72 for negative ones, both with P < 0.0001. These results demonstrate the effectiveness and efficiency of our pipeline and its potential for improving PASC diagnosis.
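To make the two-stage idea concrete, the following is a minimal sketch of lexicon-based symptom matching followed by an assertion step. It is not the MedText pipeline described above: the lexicon entries, the `NEGATION_CUES` pattern, the `Mention` class, and the `extract_mentions` function are all hypothetical, and a simple rule-based negation check stands in for the BERT-based assertion detection module.

```python
import re
from dataclasses import dataclass

# Hypothetical toy lexicon (category -> surface forms). The lexicon in the
# paper was built with clinical specialists and is far more comprehensive.
TOY_LEXICON = {
    "fatigue": ["fatigue", "tiredness"],
    "respiratory": ["shortness of breath", "cough"],
    "neurological": ["headache", "brain fog"],
}

# Hypothetical negation cues. This rule is only a stand-in for the
# BERT-based assertion detection used in the paper.
NEGATION_CUES = re.compile(
    r"\b(no|denies|denied|without|negative for)\b", re.IGNORECASE
)


@dataclass(frozen=True)
class Mention:
    category: str
    text: str
    start: int
    assertion: str  # "present" or "non-present"


def extract_mentions(note: str, window: int = 30) -> list[Mention]:
    """Find lexicon terms in a note and label each as present or non-present."""
    mentions = []
    for category, terms in TOY_LEXICON.items():
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            for m in re.finditer(pattern, note, re.IGNORECASE):
                # Inspect up to `window` characters before the mention,
                # truncated to the current sentence or clause.
                left = note[max(0, m.start() - window):m.start()]
                left = re.split(r"[.;\n]", left)[-1]
                negated = NEGATION_CUES.search(left) is not None
                mentions.append(
                    Mention(
                        category=category,
                        text=m.group(0),
                        start=m.start(),
                        assertion="non-present" if negated else "present",
                    )
                )
    return sorted(mentions, key=lambda mention: mention.start)


if __name__ == "__main__":
    example = "Patient reports fatigue and brain fog. Denies cough."
    for mention in extract_mentions(example):
        print(mention.category, mention.text, mention.assertion)
```

Separating entity recognition from assertion detection keeps the lexicon auditable by clinicians while allowing the assertion step to be swapped for a learned model.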

Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

**Fig. 1 |**
PASC Lexicon with Examples of Representative Symptoms.

**Fig. 2 |**
The architecture of the NLP pipeline.

**Fig. 3 |. The performance of three BERT variants on the WCM internal validation set.**
Boxplots for the performance metrics—(a) Precision, (b) Recall, and (c) F1-score. ns not significant; **P ≤ 0.01; ***P ≤ 0.001; ****P ≤ 0.0001.

**Fig. 4 |. Multi-site External Validation.**
Radar charts of the performance metrics, (a) precision, (b) recall, and (c) F1-score, for the three fine-tuned MedText-BERT pipeline variants on the 100-note non-WCM multi-site external validation set. The metrics are computed for the positive (i.e., “Present”) symptom mentions. The BERT-based models trained/fine-tuned in different scenarios for assertion detection are: BioBERT fine-tuned, BiomedBERT fine-tuned, BiomedBERT benchmark, and ClinicalBERT fine-tuned from left to right in each subfigure. (a) Seattle - Seattle Children’s; (b) Monte - Montefiore Medical Center; (c) CHOP - The Children’s Hospital of Philadelphia; (d) OCHIN - Oregon Community Health Information Network; (e) Missouri - University of Missouri; (f) CCHMC - Cincinnati Children’s Hospital Medical Center; (g) Nemours - Nemours Children’s Health; (h) MCW - Medical College of Wisconsin; (i) Nationwide - Nationwide Children’s Hospital; (j) UTSW - UT Southwestern Medical Center.
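As a reminder of how metrics restricted to the positive class are usually computed, here is a minimal sketch. The function name `prf1_for_positive` and the label strings are assumptions for illustration; the paper's evaluation code may differ in how mentions are aligned and counted.

```python
def prf1_for_positive(gold, pred, positive="present"):
    """Precision, recall, and F1 for one class, given aligned label lists."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    # Guard against empty denominators by reporting 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With this definition, mentions that are correctly labeled as non-present do not contribute to any of the three scores.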

**Fig. 5 |. Population-level Prevalence Study.**
A Frequency analysis of positive (“present”, in red) and negative (“non-present”, in blue) symptom category occurrences at different sites. B Spearman correlation coefficients between the positive (i.e., “present”) symptom-mentioning patterns of sites and the overall dataset. C Spearman correlation coefficients between the negative (i.e., “non-present”) symptom-mentioning patterns of sites and the overall dataset. (a) Seattle Children’s, (b) Montefiore Medical Center, (c) The Children’s Hospital of Philadelphia, (d) Oregon Community Health Information Network, (e) University of Missouri, (f) Cincinnati Children’s Hospital Medical Center, (g) Nemours Children’s Health System, (h) Medical College of Wisconsin, (i) Nationwide Children’s Hospital, (j) UT Southwestern Medical Center, (k) Weill Cornell Medicine, (l) Total.
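The site-versus-overall comparison can be illustrated with a small, dependency-free Spearman computation. This is a sketch under assumptions: the helper names `average_ranks` and `spearman_rho` are hypothetical, the inputs are taken to be per-category mention counts, and no P value is computed. In practice a statistics library would normally be used.

```python
def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (sx * sy)
```

Because only ranks matter, a site with far fewer notes can still correlate strongly with the overall dataset as long as its symptom categories are ordered similarly by frequency.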

**Fig. 6 |. Cross-symptom-category correlation test and symptom-category distribution.**
Spearman correlation coefficients between the positive (i.e., “Present”) symptom-mentioning patterns of symptom categories. The total count of positive symptom mentions for each symptom category is to the right of the correlation diagram.

**Fig. 7 |. Module-wise runtime summary of MedText processing.**
The mean (shown by the values on each bar) and standard deviation of the runtime of each MedText module, computed from the mean runtime per note across the 11 sites.

**Fig. 8 |**
Data construction workflow.

**Fig. 9 |. Example screenshot for Screen_Tool.**
Screen_Tool is open-source, R-based software for manual annotation of symptom mentions.
