JAMIA Open. 2025 Sep 4;8(5):ooaf089.
doi: 10.1093/jamiaopen/ooaf089. eCollection 2025 Oct.

A natural language processing pipeline for identifying pediatric long COVID symptoms and functional impacts in freeform clinical notes: a RECOVER study

H Timothy Bunnell et al. JAMIA Open.

Abstract

Objective: To develop a natural language processing (NLP) pipeline for unstructured electronic health record (EHR) data to identify symptoms and functional impacts associated with Long COVID in children.

Materials and methods: We analyzed 48 287 outpatient progress notes from 10 618 pediatric patients from 12 institutions. We evaluated notes obtained 28 to 179 days after a COVID-19 diagnosis or positive test. Two samples were examined: patients with evidence of Long COVID and patients with acute COVID but no evidence of Long COVID based on diagnostic codes. The pipeline identified clinical concepts associated with 21 symptoms and 4 functional impact categories. Subject matter experts (SMEs) screened a sample of 4586 terms from the NLP output to assess pipeline accuracy. Prevalence and concordance of each of the 25 concepts were compared between the 2 patient samples.
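
As a rough illustration of the note selection and term matching described above, the following Python sketch filters notes to the 28 to 179 day window and tags concepts with regular expressions. The concept names, the patterns, and the function names are hypothetical (the abstract does not list the study's term sets), and the sketch omits the assertion-status step, so a negated mention such as "denies headache" would still match here.

```python
import re
from datetime import date, timedelta

# Hypothetical patterns for two concepts; the study's actual term lists
# are not given in the abstract.
CONCEPT_PATTERNS = {
    "fatigue": re.compile(r"\b(fatigue|tired(ness)?|exhaust(ed|ion))\b", re.IGNORECASE),
    "headache": re.compile(r"\b(headaches?|migraines?)\b", re.IGNORECASE),
}


def in_window(index_date, note_date, lo=28, hi=179):
    """True if the note was written 28 to 179 days (inclusive) after the index date."""
    return lo <= (note_date - index_date).days <= hi


def tag_concepts(text):
    """Return the sorted names of concepts whose pattern matches the note text."""
    return sorted(name for name, pattern in CONCEPT_PATTERNS.items() if pattern.search(text))


# Usage: keep only in-window notes, then tag each one.
index_date = date(2022, 1, 1)  # assumed date of diagnosis or positive test
notes = [
    (index_date + timedelta(days=10), "Acute visit, reports headache."),
    (index_date + timedelta(days=60), "Patient reports headaches and feels tired."),
]
tagged = [tag_concepts(text) for note_date, text in notes if in_window(index_date, note_date)]
```

Whether the study treated the window boundaries as inclusive is not stated in the abstract; the inclusive comparison here is an assumption.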

Results: A binary assertion measure comparing SME and NLP assertions showed moderate accuracy (N = 4133; F1 = .80) and improved substantially when only high-confidence SME assertions were considered (N = 2043; F1 = .90). Overall, the 25 Long COVID concept categories were markedly more prevalent in the presumptive Long COVID cohort, and differences were noted between concepts identified in notes versus structured data.
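
For readers unfamiliar with the metric, a minimal sketch of an F1 computation over paired binary assertions is given below. The labels are toy values, not study data, and treating "present" (1) as the positive class is an assumption; the abstract does not say how the reported F1 values were computed or averaged.

```python
def f1_score(reference, predicted):
    """F1 for binary labels, treating 1 (asserted present) as the positive class."""
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r and p)       # both say present
    fp = sum(1 for r, p in pairs if not r and p)   # NLP says present, SME does not
    fn = sum(1 for r, p in pairs if r and not p)   # SME says present, NLP does not
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: hypothetical SME (reference) and NLP (predicted) assertions.
sme = [1, 1, 1, 0, 0, 1]
nlp = [1, 1, 0, 0, 1, 1]
score = f1_score(sme, nlp)
```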

Discussion: This preliminary analysis illustrates the additional insight into a syndrome such as Long COVID gained from incorporating notes data, characterizing symptoms and functional impacts.

Conclusion: These data support the importance of incorporating NLP methodology, when possible, into designing computable phenotypes and accurately characterizing patients with Long COVID.

Keywords: NLP; PEDSnet; RECOVER; pediatrics.

Conflict of interest statement

R.J. is a consultant for AstraZeneca, Seqirus, and Dynavax, and receives an editorial stipend from Elsevier and the Pediatric Infectious Diseases Society and royalties from UpToDate/Wolters Kluwer. P.P. reports funding from the National Institutes of Health and Bayer Pharmaceuticals. M.A.D. is funded by the National Heart, Lung, and Blood Institute (NHLBI), award number 1K01HL169493-01. S.R. reports prior grant support from GSK and Biofire and is a consultant for Seqirus. L.C.B. has received grants from the Patient-Centered Outcomes Research Institute. All other authors have no conflicts of interest to report.

Figures

Figure 1.
Schematic pipeline structure. Following note selection and cleaning, a Spark NLP pipeline performs NER and assigns assertion status to entities tagged as symptoms. Regular expressions are then used to identify terms associated with 25 Long COVID-related concepts. Those assigned an assertion status of Present via a composite assertion model constitute the features assigned to each note.
Figure 2.
Adjusted odds ratios and 95% CI for each Long COVID feature comparing feature prevalence among Long COVID versus COVID patients. Ratios greater than 1 with CIs that do not extend below 1 indicate the feature was significantly more prevalent in Long COVID patients.
Figure 3.
Within-patient adjusted odds ratios and 95% CI for each Long COVID feature comparing prevalence of concepts present in notes versus those present in structured data. Ratios greater than 1 with CIs that do not extend below 1 indicate the feature was significantly more prevalent in notes data. Ratios less than 1 with CIs that do not extend above 1 indicate features significantly more common in structured data.
Figure 4.
Comparison of feature distribution over patients for features identified through structured versus unstructured analyses.
Figure 5.
Patient counts for each concept from unstructured data only (UNS only), structured data only (STR only), or in both data sources (STR+UNS). Counts replaced by a dash represent 5 or fewer patients.
Figure 6.
Frequency of concept mentions asserted present within each site (relative to total mentions per site). (A) Pair-wise correlations between sites, with each cell showing the correlation in the number of mentions for each of the 25 concepts between the intersecting sites. The correlation is printed in each cell, and cell shading reflects the probability of the correlation arising by chance based on Bonferroni correction for 66 multiple comparisons. (B) Relative concept frequency by concept and site. The heavy black line is the average of all sites; the heavy red line is Site A, the site that appears to be less related to other sites in terms of concept frequencies.
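
The figure of 66 comparisons in the Figure 6 caption matches the number of unordered pairs among 12 sites, which agrees with the 12 institutions named in the abstract. The sketch below shows that count and the resulting Bonferroni-adjusted threshold. The site labels other than Site A and the family-wise alpha of 0.05 are assumptions, not values stated in the source.

```python
from itertools import combinations

# Twelve sites labeled A..L; only "Site A" is named in the caption,
# so the remaining labels are hypothetical.
sites = [chr(ord("A") + i) for i in range(12)]
site_pairs = list(combinations(sites, 2))  # every unordered pair of sites

alpha = 0.05  # assumed family-wise error rate; not stated in the abstract
bonferroni_threshold = alpha / len(site_pairs)


def is_significant(p_value):
    """True if a single pairwise correlation passes the Bonferroni-adjusted threshold."""
    return p_value < bonferroni_threshold
```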

