Development and validation of natural language processing algorithms in the national ENACT network

Yanshan Wang^{1,2,3}, Jordan Hilsman^{1,2}, Chenyu Li^{1,2,3}, Michele Morris³, Paul M Heider⁴, Sunyang Fu⁵, Min Ji Kwak⁶, Andrew Wen⁵, Joseph R Applegate⁵, Liwei Wang⁵, Elmer Bernstam^{5,7}, Hongfang Liu⁵, Jack Chang⁸, Daniel R Harris⁹, Alexandria Corbeau⁹, Darren Henderson⁹, John Osborne¹⁰, Richard E Kennedy¹¹, Nelly-Estefanie Garduno-Rapp¹², Justin F Rousseau^{12,13}, Chao Yan¹⁴, You Chen¹⁴, Mayur B Patel¹⁵, Tyler J Murphy¹⁵, Bradley A Malin¹⁴, Chan Mi Park¹⁶, Jungwei W Fan^{17,18}, Sunghwan Sohn¹⁷, Sandeep Pagali¹⁹, Yifan Peng^{20,21}, Aman Pathak²², Yonghui Wu²², Zongqi Xia²³, Salvatore Loguercio²⁴, Steven E Reis¹, Shyam Visweswaran^{1,3}

Affiliations

¹ Clinical and Translational Science Institute, University of Pittsburgh, Pittsburgh, PA, USA.
² Department of Health Information Management, University of Pittsburgh, Pittsburgh, PA, USA.
³ Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA.
⁴ Biomedical Informatics Center and Department of Public Health Sciences, Medical University of South Carolina, Charleston, SC, USA.
⁵ McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA.
⁶ McGovern Medical School, University of Texas Health Science Center at Houston, Houston, TX, USA.
⁷ Division of General Internal Medicine, McGovern Medical School, University of Texas Health Science Center at Houston, Houston, TX, USA.
⁸ Clinical and Translational Science Institute, University of Rochester Medical Center, Rochester, NY, USA.
⁹ Institute for Biomedical Informatics, University of Kentucky, Lexington, KY, USA.
¹⁰ Department of Biomedical Informatics and Data Science, University of Alabama at Birmingham, Birmingham, AL, USA.
¹¹ Division of Gerontology, Geriatrics, and Palliative Care, Department of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA.
¹² Clinical Informatics Center, University of Texas Southwestern Medical Center, Dallas, TX, USA.
¹³ Department of Neurology, University of Texas Southwestern Medical Center, Dallas, TX, USA.
¹⁴ Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA.
¹⁵ Department of Surgery, Vanderbilt University Medical Center, Nashville, TN, USA.
¹⁶ Department of Gerontology, Hebrew SeniorLife, Marcus Institute for Aging Research, Boston, MA, USA.
¹⁷ Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN, USA.
¹⁸ Center for Clinical and Translational Science, Mayo Clinic, Rochester, MN, USA.
¹⁹ Department of Medicine, Mayo Clinic, Rochester, MN, USA.
²⁰ Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA.
²¹ Clinical & Translational Science Center, Weill Cornell Medicine, New York, NY, USA.
²² Department of Health Outcomes and Biomedical Informatics, University of Florida, Gainesville, FL, USA.
²³ Department of Neurology, University of Pittsburgh, Pittsburgh, PA, USA.
²⁴ Scripps Research, Scripps Research Translational Institute, La Jolla, CA, USA.

PMID: 40979101
PMCID: PMC12444719
DOI: 10.1017/cts.2025.10116

Yanshan Wang et al. J Clin Transl Sci. 2025 Aug 22;9(1):e199. doi: 10.1017/cts.2025.10116. eCollection 2025.

Abstract

Objective: Electronic Health Record (EHR) data are critical for advancing translational research and AI technologies. The ENACT network offers access to structured EHR data across 57 CTSA hubs. However, substantial information is contained in clinical narratives, requiring natural language processing (NLP) for research. The ENACT NLP Working Group was formed to make NLP-derived clinical information accessible and queryable across the network.

Methods: We established the ENACT NLP Working Group with 13 sites selected based on criteria including clinical notes access, IT infrastructure, NLP expertise, and institutional support. We divided sites into five focus groups targeting clinical tasks within disease contexts. Each focus group consisted of two development sites and two validation sites. We extended the ENACT ontology to standardize NLP-derived data and conducted multisite evaluations using the Open Health Natural Language Processing (OHNLP) Toolkit.

Results: The working group achieved 100% site retention and deployed NLP infrastructure across all sites. We developed and validated NLP algorithms for rare disease phenotyping, social determinants of health, opioid use disorder, sleep phenotyping, and delirium phenotyping. Performance varied across sites (F1 scores 0.53-0.96), highlighting the impact of data heterogeneity. We extended the ENACT common data model and ontology to incorporate NLP-derived data while maintaining Shared Health Research Informatics NEtwork (SHRINE) compatibility.

Conclusion: This work demonstrates the feasibility of deploying NLP infrastructure across large, federated networks. The focus group approach proved more practical than general-purpose approaches. Key lessons include the challenge of data heterogeneity and the importance of collaborative governance. This work also provides a foundation that other networks can build on to implement NLP capabilities for translational research.

Keywords: ENACT; Translational research; electronic health records; natural language processing; network.
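
For readers less familiar with the metric reported in the Results, the F1 score is the harmonic mean of precision and recall. The sketch below is illustrative only: the site names and the true-positive, false-positive, and false-negative counts are hypothetical values chosen for the example, not data from the study, and the code is not the working group's evaluation pipeline.

```python
from dataclasses import dataclass


@dataclass
class SiteCounts:
    """Confusion counts for one site (hypothetical values, not study data)."""
    tp: int  # true positives: mentions the algorithm found that annotators also marked
    fp: int  # false positives: mentions the algorithm found that annotators did not mark
    fn: int  # false negatives: annotated mentions the algorithm missed


def f1_score(c: SiteCounts) -> float:
    """Return the harmonic mean of precision and recall, or 0.0 when undefined."""
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Two made-up sites running the same algorithm on different note collections.
sites = {
    "site_a": SiteCounts(tp=90, fp=5, fn=5),
    "site_b": SiteCounts(tp=50, fp=40, fn=50),
}

for name, counts in sites.items():
    print(f"{name}: F1 = {f1_score(counts):.2f}")
```

Computing the score per site rather than pooling all sites keeps cross-site variation visible, which is the kind of variation the reported F1 range reflects.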

Conflict of interest statement

No competing interests were declared.

Figures

**Figure 1.**
Participating sites in the Evolve to Next-Gen Accrual to Clinical Trials (ENACT) network natural language processing (NLP) working group.

**Figure 2.**
An overview of the ENACT NLP workflow. *SHRINE = Shared Health Research Informatics Network.
