An open natural language processing (NLP) framework for EHR-based clinical research: a case demonstration using the National COVID Cohort Collaborative (N3C)
- PMID: 37555837
- PMCID: PMC10654844
- DOI: 10.1093/jamia/ocad134
Abstract
Despite recent methodological advances in clinical natural language processing (NLP), the adoption of clinical NLP models within the translational research community remains hindered by process heterogeneity and human factor variations. These factors also dramatically increase the difficulty of developing NLP models in multi-site settings, which is necessary for algorithm robustness and generalizability. Here, we report on our experience developing an NLP solution for Coronavirus Disease 2019 (COVID-19) sign and symptom extraction in an open NLP framework from a subset of sites participating in the National COVID Cohort Collaborative (N3C). We then empirically highlight the benefits of multi-site data for both symbolic and statistical methods, and the need for federated annotation and evaluation to resolve several pitfalls encountered in the course of these efforts.
Keywords: electronic health records; federated learning; multi-institutional data annotation; natural language processing.
© The Author(s) 2023. Published by Oxford University Press on behalf of the American Medical Informatics Association.
Conflict of interest statement
MAH has a founding interest in Pryzm Health. HX and The University of Texas Health Science Center at Houston have related financial interests in Melax Technologies Inc.
