Cross-institution natural language processing for reliable clinical association studies: a methodological exploration
- PMID: 38219811
- DOI: 10.1016/j.jclinepi.2024.111258
Abstract
Objectives: Natural language processing (NLP) of clinical notes in electronic medical records is increasingly used to extract otherwise sparsely available patient characteristics and to assess their association with relevant health outcomes. Manual data curation is resource-intensive, and NLP methods make these studies more feasible. However, the methodology for using NLP reliably in clinical research is understudied. The objective of this study is to investigate how NLP models could be used to extract study variables (specifically exposures) to reliably conduct exposure-outcome association studies.
Study design and setting: In a convenience sample of patients admitted to the intensive care unit of a US academic health system, multiple association studies were conducted, comparing the association estimates based on NLP-extracted vs. manually extracted exposure variables. The association studies varied in NLP model architecture (Bidirectional Encoder Representations from Transformers, Long Short-Term Memory), training paradigm (training a new model, fine-tuning an existing external model), extracted exposures (employment status, living status, and substance use), health outcomes (having a do-not-resuscitate/intubate code, length of stay, and in-hospital mortality), missing data handling (multiple imputation vs. complete case analysis), and the application of measurement error correction (via regression calibration).
Results: The study was conducted on 1,174 participants (median [interquartile range] age, 61 [50, 73] years; 60.6% male). Additionally, up to 500 discharge reports of participants from the same health system and 2,528 reports of participants from an external health system were used to train the NLP models. Substantial differences were found between the associations based on NLP-extracted and manually extracted exposures under all settings. The error in association was only weakly correlated with the overall F1 score of the NLP models.
Conclusion: Associations estimated using NLP-extracted exposures should be interpreted with caution. Further research is needed to set conditions for reliable use of NLP in medical association studies.
Keywords: Electronic health records; Medical association study; Natural language processing; Real-world evidence; Research methodology; Social determinants of health.
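Illustrative sketch (not from the study): the abstract reports that associations based on NLP-extracted exposures differed substantially from those based on manual extraction, and that this error was only weakly related to the F1 score. The minimal Python example below shows one well-known mechanism by which this can happen, namely non-differential misclassification of a binary exposure. All counts, the sensitivity, and the specificity are hypothetical values chosen for illustration; the function names are likewise assumptions, and the study's actual models, imputation, and regression calibration are not reproduced here.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table.

    a, b: exposed with / without the outcome
    c, d: unexposed with / without the outcome
    """
    return (a * d) / (b * c)


def misclassify(exposed, unexposed, sensitivity, specificity):
    """Expected counts labelled exposed / unexposed by an imperfect extractor."""
    labelled_exposed = sensitivity * exposed + (1 - specificity) * unexposed
    labelled_unexposed = (1 - sensitivity) * exposed + specificity * unexposed
    return labelled_exposed, labelled_unexposed


def f1_score(exposed, unexposed, sensitivity, specificity):
    """F1 of the extractor for the 'exposed' class, from expected counts."""
    tp = sensitivity * exposed
    fn = (1 - sensitivity) * exposed
    fp = (1 - specificity) * unexposed
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical "manually extracted" 2x2 table (not data from the paper).
a, b, c, d = 60, 140, 80, 720
true_or = odds_ratio(a, b, c, d)

# Hypothetical extractor quality, identical in both outcome groups
# (i.e. non-differential misclassification).
sens, spec = 0.85, 0.95

# Apply the same error rates within the outcome and no-outcome groups.
a_obs, c_obs = misclassify(a, c, sens, spec)
b_obs, d_obs = misclassify(b, d, sens, spec)
nlp_or = odds_ratio(a_obs, b_obs, c_obs, d_obs)

f1 = f1_score(a + b, c + d, sens, spec)

print(f"OR with manual exposure:        {true_or:.2f}")
print(f"OR with NLP-extracted exposure: {nlp_or:.2f}")
print(f"Extractor F1 score:             {f1:.2f}")
```

Under these assumed values, an extractor with an F1 score of about 0.83 shifts the odds ratio from about 3.86 to about 2.94, which is one reason a reasonable F1 score alone need not guarantee a faithful association estimate.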
Copyright © 2024 The Author(s). Published by Elsevier Inc. All rights reserved.
Conflict of interest statement
Declaration of competing interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: M.S., E.S., M.V.S., and A.M.L. report no additional conflicts of interest.