Natural language processing for scalable feature engineering and ultra-high-dimensional confounding adjustment in healthcare database studies

Affiliations

¹ Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA. Electronic address: rwyss@bwh.harvard.edu.
² Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA.
³ Division of General Internal Medicine, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA.
⁴ Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA; Department of Medicine, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA. Electronic address: jklin@bwh.harvard.edu.

PMID: 40691893
DOI: 10.1016/j.jbi.2025.104882

Richard Wyss et al. J Biomed Inform. 2025 Jul 19;169:104882. doi: 10.1016/j.jbi.2025.104882. Online ahead of print.

Abstract

Background: To improve confounding control in healthcare database studies, data-driven algorithms may empirically identify and adjust for large numbers of pre-exposure variables that indirectly capture information on unmeasured confounding factors ('proxy' confounders). Current approaches for high-dimensional proxy adjustment do not leverage free-text notes from electronic health records (EHRs). Unsupervised natural language processing (NLP) technology can scale to generate large numbers of structured features from unstructured notes.

Objective: To assess the impact of supplementing claims data analyses with large numbers of NLP-generated features for high-dimensional proxy adjustment.

Methods: We linked Medicare claims with EHR data to generate three cohorts comparing different classes of medications on the 6-month risk of cardiovascular outcomes. We used various NLP methods to generate structured features from free-text EHR notes and used least absolute shrinkage and selection operator (LASSO) regression to fit several propensity score (PS) models that included different covariate sets as candidate predictors. Covariate sets included features generated from claims data only, and claims data plus NLP-generated EHR features.
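The abstract does not say which NLP methods or software were used, so the following is only a minimal sketch of the general idea, under loud assumptions: a simple bag-of-words featurizer stands in for the unspecified NLP methods, the toy notes, the single claims flag, and the treatment vector are invented for illustration, and the LASSO logistic model is fitted with a plain proximal gradient loop rather than any particular package. All function names (`bag_of_words`, `fit_lasso_logistic`, `predict_one`) are hypothetical.

```python
import math
import re


def bag_of_words(notes, min_count=2):
    """Binary term-presence features for terms seen in >= min_count notes."""
    tokenized = [set(re.findall(r"[a-z]+", note.lower())) for note in notes]
    counts = {}
    for tokens in tokenized:
        for tok in tokens:
            counts[tok] = counts.get(tok, 0) + 1
    vocab = sorted(t for t, c in counts.items() if c >= min_count)
    rows = [[1.0 if t in tokens else 0.0 for t in vocab] for tokens in tokenized]
    return vocab, rows


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_one(intercept, coefs, x):
    """Estimated propensity score for one covariate row."""
    return sigmoid(intercept + sum(c * v for c, v in zip(coefs, x)))


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink toward zero."""
    return math.copysign(max(abs(value) - threshold, 0.0), value)


def fit_lasso_logistic(X, y, lam=0.05, lr=0.1, n_iter=500):
    """L1-penalized logistic regression via proximal gradient descent.

    The intercept is left unpenalized; lam, lr, and n_iter are
    illustrative defaults, not values taken from the study.
    """
    n, p = len(X), len(X[0])
    intercept, coefs = 0.0, [0.0] * p
    for _ in range(n_iter):
        resid = [predict_one(intercept, coefs, x) - yi for x, yi in zip(X, y)]
        intercept -= lr * sum(resid) / n
        for j in range(p):
            grad = sum(r * x[j] for r, x in zip(resid, X)) / n
            coefs[j] = soft_threshold(coefs[j] - lr * grad, lr * lam)
    return intercept, coefs


# Toy data (invented): free-text notes, one claims-derived flag, treatment.
notes = [
    "patient frail with frequent falls",
    "frail, uses walker, lives alone",
    "active, exercises daily",
    "reports smoking, frail appearance",
    "active and independent",
    "exercises daily, no complaints",
    "frequent falls, uses walker",
    "active, no complaints",
]
claims = [[1.0], [1.0], [0.0], [1.0], [0.0], [1.0], [0.0], [0.0]]
treated = [0, 0, 1, 0, 1, 1, 0, 1]

vocab, nlp_rows = bag_of_words(notes)
X_claims = claims
X_both = [c + r for c, r in zip(claims, nlp_rows)]  # claims + NLP features

for label, X in (("claims only", X_claims), ("claims + NLP", X_both)):
    b0, w = fit_lasso_logistic(X, treated)
    n_selected = sum(1 for c in w if c != 0.0)
    print(label, "- candidates:", len(w), "selected:", n_selected)
```

The two fits mirror the comparison described above: the same LASSO procedure is run once with claims-derived candidates only and once with the NLP-derived columns appended, so any difference in the fitted propensity scores comes from the added text features.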

Results: Including both claims codes and NLP-generated EHR features as candidate predictors improved overall covariate balance, with standardized differences < 0.1 for all variables. While overall balance improved, the impact on estimated treatment effects was more nuanced: adjustment for NLP-generated features moved effect estimates further in the expected direction in two of the empirical studies but had no impact in the third.
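For readers unfamiliar with the balance metric, the sketch below computes an absolute standardized difference for a single covariate using the common pooled-standard-deviation definition. The abstract does not state how the authors applied the propensity score or computed balance, so the optional `weights` argument (which could hold, for example, propensity score weights) and the toy data are assumptions for illustration only.

```python
import math


def standardized_difference(x, treated, weights=None):
    """Absolute standardized mean difference of covariate x between groups.

    weights is optional (hypothetical use: propensity score weights);
    when omitted, every patient counts equally.
    """
    if weights is None:
        weights = [1.0] * len(x)

    def weighted_mean_var(group):
        idx = [i for i, t in enumerate(treated) if t == group]
        wsum = sum(weights[i] for i in idx)
        mean = sum(weights[i] * x[i] for i in idx) / wsum
        var = sum(weights[i] * (x[i] - mean) ** 2 for i in idx) / wsum
        return mean, var

    m1, v1 = weighted_mean_var(1)
    m0, v0 = weighted_mean_var(0)
    pooled_sd = math.sqrt((v1 + v0) / 2.0)
    if pooled_sd == 0.0:
        # No variation in either group: balanced only if the means agree.
        return 0.0 if m1 == m0 else math.inf
    return abs(m1 - m0) / pooled_sd


# Toy example (invented): a binary covariate in 4 treated and 4 untreated.
covariate = [1, 1, 1, 0, 1, 0, 0, 0]
treatment = [1, 1, 1, 1, 0, 0, 0, 0]
smd = standardized_difference(covariate, treatment)
print("balanced at the 0.1 threshold:", smd < 0.1)
```

A value below 0.1 is the threshold referred to in the results above; in practice the function would be applied to every candidate covariate before and after adjustment.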

Conclusion: Supplementing administrative claims with large numbers of NLP-generated features for ultra-high-dimensional proxy confounder adjustment improved overall covariate balance and may provide a modest benefit in terms of capturing confounder information.

Keywords: Causal inference; Confounding; Electronic health records; Natural language processing.

Conflict of interest statement

Declaration of competing interest: The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Schneeweiss is participating in investigator-initiated grants to the Brigham and Women’s Hospital from Bayer, Vertex, and Boehringer Ingelheim unrelated to the topic of this study. He is a consultant to Aetion Inc., a software manufacturer of which he owns equity. His interests were declared, reviewed, and approved by the Brigham and Women’s Hospital and Partners HealthCare System in accordance with their institutional compliance policies. The remaining authors have no conflicts of interest to declare.

Update of

Natural language processing for scalable feature engineering and ultra-high-dimensional confounding adjustment in healthcare database studies.
Wyss R, Yang J, Schneeweiss S, Plasek JM, Zhou L, Deramus T, Weberpals JG, Ngan K, Tsacogianis TN, Lin KJ. medRxiv [Preprint]. 2025 Jan 31:2025.01.30.25321403. doi: 10.1101/2025.01.30.25321403. Update in: J Biomed Inform. 2025 Jul 19;169:104882. doi: 10.1016/j.jbi.2025.104882. PMID: 39974094. Free PMC article.
