Review. Pharmacoepidemiol Drug Saf. 2022 Sep;31(9):932-943.
doi: 10.1002/pds.5500. Epub 2022 Jul 5.

Machine learning for improving high-dimensional proxy confounder adjustment in healthcare database studies: An overview of the current literature

Richard Wyss et al. Pharmacoepidemiol Drug Saf. 2022 Sep.

Abstract

Purpose: Supplementing investigator-specified variables with large numbers of empirically identified features that collectively serve as 'proxies' for unspecified or unmeasured factors can often improve confounding control in studies utilizing administrative healthcare databases. Consequently, there has been a recent focus on the development of data-driven methods for high-dimensional proxy confounder adjustment in pharmacoepidemiologic research. In this paper, we survey current approaches and recent advancements for high-dimensional proxy confounder adjustment in healthcare database studies.

Methods: We discuss considerations underpinning three areas of high-dimensional proxy confounder adjustment: (1) feature generation, that is, transforming raw data into covariates (or features) to be used for proxy adjustment; (2) covariate prioritization, selection, and adjustment; and (3) diagnostic assessment. We discuss challenges and avenues of future development within each area.
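As a rough illustration of areas (1) and (2), the sketch below turns raw coded records into binary indicator features and then ranks them by a simple marginal score. Everything in it is an assumption made for illustration: the toy records, the code labels, the function names, and the choice of score (absolute difference in feature prevalence between treated and untreated patients). It is not the procedure of any specific method surveyed in the paper.

```python
from collections import defaultdict

# Hypothetical toy data: one (patient_id, code) row per recorded event.
claims = [
    (1, "dx:I10"), (1, "rx:statin"),
    (2, "dx:I10"),
    (3, "rx:statin"), (3, "dx:E11"),
    (4, "dx:E11"),
]
# Hypothetical treatment indicator per patient (1 = treated, 0 = untreated).
treated = {1: 1, 2: 1, 3: 0, 4: 0}


def generate_features(rows, patient_ids):
    """Feature generation: one binary 'code ever recorded' indicator per code."""
    codes = sorted({code for _, code in rows})
    seen = defaultdict(set)
    for pid, code in rows:
        seen[pid].add(code)
    return {pid: {c: int(c in seen[pid]) for c in codes} for pid in patient_ids}


def prevalence_difference(features, treatment):
    """Score each feature by |prevalence in treated - prevalence in untreated|."""
    groups = {0: [], 1: []}
    for pid, row in features.items():
        groups[treatment[pid]].append(row)
    codes = list(next(iter(features.values())))
    scores = {}
    for c in codes:
        p1 = sum(r[c] for r in groups[1]) / len(groups[1])
        p0 = sum(r[c] for r in groups[0]) / len(groups[0])
        scores[c] = abs(p1 - p0)
    return scores


features = generate_features(claims, treated.keys())
scores = prevalence_difference(features, treated)
# Prioritization: highest score first, ties broken alphabetically.
ranked = sorted(scores, key=lambda c: (-scores[c], c))
print(ranked)
```

A score based only on association with treatment is deliberately simplistic; in practice the prioritization step would also consider the outcome, and purely marginal associations can be misleading.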

Results: There is a large literature on methods for high-dimensional confounder prioritization/selection, but relatively little has been written on best practices for feature generation and diagnostic assessment. Consequently, these areas have particular limitations and challenges.

Conclusions: There is a growing body of evidence showing that machine-learning algorithms for high-dimensional proxy-confounder adjustment can supplement investigator-specified variables to improve confounding control compared to adjustment based on investigator-specified variables alone. However, more research is needed on best practices for feature generation and diagnostic assessment when applying methods for high-dimensional proxy confounder adjustment in pharmacoepidemiologic studies.

Keywords: causal inference; confounding; machine learning.


Conflict of interest statement

Robert W. Platt has consulted for Amgen, Biogen, Merck, Nant Pharma, and Pfizer. Dimitri Bennett is an employee of Takeda. Grammati Sari is employed by Visible Analytics Ltd. Hongbo Yuan is an employee of CADTH. Andrew R. Zullo receives research grant funding from Sanofi Pasteur to support research on infections and vaccinations in nursing homes unrelated to this manuscript. Mugdha Gokhale is a full‐time employee of Merck and owns stocks in Merck. Elisabetta Patorno is supported by a career development grant K08AG055670 from the National Institute on Aging. She is a researcher on a researcher‐initiated grant to the Brigham and Women's Hospital from Boehringer Ingelheim, not directly related to the topic of the submitted work.

Figures

FIGURE 1
Illustration and examples for ‘proxy confounder’ adjustment.
FIGURE 2
Different phases for high‐dimensional proxy confounder adjustment.
FIGURE 3
Causal diagram illustrating one scenario where the use of marginal empirical associations for confounder selection can result in over‐adjusting for instrumental variables. In this causal structure, X2 is marginally associated with both treatment and outcome, but is independent of the outcome after conditioning on X1.
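
The scenario in the Figure 3 caption can be checked numerically with a small exact calculation. The sketch below assumes one causal structure that is consistent with the caption (X1 affects X2 and the outcome Y, X2 affects treatment A, and neither X2 nor A affects Y); the actual diagram may differ, and all probabilities and names are hypothetical values chosen for illustration.

```python
from itertools import product

# Hypothetical probabilities, chosen only for illustration.
P_X1 = 0.5                          # P(X1 = 1)
P_X2_GIVEN_X1 = {0: 0.2, 1: 0.8}    # assumed arrow X1 -> X2
P_A_GIVEN_X2 = {0: 0.3, 1: 0.7}     # assumed arrow X2 -> treatment A
P_Y_GIVEN_X1 = {0: 0.2, 1: 0.7}     # assumed arrow X1 -> outcome Y only

IDX = {"x1": 0, "x2": 1, "a": 2, "y": 3}


def bern(p, v):
    """Probability that a Bernoulli(p) variable takes value v."""
    return p if v == 1 else 1.0 - p


def joint():
    """Exact joint distribution over (x1, x2, a, y) under the assumed structure."""
    table = {}
    for x1, x2, a, y in product((0, 1), repeat=4):
        table[(x1, x2, a, y)] = (
            bern(P_X1, x1)
            * bern(P_X2_GIVEN_X1[x1], x2)
            * bern(P_A_GIVEN_X2[x2], a)
            * bern(P_Y_GIVEN_X1[x1], y)
        )
    return table


def prob(table, target, given):
    """P(target = 1 | given), where given maps variable names to values."""
    num = den = 0.0
    for cell, p in table.items():
        if all(cell[IDX[k]] == v for k, v in given.items()):
            den += p
            if cell[IDX[target]] == 1:
                num += p
    return num / den


def risk_difference(table, target, var, stratum=None):
    """P(target=1 | var=1, stratum) - P(target=1 | var=0, stratum)."""
    stratum = stratum or {}
    return (prob(table, target, {var: 1, **stratum})
            - prob(table, target, {var: 0, **stratum}))


table = joint()
print("X2 vs treatment:", round(risk_difference(table, "a", "x2"), 3))
print("X2 vs outcome, marginal:", round(risk_difference(table, "y", "x2"), 3))
for x1 in (0, 1):
    rd = risk_difference(table, "y", "x2", {"x1": x1})
    print(f"X2 vs outcome, X1={x1}:", round(abs(rd), 3))
```

Under these assumed values, X2 shows a marginal risk difference of about 0.4 with treatment and about 0.3 with the outcome, yet the association with the outcome vanishes within each stratum of X1. A selection rule that looks only at marginal associations would therefore flag X2 as a confounder even though, given X1, it behaves like an instrument.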
