Stat Med. 2024 Dec 30;43(30):6086-6098.
doi: 10.1002/sim.10298. Epub 2024 Dec 5.

Double Sampling for Informatively Missing Data in Electronic Health Record-Based Comparative Effectiveness Research

Alexander W Levis et al. Stat Med.

Abstract

Missing data arise in most applied settings and are ubiquitous in electronic health records (EHR). When data are missing not at random (MNAR) with respect to measured covariates, sensitivity analyses are often considered. These solutions, however, are often unsatisfying in that they are not guaranteed to yield actionable conclusions. Motivated by an EHR-based study of long-term outcomes following bariatric surgery, we consider the use of double sampling as a means to mitigate MNAR outcome data when the statistical goals are estimation and inference regarding causal effects. We describe assumptions that are sufficient for the identification of the joint distribution of confounders, treatment, and outcome under this design. Additionally, we derive efficient and robust estimators of the average causal treatment effect under a nonparametric model and under a model assuming outcomes were, in fact, initially missing at random (MAR). We compare these in simulations to an approach that adaptively estimates based on evidence of violation of the MAR assumption. Finally, we also show that the proposed double sampling design can be extended to handle arbitrary coarsening mechanisms, and derive nonparametric efficient estimators of any smooth full data functional.

Keywords: causal inference; double sampling; missing data; semiparametric theory; study design.
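
To make the double sampling idea concrete, the following is a minimal simulation sketch, not the efficient estimators derived in the paper. It assumes a single confounder, a known treatment propensity, and a follow-up sample drawn with a known constant probability among subjects whose outcome is initially missing; all variable names, coefficients, and the simple weighted IPW contrast are illustrative assumptions. Outcomes recovered in the follow-up sample are up-weighted by the inverse of the sampling probability, which removes the bias of a complete-case analysis when missingness depends on the outcome itself.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def hajek_ate(y, a, ps, w):
    """Normalized (Hajek) IPW contrast with an extra observation weight w.

    w must be 0 wherever y is not observed, so those entries never contribute.
    """
    w1 = w * a / ps
    w0 = w * (1 - a) / (1 - ps)
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data-generating process (all coefficients are assumptions).
L = rng.normal(size=n)                      # confounder
ps = expit(0.5 * L)                         # treatment propensity, treated as known
A = rng.binomial(1, ps)                     # treatment
Y = 1.0 + 2.0 * A + L + rng.normal(size=n)  # outcome; true ATE = 2

# MNAR: the chance of observing Y in the EHR depends on Y itself.
R = rng.binomial(1, expit(2.0 - Y))

# Double sampling: follow up a random 30% of subjects with R = 0.
p_follow = 0.3
S = (1 - R) * rng.binomial(1, p_follow, size=n)

# Outcomes are available only when R = 1 or S = 1; mask the rest.
y_obs = np.where((R == 1) | (S == 1), Y, 0.0)

# Complete-case analysis: uses R = 1 only and is biased under MNAR.
ate_complete_case = hajek_ate(y_obs, A, ps, R)

# Double sampling analysis: follow-up subjects stand in for all R = 0 subjects.
w_ds = R + (1 - R) * S / p_follow
ate_double_sampling = hajek_ate(y_obs, A, ps, w_ds)

print(f"complete-case ATE:   {ate_complete_case:.3f}")
print(f"double-sampling ATE: {ate_double_sampling:.3f}")
```

In this sketch the observation weight has conditional expectation one given the full data, which is why the weighted contrast targets the same quantity as an analysis with no missing outcomes. In practice the propensity and the sampling probabilities would typically depend on covariates and would need to be estimated or taken from the design.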

Conflict of interest statement

The authors declare no conflicts of interest.

Figures

FIGURE 1
Schematic illustrating the double sampling study design. Gray boxes indicate subgroups for which the outcome is observed: either subjects are observed in the EHR directly (R=1) or are selected in the follow‐up sample (R=0 and S=1).
FIGURE 2
Simulation results for experiments of Section 5.1, with (a) no violation of MAR ($\beta_{RA} = 0$); (b) a moderate violation of MAR ($\beta_{RA} = 0.016$); and (c) a large violation of MAR ($\beta_{RA} = 0.032$). S#1 = Strategy #1, augmented IPW‐based estimator assuming MAR, $\hat{\xi}_1 - \hat{\xi}_0$; S#1‐OR = outcome regression‐based estimator assuming MAR; S#1‐IPW = IPW‐based estimator assuming MAR; S#2 = Strategy #2, influence‐function based estimator using double sampling, $\hat{\tau}_1 - \hat{\tau}_0$; S#2‐PW = S#2 but with $\hat{\pi}_a$ misspecified; S#2‐MW = S#2‐DS but with $(\hat{\mu}_{a,S}, \hat{\mu}_{a,R}, \hat{\gamma}_a)$ misspecified; S#2‐BW = S#2 but with both $(\hat{\mu}_{a,S}, \hat{\mu}_{a,R}, \hat{\gamma}_a)$ and $\hat{\pi}_a$ misspecified; S#3 = Strategy #3, semiparametric efficient estimator under MAR, $\hat{\tau}_1 - \hat{\tau}_0$. ATE = average treatment effect; the red dashed line indicates the true ATE.
FIGURE 3
Simulation results for experiments of Section 5.2, comparing Strategy #2 of the nonparametric influence function‐based estimator ($\hat{\tau}_1 - \hat{\tau}_0$), Strategy #3 of the semiparametric efficient estimator under MAR ($\hat{\tau}_1 - \hat{\tau}_0$), Strategy #4 of the ad hoc approach described in Section 5.1, and Strategy #5 of the adaptive estimator of Rothenhäusler [32] ($\hat{\tau}_1 - \hat{\tau}_0$). Subplot (a) shows the empirical mean squared error (MSE) of each approach, divided by the MSE of the nonparametric efficient estimator. Subplot (b) shows the empirical variance of each approach, divided by the variance of the nonparametric efficient estimator. Subplot (c) shows the empirical bias of each approach.
FIGURE 4
Results for the data application in Section 6, showing the estimated effect on percent total weight change at 3 years. Subplots (a), (b), and (c) correspond to follow‐up subsamples of size 500, 1000, and 1500 individuals. Blue and red lines refer to estimated 95% confidence intervals when outcomes are initially MAR and MNAR, respectively. Point estimates are marked with black dots. Strategy #1 = augmented‐IPW estimator with incomplete data only, assuming MAR; Strategy #2 = influence‐function based estimator using double sampling, $\hat{\tau}_1 - \hat{\tau}_0$; Strategy #3 = semiparametric efficient estimator under MAR, $\hat{\tau}_1 - \hat{\tau}_0$; Strategy #5 = adaptive estimator of Rothenhäusler [32], $\hat{\tau}_1 - \hat{\tau}_0$. The black dashed line represents the point estimate for the benchmark full data analysis.
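
The two gray boxes in the Figure 1 schematic correspond to the two terms of a standard total-expectation decomposition. As a sketch only, suppose that selection into the follow-up sample is independent of the outcome given confounders, treatment, and initial missingness; this condition is an assumption of the illustration and may differ from the exact conditions stated in the paper. Writing L, A, and Y for confounders, treatment, and outcome (illustrative notation), the conditional outcome mean can then be expressed through observable quantities:

```latex
\begin{align*}
E[Y \mid L, A = a]
  &= P(R = 1 \mid L, A = a)\, E[Y \mid L, A = a, R = 1] \\
  &\quad + P(R = 0 \mid L, A = a)\, E[Y \mid L, A = a, R = 0, S = 1].
\end{align*}
```

The first term uses subjects whose outcome is observed directly in the EHR, and the second replaces the unobservable mean among all subjects with R = 0 by the mean among those selected for follow-up.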

References

    1. Seaman S. R. and White I. R., “Review of Inverse Probability Weighting for Dealing With Missing Data,” Statistical Methods in Medical Research 22, no. 3 (2013): 278–295.
    2. Rubin D. B., Multiple Imputation for Nonresponse in Surveys (Hoboken, NJ: John Wiley & Sons, 2004).
    3. Robins J. M., Rotnitzky A., and Zhao L. P., “Estimation of Regression Coefficients When Some Regressors Are Not Always Observed,” Journal of the American Statistical Association 89, no. 427 (1994): 846–866.
    4. Tsiatis A., Semiparametric Theory and Missing Data (New York, NY: Springer, 2007).
    5. Rubin D. B., “Inference and Missing Data,” Biometrika 63, no. 3 (1976): 581–592.
