Double Sampling for Informatively Missing Data in Electronic Health Record-Based Comparative Effectiveness Research

Alexander W Levis¹, Rajarshi Mukherjee², Rui Wang²,³, Heidi Fischer⁴, Sebastien Haneuse²

Affiliations

¹ Department of Statistics & Data Science, Carnegie Mellon University, Pittsburgh, Pennsylvania.
² Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, Massachusetts.
³ Department of Population Medicine, Harvard Pilgrim Health Care Institute and Harvard Medical School, Boston, Massachusetts.
⁴ Department of Research and Evaluation, Kaiser Permanente, Pasadena, California, USA.

PMID: 39638313
PMCID: PMC11639654
DOI: 10.1002/sim.10298

Alexander W Levis et al. Stat Med. 2024 Dec 30;43(30):6086-6098. doi: 10.1002/sim.10298. Epub 2024 Dec 5.

Abstract

Missing data arise in most applied settings and are ubiquitous in electronic health records (EHR). When data are missing not at random (MNAR) with respect to measured covariates, sensitivity analyses are often considered. These solutions, however, are often unsatisfying in that they are not guaranteed to yield actionable conclusions. Motivated by an EHR-based study of long-term outcomes following bariatric surgery, we consider the use of double sampling as a means to mitigate MNAR outcome data when the statistical goals are estimation and inference regarding causal effects. We describe assumptions that are sufficient for the identification of the joint distribution of confounders, treatment, and outcome under this design. Additionally, we derive efficient and robust estimators of the average causal treatment effect under a nonparametric model and under a model assuming outcomes were, in fact, initially missing at random (MAR). We compare these in simulations to an approach that adaptively estimates based on evidence of violation of the MAR assumption. Finally, we also show that the proposed double sampling design can be extended to handle arbitrary coarsening mechanisms, and derive nonparametric efficient estimators of any smooth full data functional.

Keywords: causal inference; double sampling; missing data; semiparametric theory; study design.
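
To make the design concrete, the following is a minimal toy sketch, not the estimators derived in the paper. It assumes a single outcome with no treatment or confounders, an indicator `r` for whether the outcome is recorded in the EHR, and a follow-up indicator `s` drawn among subjects with `r == 0` using a known selection probability `follow_up_prob`. All names and parameter values are hypothetical. It contrasts a complete-case mean with a classical inverse-probability-weighted double sampling mean when missingness depends on the outcome itself.

```python
import math
import random


def simulate_double_sampling(n=20000, follow_up_prob=0.3, seed=0):
    """Simulate (y, r, s) triples under an assumed MNAR mechanism.

    y: outcome with true mean 0 (hypothetical data-generating model)
    r: 1 if the outcome is observed in the EHR, else 0
    s: 1 if an initially missing subject is selected for follow-up, else 0
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        y = rng.gauss(0.0, 1.0)
        # MNAR: the chance of being observed depends on the outcome itself.
        p_observed = 1.0 / (1.0 + math.exp(-(0.5 + y)))
        r = 1 if rng.random() < p_observed else 0
        # Follow-up sampling happens only among subjects with r == 0,
        # with a selection probability fixed by design.
        s = 1 if (r == 0 and rng.random() < follow_up_prob) else 0
        rows.append((y, r, s))
    return rows


def complete_case_mean(rows):
    """Mean of outcomes observed in the EHR only (ignores missingness)."""
    observed = [y for y, r, _ in rows if r == 1]
    return sum(observed) / len(observed)


def double_sampling_mean(rows, follow_up_prob):
    """Weighted mean: weight 1 if r == 1, 1/follow_up_prob if followed up."""
    total = 0.0
    for y, r, s in rows:
        if r == 1:
            total += y
        elif s == 1:
            total += y / follow_up_prob
        # Subjects with r == 0 and s == 0 contribute nothing directly;
        # the followed-up subjects are up-weighted to represent them.
    return total / len(rows)


if __name__ == "__main__":
    data = simulate_double_sampling()
    print("complete-case mean:  ", complete_case_mean(data))
    print("double sampling mean:", double_sampling_mean(data, 0.3))
```

In this toy setup the complete-case mean is pulled away from the true value of 0 because subjects with larger outcomes are more likely to be observed, while the weighted mean uses the follow-up subsample to represent everyone who was initially missing. The paper's estimators additionally handle confounders, treatment, and estimated nuisance functions, none of which are modeled here.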

Conflict of interest statement

The authors declare no conflicts of interest.

Figures

**FIGURE 1**
Schematic illustrating the double sampling study design. Gray boxes indicate subgroups for which the outcome is observed: either subjects are observed in the EHR directly ($R = 1$) or are selected in the follow-up sample ($R = 0$ and $S = 1$).

**FIGURE 2**
Simulation results for experiments of Section 5.1, with (a) no violation of MAR ($\beta_{RA} = 0$); (b) a moderate violation of MAR ($\beta_{RA} = 0.016$); and (c) a large violation of MAR ($\beta_{RA} = 0.032$). S#1 = Strategy #1, augmented IPW-based estimator assuming MAR, $\hat{\xi}_{1} - \hat{\xi}_{0}$; S#1-OR = outcome regression-based estimator assuming MAR; S#1-IPW = IPW-based estimator assuming MAR; S#2 = Strategy #2, influence-function based estimator using double sampling, $\hat{\tau}_{1} - \hat{\tau}_{0}$; S#2-PW = S#2 but with $\hat{\pi}_{a}$ misspecified; S#2-MW = S#2-DS but with $(\hat{\mu}_{a,S}, \hat{\mu}_{a,R}, \hat{\gamma}_{a})$ misspecified; S#2-BW = S#2 but with both $(\hat{\mu}_{a,S}, \hat{\mu}_{a,R}, \hat{\gamma}_{a})$ and $\hat{\pi}_{a}$ misspecified; S#3 = Strategy #3, semiparametric efficient estimator under MAR, $\hat{\tau}_{1}^{*} - \hat{\tau}_{0}^{*}$. ATE = average treatment effect; the red dashed line indicates the true ATE.

**FIGURE 3**
Simulation results for experiments of Section 5.2, comparing Strategy #2 of the nonparametric influence function-based estimator ($\hat{\tau}_{1} - \hat{\tau}_{0}$), Strategy #3 of the semiparametric efficient estimator under MAR ($\hat{\tau}_{1}^{*} - \hat{\tau}_{0}^{*}$), Strategy #4 of the ad hoc approach described in Section 5.1, and Strategy #5 of the adaptive estimator of Rothenhäusler [32] ($\hat{\tau}_{1}^{\dagger} - \hat{\tau}_{0}^{\dagger}$). Subplot (a) shows the empirical mean squared error (MSE) of each approach, divided by the MSE of the nonparametric efficient estimator. Subplot (b) shows the empirical variance of each approach, divided by the variance of the nonparametric efficient estimator. Subplot (c) shows the empirical bias of each approach.

**FIGURE 4**
Results for data application in Section 6, showing the estimated effect on percent total weight change at 3 years. Subplots (a), (b), and (c) correspond to follow-up subsamples of size 500, 1000, and 1500 individuals. Blue and red lines refer to estimated 95% confidence intervals when outcomes are initially MAR, and MNAR, respectively. Point estimates are marked with black dots. Strategy #1 = augmented-IPW estimator with incomplete data only, assuming MAR; Strategy #2 = influence-function based estimator using double sampling, $\hat{\tau}_{1} - \hat{\tau}_{0}$; Strategy #3 = semiparametric efficient estimator under MAR, $\hat{\tau}_{1}^{*} - \hat{\tau}_{0}^{*}$; Strategy #5 = adaptive estimator of Rothenhäusler [32], $\hat{\tau}_{1}^{\dagger} - \hat{\tau}_{0}^{\dagger}$. The black dashed line represents the point estimate for the benchmark full data analysis.
