J Biomed Inform. 2023 Mar;139:104295.
doi: 10.1016/j.jbi.2023.104295. Epub 2023 Jan 27.

A method for comparing multiple imputation techniques: A case study on the U.S. national COVID cohort collaborative


Elena Casiraghi et al. J Biomed Inform. 2023 Mar.

Abstract

Healthcare datasets obtained from Electronic Health Records have proven to be extremely useful for assessing associations between patients' predictors and outcomes of interest. However, these datasets often suffer from missing values in a high proportion of cases, whose removal may introduce severe bias. Several multiple imputation algorithms have been proposed to attempt to recover the missing information under an assumed missingness mechanism. Each algorithm presents strengths and weaknesses, and there is currently no consensus on which multiple imputation algorithm works best in a given scenario. Furthermore, the selection of each algorithm's parameters and data-related modeling choices are also both crucial and challenging. In this paper we propose a novel framework to numerically evaluate strategies for handling missing data in the context of statistical analysis, with a particular focus on multiple imputation techniques. We demonstrate the feasibility of our approach on a large cohort of type-2 diabetes patients provided by the National COVID Cohort Collaborative (N3C) Enclave, where we explored the influence of various patient characteristics on outcomes related to COVID-19. Our analysis included classic multiple imputation techniques as well as simple complete-case Inverse Probability Weighted models. Extensive experiments show that our approach can effectively highlight the most promising and performant missing-data handling strategy for our case study. Moreover, our methodology allowed a better understanding of the behavior of the different models and of how that behavior changed as we modified their parameters. Our method is general and can be applied across research fields and to datasets containing heterogeneous data types.

Keywords: COVID-19 severity assessment; Clinical informatics; Diabetic patients; Evaluation framework; Multiple Imputation.


Conflict of interest statement

Declaration of Competing Interest The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figures

Figure 1:
The presumed MAR missing data patterns in the Wong et al. [15] dataset.
Figure 2:
Schematic diagram of the pipeline used to obtain pooled estimates when applying a MI strategy. The incomplete dataset is imputed m times, where the value of m can be chosen to maximize the efficiency of the MI estimator (see Appendix A); each imputed dataset is individually processed to compute separate inferences; all the inferences are pooled by Rubin's rule [4] to get the pooled estimates (Q̂), their total variances (V̂AR) and standard errors (ŜE), and their confidence intervals (ĈI). In the figure, we use the superscript j to index the imputation number (j = 1, …, m) and the subscript i to index the predictor variable in the dataset (see Table 2 for a detailed list of all the notations used throughout the paper).
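The Rubin's-rule pooling step described in the caption can be sketched in code. The following is a minimal illustration, not the paper's implementation: it pools m per-imputation point estimates and their variances into a pooled estimate, a total variance (within-imputation plus inflated between-imputation), a standard error, and a t-based confidence interval; the degrees-of-freedom formula follows the classic Rubin (1987) approximation.

```python
import numpy as np
from scipy import stats

def pool_rubin(estimates, variances, alpha=0.05):
    """Pool m per-imputation estimates with Rubin's rules.

    estimates, variances: length-m sequences of the per-imputation
    point estimates Q^(j) and their squared standard errors U^(j).
    Returns (pooled estimate, total variance, SE, confidence interval).
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    q_bar = q.mean()                        # pooled point estimate
    w_bar = u.mean()                        # within-imputation variance
    b = q.var(ddof=1)                       # between-imputation variance
    t_var = w_bar + (1.0 + 1.0 / m) * b    # total variance
    se = np.sqrt(t_var)
    # Degrees of freedom for the t-based interval (Rubin 1987 approximation).
    r = (1.0 + 1.0 / m) * b / w_bar
    df = (m - 1) * (1.0 + 1.0 / r) ** 2
    half = stats.t.ppf(1.0 - alpha / 2.0, df) * se
    return q_bar, t_var, se, (q_bar - half, q_bar + half)
```

For example, pooling estimates [1.0, 1.2, 0.8] with variances [0.04, 0.05, 0.03] gives a pooled estimate of 1.0 and a total variance of 0.04 + (1 + 1/3) × 0.04 ≈ 0.093.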
Figure 3:
Schematic diagram of the pipeline used to evaluate one MI algorithm across A multiple amputation settings. The following steps are applied: 1) listwise deletion is used to produce a complete dataset, on which a vector of estimates to be used as the “gold standard” is computed; 2) A amputated datasets are generated by an amputation algorithm that reproduces the missingness pattern of the original dataset; 3) A MI estimation pipelines (one per amputated dataset; see Figure 2) are applied to get A pooled estimates, their total variances, standard errors, and confidence intervals; 4) by averaging the A pooled estimates, the expected values of the MI estimates are approximated and compared to the gold-standard estimates computed on the complete dataset (step 1).
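The four steps in the caption can be mimicked end to end with stand-in components. In this sketch the “MI algorithm” is simply a random draw from the observed marginal and the amputation is plain MCAR masking; both are placeholders (the paper uses real amputation and MI methods), but the loop structure — A amputations, m imputations each, pooling, then averaging against the gold standard — is the one depicted.

```python
import numpy as np

rng = np.random.default_rng(0)

def ampute_mcar(x, frac, rng):
    """Stand-in amputation: set a random fraction of entries to NaN (MCAR)."""
    x = x.copy()
    x[rng.random(x.shape) < frac] = np.nan
    return x

def impute_once(x, rng):
    """Stand-in single imputation: fill each missing entry with a random
    draw from the observed marginal (placeholder for one MI draw)."""
    x = x.copy()
    obs = x[~np.isnan(x)]
    miss = np.isnan(x)
    x[miss] = rng.choice(obs, size=miss.sum())
    return x

# Step 1: gold-standard estimate on the complete data (here: the mean).
complete = rng.normal(loc=2.0, scale=1.0, size=5000)
gold = complete.mean()

# Steps 2-4: A amputations, each imputed m times and pooled, then averaged.
A, m = 20, 5
pooled = []
for _ in range(A):
    amputed = ampute_mcar(complete, frac=0.3, rng=rng)
    pooled.append(np.mean([impute_once(amputed, rng).mean() for _ in range(m)]))
expected_mi_estimate = np.mean(pooled)  # compare this against `gold`
```

With a well-behaved imputation model and MCAR missingness, `expected_mi_estimate` should land close to `gold`; systematic deviation is exactly what the evaluation measures in step 4 are meant to expose.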
Figure 4:
Schematic diagram of the pipeline used to evaluate a multiple imputation algorithm. [TOP light green BOX: compute gold standard estimates] Listwise deletion is used to create a “complete” dataset where all the values are observed; the predictor variables in the complete dataset are normalized to obtain uniform scales across different predictors; statistical estimators (in our experiments, two logistic regression models and one Cox survival model) are applied to compute statistical estimates describing the influence of the available predictors on O outcome variables (in our experiments, O = 3 outcomes describing the hospitalization event, the invasive ventilation event, and patients’ survival). [BOTTOM light blue BOX: compute MI estimates] (I) From the complete dataset, A amputated datasets are computed; (II) each amputated dataset is imputed m times by the MI algorithm under evaluation, and (III) each imputed dataset is normalized (as done in the TOP BOX for the complete dataset) to obtain uniform scales across all the predictors in all the imputed datasets and in the complete dataset. (IV) Each imputed-normalized dataset is processed by the O statistical estimators, and Rubin’s rule [4] is applied to pool the estimates across the m imputations. (V) The pooled estimates obtained for each outcome and predictor variable are averaged across the A simulations (one simulation per amputated dataset) to approximate the expected value of the estimate for each predictor and outcome. [YELLOW BOX: compare the gold standard estimates to the imputation estimates] The evaluation measures detailed in Section “Evaluation method” are computed to compare the MI estimates to the gold-standard estimates computed on the complete, normalized dataset (for each of the predictors and outcome variables).
Of note, normalizing the (complete and imputed) dataset predictors to a unique scale before estimation allows averaging all the evaluation measures across all the predictors. For each imputation algorithm under evaluation, the depicted pipeline provides a set of evaluation measures that can be analyzed to choose the most valid and performant algorithm for handling the data missingness. The chosen algorithm can finally be applied to the original (incomplete) dataset to obtain reliable statistical estimates.
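The yellow-box comparison can be illustrated with a small helper that takes the gold-standard estimate and the A pooled MI estimates (with their confidence intervals). The formulas below for RB (relative bias), MSE, and CR (confidence-interval coverage rate) are illustrative assumptions based on the measure names in the figures, not the paper's exact definitions, which are given in its “Evaluation method” section.

```python
import numpy as np

def evaluation_measures(gold, estimates, ci_lowers, ci_uppers):
    """Compare A pooled MI estimates against the gold-standard estimate.

    RB:  relative bias of the average estimate with respect to the gold standard.
    MSE: mean squared error of the estimates around the gold standard.
    CR:  fraction of confidence intervals that cover the gold standard.
    (Illustrative formulas; names follow the figure labels.)
    """
    est = np.asarray(estimates, dtype=float)
    rb = (est.mean() - gold) / gold
    mse = np.mean((est - gold) ** 2)
    cr = np.mean((np.asarray(ci_lowers) <= gold)
                 & (gold <= np.asarray(ci_uppers)))
    return {"RB": rb, "MSE": mse, "CR": cr}
```

A lower |RB| and MSE, together with a CR close to the nominal level (e.g. 0.95), would mark the more trustworthy imputation strategy in this scheme.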
Figure 5:
Average measures obtained by the tested imputation algorithms across the three outcomes (the table is also available in Supplementary file S1 – sheet “mean_measures_m42”) when MAR missingness is simulated in the amputated datasets. For the RB and MSE measures, the highlighted cells mark the models with the fewest losses according to the paired Wilcoxon signed-rank tests computed over the three outcomes. For the CR measure, all the models but the (non-augmented) IPW model (where the probability of data being missing was computed by an RF including the outcome variables in the model) had comparable performance. missRanger with no pmm achieves the lowest standard error estimate (indeed, the ratio SE measure – column “ratio SE” – is the lowest, as also confirmed by the paired Wilcoxon signed-rank test), while the IPW models obtain a standard error greater than the one computed on the unweighted dataset.
Figure 6:
Average absolute value of the RB measures obtained by the tested imputation algorithms across the 38 predictors and the three outcomes (the table is also made available in Supplementary file S3_MCAR – sheet “mean_measures_m42”) when MCAR missingness is simulated in the amputated datasets.
Figure 7:
Average absolute value of the RB measures obtained by the tested imputation algorithms across the 38 predictors and the three outcomes (the table is also made available in Supplementary file S4_MNAR – sheet “mean_measures_m42”) when MNAR missingness is simulated in the amputated datasets.

References

    1. Madden JM, Lakoma MD, Rusinak D, Lu CY, & Soumerai SB (2016). Missing clinical and behavioral health data in a large electronic health record (EHR) system. Journal of the American Medical Informatics Association, 23(6), 1143–1149. - PMC - PubMed
    2. Groenwold RH (2020). Informative missingness in electronic health record systems: the curse of knowing. Diagnostic and Prognostic Research, 4(1), 1–6. - PMC - PubMed
    3. Haneuse S, Arterburn D, & Daniels MJ (2021). Assessing missing data assumptions in EHR-based studies: a complex and underappreciated task. JAMA Network Open, 4(2), e210184. - PubMed
    4. Rubin DB (1987). Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons.
    5. Carlin JB (2014). Multiple Imputation: Perspective and Historical Overview. Chapter 12 of Handbook of Missing Data Methodology, edited by Molenberghs G, Fitzmaurice GM, Kenward MG, Tsiatis A, Verbeke. New York: Chapman & Hall/CRC. 10.1201/b17622 - DOI
