J Biomed Inform. 2023 Mar;139:104295.
doi: 10.1016/j.jbi.2023.104295. Epub 2023 Jan 27.

A method for comparing multiple imputation techniques: A case study on the U.S. national COVID cohort collaborative


Elena Casiraghi et al. J Biomed Inform. 2023 Mar.

Abstract

Healthcare datasets obtained from Electronic Health Records have proven to be extremely useful for assessing associations between patients' predictors and outcomes of interest. However, these datasets often suffer from missing values in a high proportion of cases, whose removal may introduce severe bias. Several multiple imputation algorithms have been proposed to attempt to recover the missing information under an assumed missingness mechanism. Each algorithm presents strengths and weaknesses, and there is currently no consensus on which multiple imputation algorithm works best in a given scenario. Furthermore, the selection of each algorithm's parameters and data-related modeling choices are also both crucial and challenging. In this paper we propose a novel framework to numerically evaluate strategies for handling missing data in the context of statistical analysis, with a particular focus on multiple imputation techniques. We demonstrate the feasibility of our approach on a large cohort of type-2 diabetes patients provided by the National COVID Cohort Collaborative (N3C) Enclave, where we explored the influence of various patient characteristics on outcomes related to COVID-19. Our analysis included classic multiple imputation techniques as well as simple complete-case Inverse Probability Weighted models. Extensive experiments show that our approach can effectively highlight the most promising and performant missing-data handling strategy for our case study. Moreover, our methodology allowed a better understanding of the behavior of the different models and of how that behavior changed as we modified their parameters. Our method is general and can be applied across research fields and to datasets containing heterogeneous data types.

Keywords: COVID-19 severity assessment; Clinical informatics; Diabetic patients; Evaluation framework; Multiple Imputation.


Conflict of interest statement

Declaration of Competing Interest The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figures

Figure 1:
The presumed MAR missing data patterns in the Wong et al. [15] dataset.
Figure 2:
Schematic diagram of the pipeline used to obtain pooled estimates when applying a MI strategy. The incomplete dataset is imputed m times, where the value of m can be chosen to maximize the efficiency of the MI estimator (see Appendix A); each imputed dataset is individually processed to compute separate inferences; all the inferences are pooled by Rubin's rule [4] to get the pooled estimates (Q̂), their total variances (V̂AR) and standard errors (ŜE), and their confidence intervals (ĈI). In the figure, we use the superscript j to index the imputation number (j = 1, …, m) and the subscript i to index the predictor variable in the dataset (see Table 2 for a detailed list of all the notations used throughout the paper).
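The Rubin's-rule pooling step described in the caption can be sketched in code. The following is a minimal illustration, not the paper's implementation: it pools m per-imputation point estimates and their variances into a pooled estimate, a total variance (within-imputation plus inflated between-imputation), a standard error, and a t-based confidence interval; the degrees-of-freedom formula follows the classic Rubin (1987) approximation.

```python
import numpy as np
from scipy import stats

def pool_rubin(estimates, variances, alpha=0.05):
    """Pool m per-imputation estimates with Rubin's rules.

    estimates, variances: length-m sequences of the per-imputation
    point estimates Q^(j) and their squared standard errors U^(j).
    Returns (pooled estimate, total variance, SE, confidence interval).
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    q_bar = q.mean()                        # pooled point estimate
    w_bar = u.mean()                        # within-imputation variance
    b = q.var(ddof=1)                       # between-imputation variance
    t_var = w_bar + (1.0 + 1.0 / m) * b    # total variance
    se = np.sqrt(t_var)
    # Degrees of freedom for the t-based interval (Rubin 1987 approximation).
    r = (1.0 + 1.0 / m) * b / w_bar
    df = (m - 1) * (1.0 + 1.0 / r) ** 2
    half = stats.t.ppf(1.0 - alpha / 2.0, df) * se
    return q_bar, t_var, se, (q_bar - half, q_bar + half)
```

For example, pooling estimates [1.0, 1.2, 0.8] with variances [0.04, 0.05, 0.03] gives a pooled estimate of 1.0 and a total variance of 0.04 + (1 + 1/3) × 0.04 ≈ 0.093.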
Figure 3:
Schematic diagram of the pipeline used to evaluate one MI algorithm across A multiple amputation settings. The following steps are applied: 1) listwise deletion is used to produce a complete dataset, on which a vector of estimates to be used as the “gold standard” is computed; 2) A amputated datasets are generated by an amputation algorithm that reproduces the missingness pattern of the original dataset; 3) A MI estimation pipelines (one per amputated dataset; see Figure 2) are applied to get A pooled estimates, their total variances, standard errors, and confidence intervals; 4) by averaging the A pooled estimates, the expected values of the MI estimates are approximated and compared to the gold-standard estimates computed on the complete dataset (step 1).
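The four steps in the caption can be mimicked end to end with stand-in components. In this sketch the “MI algorithm” is simply a random draw from the observed marginal and the amputation is plain MCAR masking; both are placeholders (the paper uses real amputation and MI methods), but the loop structure — A amputations, m imputations each, pooling, then averaging against the gold standard — is the one depicted.

```python
import numpy as np

rng = np.random.default_rng(0)

def ampute_mcar(x, frac, rng):
    """Stand-in amputation: set a random fraction of entries to NaN (MCAR)."""
    x = x.copy()
    x[rng.random(x.shape) < frac] = np.nan
    return x

def impute_once(x, rng):
    """Stand-in single imputation: fill each missing entry with a random
    draw from the observed marginal (placeholder for one MI draw)."""
    x = x.copy()
    obs = x[~np.isnan(x)]
    miss = np.isnan(x)
    x[miss] = rng.choice(obs, size=miss.sum())
    return x

# Step 1: gold-standard estimate on the complete data (here: the mean).
complete = rng.normal(loc=2.0, scale=1.0, size=5000)
gold = complete.mean()

# Steps 2-4: A amputations, each imputed m times and pooled, then averaged.
A, m = 20, 5
pooled = []
for _ in range(A):
    amputed = ampute_mcar(complete, frac=0.3, rng=rng)
    pooled.append(np.mean([impute_once(amputed, rng).mean() for _ in range(m)]))
expected_mi_estimate = np.mean(pooled)  # compare this against `gold`
```

With a well-behaved imputation model and MCAR missingness, `expected_mi_estimate` should land close to `gold`; systematic deviation is exactly what the evaluation measures in step 4 are meant to expose.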
Figure 4:
Schematic diagram of the pipeline used to evaluate a multiple imputation algorithm. [TOP light green BOX: compute gold standard estimates] Listwise deletion is used to create a “complete” dataset where all the values are observed; the predictor variables in the complete dataset are normalized to obtain uniform scales across different predictors; statistical estimators (in our experiments, two logistic regression models and one Cox survival model) are applied to compute statistical estimates describing the influence of the available predictors on O outcome variables (in our experiments, O = 3 outcomes describing the hospitalization event, the invasive ventilation event, and patients’ survival). [BOTTOM light blue BOX: compute MI estimates] (I) From the complete dataset, A amputated datasets are computed; (II) each amputated dataset is imputed m times by the MI algorithm under evaluation, and (III) each imputed dataset is normalized (as done in the TOP BOX for the complete dataset) to obtain uniform scales across all the predictors in all the imputed datasets and in the complete dataset. (IV) Each imputed-normalized dataset is processed by the O statistical estimators, and Rubin’s rule [4] is applied to pool the estimates across the m imputations. (V) The pooled estimates obtained for each outcome and predictor variable are averaged across the A simulations (one simulation per amputated dataset) to approximate the expected value of the estimate for each predictor and outcome. [YELLOW BOX: compare the gold standard estimates to the imputation estimates] The evaluation measures detailed in Section “Evaluation method” are computed to compare the MI estimates to the gold-standard estimates computed on the complete, normalized dataset (for each of the predictors and outcome variables).
Of note, normalizing the (complete and imputed) dataset predictors to a unique scale before estimation allows averaging all the evaluation measures across all the predictors. For each imputation algorithm under evaluation, the depicted pipeline provides a set of evaluation measures that can be analyzed to choose the most valid and performant algorithm for handling the data missingness. The chosen algorithm can finally be applied to the original (incomplete) dataset to obtain reliable statistical estimates.
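The yellow-box comparison can be illustrated with a small helper that takes the gold-standard estimate and the A pooled MI estimates (with their confidence intervals). The formulas below for RB (relative bias), MSE, and CR (confidence-interval coverage rate) are illustrative assumptions based on the measure names in the figures, not the paper's exact definitions, which are given in its “Evaluation method” section.

```python
import numpy as np

def evaluation_measures(gold, estimates, ci_lowers, ci_uppers):
    """Compare A pooled MI estimates against the gold-standard estimate.

    RB:  relative bias of the average estimate with respect to the gold standard.
    MSE: mean squared error of the estimates around the gold standard.
    CR:  fraction of confidence intervals that cover the gold standard.
    (Illustrative formulas; names follow the figure labels.)
    """
    est = np.asarray(estimates, dtype=float)
    rb = (est.mean() - gold) / gold
    mse = np.mean((est - gold) ** 2)
    cr = np.mean((np.asarray(ci_lowers) <= gold)
                 & (gold <= np.asarray(ci_uppers)))
    return {"RB": rb, "MSE": mse, "CR": cr}
```

A lower |RB| and MSE, together with a CR close to the nominal level (e.g. 0.95), would mark the more trustworthy imputation strategy in this scheme.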
Figure 5:
Average measures obtained by the tested imputation algorithms across the three outcomes (the table is also available in Supplementary file S1 – sheet “mean_measures_m42”) when MAR missingness is simulated in the amputated datasets. For the RB and MSE measures, the highlighted cells mark the models with the fewest losses according to the paired Wilcoxon signed-rank tests computed over the three outcomes. For the CR measure, all the models but the (non-augmented) IPW model (where the probability of data being missing was computed by an RF including the outcome variables in the model) had comparable performance. missRanger with no pmm achieves the lowest standard error estimate (indeed, the ratio SE measure – column “ratio SE” – is the lowest, as also confirmed by the paired Wilcoxon signed-rank test), while the IPW models obtain a standard error greater than the one computed on the unweighted dataset.
Figure 6:
Average absolute value of the RB measures obtained by the tested imputation algorithms across the 38 predictors and the three outcomes (the table is also made available in Supplementary file S3_MCAR – sheet “mean_measures_m42”) when MCAR missingness is simulated in the amputated datasets.
Figure 7:
Average absolute value of the RB measures obtained by the tested imputation algorithms across the 38 predictors and the three outcomes (the table is also made available in Supplementary file S4_MNAR – sheet “mean_measures_m42”) when MNAR missingness is simulated in the amputated datasets.

References

    1. Madden JM, Lakoma MD, Rusinak D, Lu CY, & Soumerai SB (2016). Missing clinical and behavioral health data in a large electronic health record (EHR) system. Journal of the American Medical Informatics Association, 23(6), 1143–1149. - PMC - PubMed
    2. Groenwold RH (2020). Informative missingness in electronic health record systems: the curse of knowing. Diagnostic and Prognostic Research, 4(1), 1–6. - PMC - PubMed
    3. Haneuse S, Arterburn D, & Daniels MJ (2021). Assessing missing data assumptions in EHR-based studies: a complex and underappreciated task. JAMA Network Open, 4(2), e210184. - PubMed
    4. Rubin DB (1987). Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons.
    5. Carlin JB (2014). Multiple Imputation: Perspective and Historical Overview. Chapter 12 of Handbook of Missing Data Methodology, edited by Molenberghs G, Fitzmaurice GM, Kenward MG, Tsiatis A, Verbeke. New York: Chapman & Hall/CRC. 10.1201/b17622 - DOI
