High-dimensional, outcome-dependent missing data problems: Models for the human KIR loci
- PMID: 39885761
- PMCID: PMC11951372
- DOI: 10.1177/09622802241304112
Abstract
Missing data problems are common in biological, high-dimensional data, where data can be partially or completely missing. Imputation and expectation-maximization algorithms have been developed to reconstruct the missing values. For missing data problems, it has been suggested that the regression model of interest should be incorporated into the imputation procedure to reduce bias in the regression coefficients. Here we consider a challenging missing data problem, in which diplotypes of the KIR loci are to be reconstructed. These loci are difficult to genotype, resulting in ambiguous genotype calls. We extend a previously proposed expectation-maximization algorithm by incorporating a potentially high-dimensional regression model for the outcome. Three strategies are evaluated: (1) only allelic predictors, (2) allelic predictors and forward-backward selection on haplotype predictors, and (3) penalized regression on a saturated model. In a simulation study, we compared these strategies with a baseline expectation-maximization algorithm without an outcome model. For extreme choices of effect sizes and missingness levels, the outcome-based expectation-maximization algorithms outperformed the no-outcome expectation-maximization algorithm. In all other cases, however, the no-outcome expectation-maximization algorithm performed better than or comparably to the three strategies, suggesting that the outcome model can have a harmful effect. In a data analysis concerning death after allogeneic hematopoietic stem cell transplantation as a function of donor KIR genes, expectation-maximization algorithms with and without an outcome model showed very similar results. In conclusion, outcome-based missing data models in the high-dimensional setting have to be used with care and are likely to lead to biased results.
Keywords: KIR genes; Missing data; expectation-maximization algorithm; haplotype reconstruction; multiple imputation; outcome dependent imputation.
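The contrast the abstract draws between a no-outcome and an outcome-based expectation-maximization algorithm can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes Hardy-Weinberg proportions for diplotypes, a binary outcome, and a logistic outcome model with additive allelic effects whose coefficients `beta` are held fixed (the algorithm in the article estimates a regression model, which is omitted here for brevity). The function name `em_haplotype_freqs` and the data layout are hypothetical.

```python
import math

def em_haplotype_freqs(candidates, n_haps, outcomes=None, beta=None, n_iter=50):
    """Toy EM for haplotype frequencies from ambiguous genotype calls.

    candidates[i] lists the (a, b) haplotype-index pairs (diplotypes)
    compatible with individual i. If outcomes and beta are given, the
    E-step weights are multiplied by the outcome likelihood (assumption:
    logistic model, beta[0] intercept, beta[1 + h] effect of haplotype h).
    """
    p = [1.0 / n_haps] * n_haps  # start from uniform frequencies
    for _ in range(n_iter):
        counts = [0.0] * n_haps
        for i, dips in enumerate(candidates):
            # E-step: unnormalized posterior weight of each compatible diplotype
            weights = []
            for a, b in dips:
                w = p[a] * p[b] * (1.0 if a == b else 2.0)  # Hardy-Weinberg prior
                if outcomes is not None:
                    eta = beta[0] + beta[1 + a] + beta[1 + b]
                    pr = 1.0 / (1.0 + math.exp(-eta))
                    w *= pr if outcomes[i] == 1 else 1.0 - pr
                weights.append(w)
            total = sum(weights)
            # accumulate expected haplotype counts
            for (a, b), w in zip(dips, weights):
                counts[a] += w / total
                counts[b] += w / total
        # M-step: each individual carries two haplotypes
        p = [c / (2.0 * len(candidates)) for c in counts]
    return p

# Hypothetical data: the third individual has an ambiguous call.
calls = [[(0, 0)], [(0, 1)], [(0, 1), (1, 1)], [(1, 1)]]
y = [0, 0, 1, 1]
p_no_outcome = em_haplotype_freqs(calls, n_haps=2)
p_outcome = em_haplotype_freqs(calls, n_haps=2, outcomes=y, beta=[-1.0, 0.0, 1.0])
```

With all coefficients set to zero the outcome likelihood is the same for every diplotype, so the outcome-based variant reduces to the no-outcome algorithm. With non-zero effects the outcome shifts the posterior weights of ambiguous diplotypes, which is the mechanism through which a misspecified or overfitted outcome model could bias the reconstruction.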
Conflict of interest statement
Declaration of conflicting interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
