BMC Med Res Methodol. 2020 Jul 25;20(1):199.

doi: 10.1186/s12874-020-01080-1.

Accuracy of random-forest-based imputation of missing data in the presence of non-normality, non-linearity, and interaction

Shangzhi Hong¹, Henry S Lynn²

Affiliations

¹ Department of Biostatistics, Key Laboratory on Public Health Safety of the Ministry of Education, School of Public Health, Fudan University, Shanghai, China.
² Department of Biostatistics, Key Laboratory on Public Health Safety of the Ministry of Education, School of Public Health, Fudan University, Shanghai, China. hslynn@shmu.edu.cn.

PMID: 32711455
PMCID: PMC7382855
DOI: 10.1186/s12874-020-01080-1

Shangzhi Hong et al. BMC Med Res Methodol. 2020.

Abstract

Background: Missing data are common in statistical analyses, and imputation methods based on random forests (RF) are becoming popular for handling missing data, especially in biomedical research. Unlike standard imputation approaches, RF-based imputation methods do not assume normality or require specification of parametric models. However, it remains unclear how they perform for non-normally distributed data or when there are non-linear relationships or interactions.

Methods: To examine the effects of these three factors, a variety of datasets were simulated with outcome-dependent missing at random (MAR) covariates, and the performance of the RF-based imputation methods missForest and CALIBERrfimpute was evaluated in comparison with predictive mean matching (PMM).
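
As an illustration of what "outcome-dependent MAR" means for a covariate, the minimal sketch below simulates a skewed covariate X and an outcome Y, then deletes X with a probability that depends only on the fully observed Y. The function name `simulate_mar`, the Gamma(1, 1) choice for X, the regression coefficients, and the logistic missingness model are illustrative assumptions, not the settings used in the study.

```python
import math
import random


def simulate_mar(n=1000, beta0=1.0, beta1=0.5, seed=0):
    """Return n (x, y) pairs where x is set to None with a probability driven by y.

    All parameter values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gammavariate(1.0, 1.0)  # skewed covariate: shape 1, scale 1
        y = beta0 + beta1 * x + rng.gauss(0.0, 1.0)  # linear outcome model
        # Missingness depends on the observed outcome only (MAR given y)
        p_miss = 1.0 / (1.0 + math.exp(-(y - 1.5)))
        x_obs = None if rng.random() < p_miss else x
        rows.append((x_obs, y))
    return rows
```

Because larger values of Y make X more likely to be missing, the observed values of X are not a random subsample, which is the situation in which an imputation method has to use the outcome correctly to avoid biased estimates.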

Results: Both missForest and CALIBERrfimpute have high predictive accuracy, but missForest can produce severely biased regression coefficient estimates and downward-biased confidence interval coverage, especially for highly skewed variables in nonlinear models. CALIBERrfimpute typically outperforms missForest when estimating regression coefficients, although its biases are still substantial and can be worse than those of PMM for logistic regression relationships with interaction.

Conclusions: RF-based imputation, in particular missForest, should not be indiscriminately recommended as a panacea for imputing missing data, especially when data are highly skewed and/or outcome-dependent MAR. A correct analysis requires a careful critique of the missing data mechanism and the inter-relationships between the variables in the data.

Keywords: Imputation accuracy; Missing data imputation; Random forest.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Fig. 1**
Distributions used for covariate X. (a) symmetric distributions (normal and uniform), (b) lognormal distributions, (c) gamma distributions, (d) bimodal distributions (mixture of two normal distributions). The panels display the kernel densities based on 1 million observations randomly sampled from each distribution. $[N(\mu_{1}, \sigma_{1}^{2}), N(\mu_{2}, \sigma_{2}^{2})]$ represents a homogeneous mixture of 50% $Normal(\mu_{1}, \sigma_{1}^{2})$ and 50% $Normal(\mu_{2}, \sigma_{2}^{2})$. For figures with boxplots, the top and bottom 0.025 percentiles were truncated to avoid extreme values in order to facilitate the visual comparison of the boxplots

**Fig. 2**
Relative bias of the estimated mean of imputed variables for MAR data

**Fig. 3**
Relative bias of the estimated regression coefficient of imputed variables for MAR data

**Fig. 4**
Coverage of 95% confidence intervals (with binomial proportion confidence intervals) of the estimated regression coefficients of imputed variables for MAR data
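
The metrics named in Figs. 2 to 4 can be sketched as follows. This assumes the common definition of relative bias, (mean estimate - true value) / true value, and a Wald (normal approximation) interval for the binomial coverage proportion; the record does not state which definitions or interval method the authors used, and the function names are hypothetical.

```python
import math


def relative_bias(estimates, true_value):
    """Relative bias of repeated estimates around a known non-zero true value."""
    mean_est = sum(estimates) / len(estimates)
    return (mean_est - true_value) / true_value


def coverage_with_wald_ci(intervals, true_value, z=1.96):
    """Share of (lower, upper) intervals containing true_value, with a Wald CI.

    The Wald interval is an assumption made for this sketch.
    """
    n = len(intervals)
    p = sum(1 for lo, hi in intervals if lo <= true_value <= hi) / n
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return p, (max(0.0, p - half_width), min(1.0, p + half_width))
```

In a simulation study, `estimates` and `intervals` would hold one entry per simulated dataset, so a coverage well below 0.95 signals intervals that are too narrow or centered on a biased estimate.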

**Fig. 5**
Lin’s concordance correlation coefficient (CCC) from predictions using models estimated from imputed MAR data
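
For readers unfamiliar with the metric in Fig. 5, Lin's CCC between two series x and y is 2·s_xy / (s_x² + s_y² + (mean_x − mean_y)²). The sketch below is a minimal implementation of that standard formula using population (1/n) moments; the function name `lins_ccc` is hypothetical, and how the authors computed the statistic is not stated in this record.

```python
def lins_ccc(x, y):
    """Lin's concordance correlation coefficient using 1/n variance and covariance."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    # Penalizes both poor correlation and systematic shift between x and y
    return 2.0 * cov_xy / (var_x + var_y + (mean_x - mean_y) ** 2)
```

Unlike the Pearson correlation, the CCC drops below 1 when predictions are perfectly correlated with the truth but shifted or rescaled, which is why it is suited to judging agreement between predicted and true values.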

**Fig. 6**
Scatter plot of Y versus X from imputation results of a randomly selected dataset when X~Gamma(1, 1) in scenario 1 for MAR data. The dashed line is the true model; the solid line is the model estimated from imputed data. Only imputed observations are shown, for direct comparison between imputation methods

Associated data

Dryad/10.5061/dryad.pd44k8r
