2020 Jul 25;20(1):199.
doi: 10.1186/s12874-020-01080-1.

Accuracy of random-forest-based imputation of missing data in the presence of non-normality, non-linearity, and interaction

Shangzhi Hong et al. BMC Med Res Methodol.

Abstract

Background: Missing data are common in statistical analyses, and imputation methods based on random forests (RF) are becoming popular for handling missing data, especially in biomedical research. Unlike standard imputation approaches, RF-based imputation methods do not assume normality or require specification of parametric models. However, it remains unclear how they perform for non-normally distributed data or when there are non-linear relationships or interactions.

Methods: To examine the effects of these three factors, a variety of datasets were simulated with outcome-dependent missing at random (MAR) covariates, and the performances of the RF-based imputation methods missForest and CALIBERrfimpute were evaluated in comparison with predictive mean matching (PMM).
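The outcome-dependent MAR setup described above can be illustrated with a minimal sketch. It is not the authors' simulation code: the linear model, the logistic missingness function and its coefficients, the sample size, and the use of mean imputation as a stand-in for the imputation step are all assumptions chosen for illustration. The function names `simulate_mar`, `relative_bias`, and `slope` are hypothetical.

```python
import numpy as np

def simulate_mar(n=2000, beta0=0.0, beta1=1.0, seed=0):
    """Simulate Y = beta0 + beta1*X + noise with X ~ Gamma(1, 1), then delete
    X values with a probability that depends only on the fully observed Y."""
    rng = np.random.default_rng(seed)
    x = rng.gamma(shape=1.0, scale=1.0, size=n)
    y = beta0 + beta1 * x + rng.normal(0.0, 1.0, size=n)
    # Outcome-dependent MAR: larger Y makes X more likely to be missing.
    # The logistic form centred at the median of Y is an assumption.
    p_miss = 1.0 / (1.0 + np.exp(-(y - np.median(y))))
    missing = rng.random(n) < p_miss
    x_obs = np.where(missing, np.nan, x)
    return x, x_obs, y, missing

def relative_bias(estimate, truth):
    """Relative bias of an estimate with respect to the true value."""
    return (estimate - truth) / truth

def slope(x, y):
    """Ordinary least squares slope of y on x."""
    return np.polyfit(x, y, 1)[0]

x, x_obs, y, missing = simulate_mar()

# Mean imputation is used here only as a simple placeholder for an
# imputation method; it is not one of the methods evaluated in the paper.
x_imp = np.where(missing, np.nanmean(x_obs), x_obs)

print("proportion missing:", missing.mean())
print("relative bias of mean(X):", relative_bias(x_imp.mean(), x.mean()))
print("relative bias of slope:", relative_bias(slope(x_imp, y), 1.0))
```

Replacing the placeholder imputation step with another method and repeating the simulation over many seeds would give the kind of relative-bias comparison the study reports, under the assumptions stated above.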

Results: Both missForest and CALIBERrfimpute have high predictive accuracy, but missForest can produce severely biased regression coefficient estimates and downward-biased confidence interval coverage, especially for highly skewed variables in nonlinear models. CALIBERrfimpute typically outperforms missForest when estimating regression coefficients, although its biases are still substantial and can be worse than those of PMM for logistic regression relationships with interaction.

Conclusions: RF-based imputation, in particular missForest, should not be indiscriminately recommended as a panacea for imputing missing data, especially when data are highly skewed and/or outcome-dependent MAR. A correct analysis requires a careful critique of the missing data mechanism and the inter-relationships between the variables in the data.

Keywords: Imputation accuracy; Missing data imputation; Random forest.

Conflict of interest statement

Authors declare that they have no competing interests.

Figures

Fig. 1. Distributions used for covariate X. (a) symmetric distributions (normal and uniform), (b) lognormal distributions, (c) gamma distributions, (d) bimodal distributions (mixture of two normal distributions). The panels display the kernel densities based on 1 million observations randomly sampled from each distribution. N(μ1, σ1²)N(μ2, σ2²) represents a homogeneous mixture of 50% Normal(μ1, σ1²) and 50% Normal(μ2, σ2²). For figures with boxplots, the top and bottom 0.025 percentiles were truncated to avoid extreme values in order to facilitate the visual comparison of the boxplots.

Fig. 2. Relative bias of the estimated mean of imputed variables for MAR data.

Fig. 3. Relative bias of the estimated regression coefficient of imputed variables for MAR data.

Fig. 4. Coverage of 95% confidence intervals (with binomial proportion confidence intervals) of the estimated regression coefficients of imputed variables for MAR data.

Fig. 5. Lin's concordance correlation coefficient (CCC) from predictions using models estimated from imputed MAR data.

Fig. 6. Scatter plot of Y versus X from imputation results of a randomly selected dataset when X~Gamma(1, 1) in scenario 1 for MAR data. The dashed line is the true model; the solid line is the model estimated using imputed data. Only imputed observations are shown, for direct comparison between imputation methods.
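Lin's concordance correlation coefficient, used in Fig. 5 to summarise predictive agreement, can be computed as in the sketch below. This is the standard textbook definition with 1/n (population) moments, not code from the paper; the function name `lins_ccc` is hypothetical.

```python
import numpy as np

def lins_ccc(observed, predicted):
    """Lin's concordance correlation coefficient between two numeric vectors.

    CCC = 2*cov(o, p) / (var(o) + var(p) + (mean(o) - mean(p))**2),
    using 1/n (population) variances and covariance.
    """
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    mean_o, mean_p = observed.mean(), predicted.mean()
    var_o, var_p = observed.var(), predicted.var()  # ddof=0 by default
    cov = np.mean((observed - mean_o) * (predicted - mean_p))
    return 2.0 * cov / (var_o + var_p + (mean_o - mean_p) ** 2)

values = np.array([1.0, 2.0, 3.0, 4.0])
print(lins_ccc(values, values))        # perfect agreement gives 1
print(lins_ccc(values, values + 1.0))  # a constant shift lowers the CCC
```

Unlike the Pearson correlation, the CCC penalises systematic shifts and scale differences between predictions and observations, which is why a constant offset reduces it even when the two vectors are perfectly correlated.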

