J Appl Stat. 2024 Mar 5;51(14):2894-2928.
doi: 10.1080/02664763.2024.2325969. eCollection 2024.

The impact of misclassifications and outliers on imputation methods

M Templ et al. J Appl Stat.

Abstract

Many imputation methods have been developed over the years, but they have been tested mostly under ideal settings. Surprisingly, there is no detailed research on how imputation methods perform when the idealized assumptions about the data distribution and/or the model are only partly fulfilled. This research examines the susceptibility of imputation techniques to outliers, misclassifications, and incorrect model specifications. Such knowledge is crucial for judging how well the methods perform in practice because, in reality, conditions are usually not ideal and model assumptions may not hold: the data may not fit the specified models well, outliers distort the estimates, and misclassifications reduce the quality of most imputation methods. Several evaluation measures are discussed, ranging from comparisons of imputed values with true values and comparisons of certain statistics to the performance of classifiers and the variance of estimated parameters. Some well-known imputation methods are compared on real data and in simulations. It turns out that robust conditional imputation methods outperform the other methods on both the real data and the simulation settings.

Keywords: 62-08; Missing values; imputation; misclassifications; outliers; robust methods; simulation.
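
One of the evaluation measures mentioned in the abstract, comparing imputed values with true values, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy numbers are invented, the NRMSE is normalized by the range of the true values (one common convention, which may differ from the definition used in the paper), and plain mean and median imputation stand in for non-robust and robust methods. The robust conditional imputation methods studied in the paper are more elaborate than a median fill.

```python
import math
import statistics


def nrmse(true_values, imputed_values):
    """RMSE between true and imputed values, divided by the range of the true values.

    Assumption: range normalization is only one of several common NRMSE conventions.
    """
    if not true_values or len(true_values) != len(imputed_values):
        raise ValueError("inputs must be non-empty and of equal length")
    value_range = max(true_values) - min(true_values)
    if value_range == 0:
        raise ValueError("true values must not all be identical")
    mse = statistics.fmean(
        (t - i) ** 2 for t, i in zip(true_values, imputed_values)
    )
    return math.sqrt(mse) / value_range


# Hypothetical toy data: observed values containing one gross outlier (95.0).
observed = [9.8, 10.1, 10.0, 9.9, 10.2, 95.0]
# Hypothetical true values of the cells that were set to missing.
true_missing = [9.7, 10.3, 10.0]

# Non-robust fill value (arithmetic mean) versus a simple robust one (median).
mean_fill = statistics.fmean(observed)
median_fill = statistics.median(observed)

nrmse_mean = nrmse(true_missing, [mean_fill] * len(true_missing))
nrmse_median = nrmse(true_missing, [median_fill] * len(true_missing))

print(f"mean fill   = {mean_fill:.2f}, NRMSE = {nrmse_mean:.2f}")
print(f"median fill = {median_fill:.2f}, NRMSE = {nrmse_median:.2f}")
```

In this toy setting the single outlier pulls the mean far away from the bulk of the data, so the mean-based fill yields a much larger NRMSE than the median-based fill, which is the kind of distortion the abstract describes.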

Conflict of interest statement

No potential conflict of interest was reported by the author(s).

Figures

Figure 1.
Results for the Animals data set. True values set to missing (gray triangles), imputed values (red points) and their connecting lines to the true values (dashed gray lines), robust tolerance ellipses of the complete data (solid black ellipses) and after imputation (dashed black ellipses), and the subset of values to be imputed (solid red ellipses for true values and dashed red ellipses for imputed values).
Figure 2.
The biplot for the complete data set (Prestige) is shown in the upper left. The gray dots represent values that are artificially set to missing under a missing-at-random (MAR) mechanism. The imputed values from the different methods are also shown in gray, while the fully observed observations are again in black.
Figure 3.
NRMSE and MSECOR from the imputation of the Prestige data.
Figure 4.
NRMSE and MSECOR from the imputation of the Bushfire data (on a log scale).
Figure 5.
NRMSE and MSECOR from the imputation of the Freedman data.
Figure 6.
NRMSE, MSECOR (both on a log10 scale) and false classifications of imputed values from the imputation of the Iris data.
Figure 7.
Results on the root mean squared error of the arithmetic mean estimates based on simulation setting 1. The thick green line represents the method under consideration, while the thin light gray lines are used for comparison with other methods.
Figure 8.
Results on the coverage rate of the confidence interval of the arithmetic mean based on simulation setting 1. The thick green line represents the method under consideration, while the thin light gray lines are used for comparison with other methods.
Figure 9.
Results on the root mean squared error of the arithmetic mean estimates based on simulation setting 2. The thick green line represents the method under consideration, while the thin light gray lines are used for comparison with other methods.
Figure 10.
Results on the coverage rate of the confidence interval of the arithmetic mean based on the simulation setting 1. The thick green line represents the method under consideration, while the thin light gray lines are used for comparison with other methods.
Figure 11.
F1 score of a random forest classifier for the credit card fraud dataset after introducing missing values and imputing them with different imputation methods.
Figure 12.
F1 score of a random forest classifier for the sonar dataset after introducing missing values and imputing them with different imputation methods.
