J Appl Stat. 2024 Mar 5;51(14):2894-2928.

doi: 10.1080/02664763.2024.2325969. eCollection 2024.

The impact of misclassifications and outliers on imputation methods

M Templ¹, Markus Ulmer²

Affiliations

¹ Institute for Competitiveness and Communication, School of Business, University of Applied Sciences and Arts Northwestern Switzerland, Olten, Switzerland.
² Institute of Data Analysis and Process Design, School of Engineering, Zurich University of Applied Sciences, Winterthur, Switzerland.

PMID: 39450101
PMCID: PMC11500630
DOI: 10.1080/02664763.2024.2325969

M Templ et al. J Appl Stat. 2024.

Abstract

Many imputation methods have been developed over the years, but they have been tested mostly under ideal settings. Surprisingly, there is no detailed research on how imputation methods perform when the idealized assumptions about the data distribution and/or the model are only partly fulfilled. This research examines the susceptibility of imputation techniques to outliers, misclassifications, and incorrect model specifications. Such knowledge is crucial for judging how well the methods perform in practice because real conditions are usually not ideal: model assumptions may not hold, and the data may not fit the specified models well. Outliers distort the estimates, and misclassifications reduce the quality of most imputation methods. Several evaluation measures are discussed, ranging from comparisons of imputed values with true values and comparisons of certain statistics to the performance of classifiers and the variance of estimated parameters. Some well-known imputation methods are compared using real data and simulations. Robust conditional imputation methods turn out to outperform the other methods in both the real-data and the simulation settings.

Keywords: 62-08; Missing values; imputation; misclassifications; outliers; robust methods; simulation.
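
The abstract's claim that outliers distort imputations while robust conditional methods resist them can be illustrated with a small synthetic experiment. The sketch below is not the authors' method or code: it uses hypothetical helper names (`fit_ols`, `fit_theil_sen`, `impute_by_regression`, `run_demo`), simulated data, and a hand-written Theil-Sen estimator as a simple stand-in for a robust regression imputation. It compares the root mean squared error of imputed values against the known true values, which is only one of the kinds of evaluation measure the abstract mentions.

```python
import numpy as np


def fit_ols(x, y):
    """Ordinary least squares line; sensitive to outliers in y."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope, intercept


def fit_theil_sen(x, y):
    """Theil-Sen line: median of pairwise slopes, then median residual offset.

    Used here only as an illustrative robust alternative (an assumption,
    not the estimator used in the paper).
    """
    i, j = np.triu_indices(len(x), k=1)
    dx = x[j] - x[i]
    keep = dx != 0
    slope = np.median((y[j] - y[i])[keep] / dx[keep])
    intercept = np.median(y - slope * x)
    return slope, intercept


def impute_by_regression(x, y, fit):
    """Fill NaN entries of y with predictions from a line fitted on observed pairs."""
    missing = np.isnan(y)
    slope, intercept = fit(x[~missing], y[~missing])
    filled = y.copy()
    filled[missing] = intercept + slope * x[missing]
    return filled


def run_demo(seed=0, n=200, n_missing=40, n_outliers=20):
    """Return the RMSE of imputed vs. true values for each fitting method."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 10.0, size=n)
    y_true = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, size=n)

    # Disjoint index sets: values to delete and observed values to contaminate.
    order = rng.permutation(n)
    missing_idx = order[:n_missing]
    outlier_idx = order[n_missing:n_missing + n_outliers]

    y_obs = y_true.copy()
    y_obs[outlier_idx] += 50.0      # gross outliers among the observed values
    y_obs[missing_idx] = np.nan     # values to be imputed

    results = {}
    for name, fit in (("ols", fit_ols), ("theil_sen", fit_theil_sen)):
        filled = impute_by_regression(x, y_obs, fit)
        err = filled[missing_idx] - y_true[missing_idx]
        results[name] = float(np.sqrt(np.mean(err ** 2)))
    return results


results = run_demo()
print(results)
```

In this setup the contaminated observations pull the least squares line away from the bulk of the data, so its imputations inherit that bias, whereas the median-based fit largely ignores them. The contamination level, the size of the shift, and the missingness mechanism are arbitrary choices made for illustration.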

Conflict of interest statement

No potential conflict of interest was reported by the author(s).

Figures

**Figure 1.**
Results for the Animals data set. Shown are the true values set to missing (gray triangles), the imputed values (red points), and the lines connecting imputed and true values (dashed gray lines). Robust tolerance ellipses are drawn for the complete data (solid black), for the data after imputation (dashed black), and for the subset of values to be imputed (solid red for true values, dashed red for imputed values).

**Figure 2.**
The biplot for the complete data set (Prestige) is shown in the upper left. The gray dots represent values that were artificially set to missing under a missing-at-random (MAR) mechanism. The imputed values from the different methods are also shown in gray, while the fully observed observations are shown in black.

**Figure 3.**
NRMSE and MSECOR from the imputation of the Prestige data.

**Figure 4.**
NRMSE and MSECOR from the imputation of the Bushfire data (in log-scale).

**Figure 5.**
NRMSE and MSECOR from imputation of the Freedman data.

**Figure 6.**
NRMSE, MSECOR (both in log10-scale) and false classifications of imputed values from the imputation of the Iris data.

**Figure 7.**
Results on the root mean squared error of the arithmetic mean estimates based on the simulation setting 1. The thick green line represents the actual method, while the thin light gray lines are used for comparison with other methods.

**Figure 8.**
Results on the coverage rate of the confidence interval of the arithmetic mean based on the simulation setting 1. The thick green line represents the actual method, while the thin light gray lines are used for comparison with other methods.

**Figure 9.**
Results on the root mean squared error of the arithmetic mean estimates based on the simulation setting 2. The thick green line represents the actual method, while the thin light gray lines are used for comparison with other methods.

**Figure 10.**
Results on the coverage rate of the confidence interval of the arithmetic mean based on the simulation setting 1. The thick green line represents the actual method, while the thin light gray lines are used for comparison with other methods.

**Figure 11.**
F1 score of a random forest classifier for the credit card fraud dataset after including and imputing missing values with different imputation methods.

**Figure 12.**
F1 score of a random forest classifier for the sonar dataset after including and imputing missing values with different imputation methods.
