Does the missing data imputation method affect the composition and performance of prognostic models?
- PMID: 22737551
- PMCID: PMC3372019
Abstract
Background: We previously showed that imputing missing data with the Multivariable Imputation via Chained Equations (MICE) method is superior to excluding cases with missing data; however, the MICE methodology is complicated, and simpler imputation methods are available. The aim of this study was to compare these methods in terms of model composition and performance.
Methods: Three hundred and ten breast cancer patients were recruited. Four approaches were applied to impute missing data. First, we adopted an ad hoc method in which the missing data for each variable were replaced by the median of the observed values. Then three likelihood-based approaches were used. In regression imputation, a regression model related the variable with missing data to the rest of the variables, and the regression equation was used to fill in the missing data. In the Expectation Maximum (E-M) algorithm, missing data and regression parameters were estimated iteratively until the regression parameters converged. Finally, the MICE method was applied. The resulting models were compared in terms of the variables that contributed significantly to the multifactorial analysis, and in terms of sensitivity and specificity.
Results: All candidate variables contributed significantly to the MICE model. However, grade of disease lost its effect in the other three models. The MICE model showed the best performance, followed by the E-M model.
Conclusion: The final models differed across imputation methods in terms of both composition and performance. Therefore, modern imputation methods are recommended to recover the information lost to missing data.
Keywords: Breast cancer; Data; Expectation maximum algorithm; Multivariable imputation via chained equations.
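To make the difference between the two simplest approaches in the Methods concrete, the following is a minimal sketch, not the authors' implementation. It contrasts median imputation with single regression imputation on a toy array. The function names (`median_impute`, `regression_impute`) and the data are hypothetical, and the regression step assumes that all predictor columns are fully observed, which is a simplification.

```python
import numpy as np


def median_impute(X):
    """Replace each NaN with the median of the observed values in its column."""
    X = X.astype(float).copy()
    medians = np.nanmedian(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = medians[cols]
    return X


def regression_impute(X, target):
    """Fill NaNs in column `target` from a least-squares fit on the other columns.

    Assumption: the other columns contain no missing values.
    """
    X = X.astype(float).copy()
    missing = np.isnan(X[:, target])
    others = np.delete(X, target, axis=1)
    # Design matrix: intercept plus all remaining variables.
    design = np.column_stack([np.ones(len(X)), others])
    # Fit on complete rows only, then predict the missing entries.
    coef, *_ = np.linalg.lstsq(design[~missing], X[~missing, target], rcond=None)
    X[missing, target] = design[missing] @ coef
    return X


# Hypothetical toy data: column 1 equals 2 * column 0 + 1, with one value missing.
X = np.array([
    [1.0, 3.0],
    [2.0, 5.0],
    [3.0, 7.0],
    [4.0, np.nan],
    [10.0, 21.0],
])

X_median = median_impute(X)
X_regression = regression_impute(X, target=1)

print("median imputation:    ", X_median[3, 1])
print("regression imputation:", X_regression[3, 1])
```

In this toy case the median fill ignores the relationship between the two variables, whereas the regression fill uses it. Chained-equations methods such as MICE build on the regression idea by cycling over every incomplete variable and drawing imputations with random variation to produce several completed data sets; that step is not shown here.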