Bootstrapping the out-of-sample predictions for efficient and accurate cross-validation
- PMID: 30393425
- PMCID: PMC6191021
- DOI: 10.1007/s10994-018-5714-4
Abstract
Cross-Validation (CV), and out-of-sample performance-estimation protocols in general, are often employed both for (a) selecting the optimal combination of algorithms and hyper-parameter values (called a configuration) for producing the final predictive model, and (b) estimating the predictive performance of the final model. However, the cross-validated performance of the best configuration is optimistically biased, because the same out-of-sample predictions are used both to select the configuration and to score it. We present an efficient bootstrap method that corrects for this bias, called Bootstrap Bias Corrected CV (BBC-CV). BBC-CV's main idea is to bootstrap the whole process of selecting the best-performing configuration on the out-of-sample predictions of each configuration, without training any additional models. In comparison to the alternatives, namely nested cross-validation (Varma and Simon in BMC Bioinform 7(1):91, 2006) and a method by Tibshirani and Tibshirani (Ann Appl Stat 822-829, 2009), BBC-CV is computationally more efficient, has smaller variance and bias, and is applicable to any performance metric (accuracy, AUC, concordance index, mean squared error). We then reuse the idea of bootstrapping the out-of-sample predictions to speed up the CV process itself. Specifically, a bootstrap-based statistical criterion identifies configurations that are inferior with high probability, and we stop training their models on the remaining folds. We call the resulting method Bootstrap Bias Corrected with Dropping CV (BBCD-CV); it is both efficient and provides accurate performance estimates.
Keywords: Bias correction; Cross-validation; Hyper-parameter optimization; Performance estimation.
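The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the out-of-sample predictions of all configurations are already pooled into one matrix, and that the metric is "higher is better". The names `bbc_cv`, `should_drop`, `accuracy`, and `oos_pred`, the default number of bootstrap resamples, and the 0.99 dropping threshold are all hypothetical choices made for illustration; the exact dropping criterion in the paper may differ.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of correct predictions (higher is better)."""
    return float(np.mean(y_true == y_pred))


def bbc_cv(oos_pred, y, metric, n_boot=1000, seed=0):
    """Bias-corrected performance estimate for the best of C configurations.

    oos_pred: (N, C) array; column c holds configuration c's pooled
              out-of-sample CV predictions for all N samples.
    y:        (N,) array of true outcomes.
    metric:   callable(y_true, y_pred) -> float, higher is better.
    """
    rng = np.random.default_rng(seed)
    n, n_conf = oos_pred.shape
    scores = []
    for _ in range(n_boot):
        in_bag = rng.integers(0, n, size=n)            # rows drawn with replacement
        out_bag = np.setdiff1d(np.arange(n), in_bag)   # rows never drawn
        if out_bag.size == 0:
            continue
        # Repeat the selection step using the in-bag rows only.
        perf = [metric(y[in_bag], oos_pred[in_bag, c]) for c in range(n_conf)]
        best = int(np.argmax(perf))
        # Score the selected configuration on rows not used to select it.
        scores.append(metric(y[out_bag], oos_pred[out_bag, best]))
    return float(np.mean(scores))


def should_drop(pred_c, pred_best, y, metric, n_boot=1000, threshold=0.99, seed=0):
    """Assumed dropping rule: drop a configuration if it scores worse than
    the current best in at least `threshold` of the bootstrap resamples of
    the predictions collected so far."""
    rng = np.random.default_rng(seed)
    n = len(y)
    losses = 0
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        if metric(y[idx], pred_c[idx]) < metric(y[idx], pred_best[idx]):
            losses += 1
    return losses / n_boot >= threshold


# Synthetic check: 50 configurations that all guess at random, so the true
# accuracy of every configuration is 0.5.
rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=200)
oos_pred = rng.integers(0, 2, size=(200, 50))

naive = max(accuracy(y, oos_pred[:, c]) for c in range(50))
corrected = bbc_cv(oos_pred, y, accuracy, n_boot=500)
print(f"naive best CV accuracy: {naive:.3f}")
print(f"BBC-CV estimate:        {corrected:.3f}")
```

On this synthetic data the naive estimate (the maximum cross-validated accuracy) is inflated by the selection step, while the bootstrap estimate scores each selected configuration only on rows that played no part in selecting it. No model is retrained at any point; both functions operate solely on the stored predictions.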