Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms
- PMID: 9744903
- DOI: 10.1162/089976698300017197
Abstract
This article reviews five approximate statistical tests for determining whether one learning algorithm outperforms another on a particular learning task. These tests are compared experimentally to determine their probability of incorrectly detecting a difference when no difference exists (type I error). Two widely used statistical tests are shown to have high probability of type I error in certain situations and should never be used: a test for difference of two proportions and a paired-differences t test based on taking several random train-test splits. A third test, a paired-differences t test based on 10-fold cross-validation, exhibits somewhat elevated probability of type I error. A fourth test, McNemar's test, is shown to have low type I error. The fifth test is a new test, 5 x 2 cv, based on five iterations of twofold cross-validation. Experiments show that this test also has acceptable type I error. The article also measures the power (ability to detect algorithm differences when they do exist) of these tests. The cross-validated t test is the most powerful. The 5 x 2 cv test is shown to be slightly more powerful than McNemar's test. The choice of the best test is determined by the computational cost of running the learning algorithm. For algorithms that can be executed only once, McNemar's test is the only test with acceptable type I error. For algorithms that can be executed 10 times, the 5 x 2 cv test is recommended, because it is slightly more powerful and because it directly measures variation due to the choice of training set.
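As a concrete illustration of the two recommended tests, here is a minimal Python sketch, assuming scikit-learn-style estimators with fit and score methods; the function names are illustrative and not from the paper. The 5 x 2 cv statistic follows the paper's construction: the first difference from the first replication divided by the root mean of the five per-replication variance estimates, referred to a t distribution with 5 degrees of freedom.

```python
import numpy as np
from scipy import stats
from sklearn.model_selection import train_test_split

def five_by_two_cv_t_test(clf_a, clf_b, X, y, seed=0):
    """5 x 2 cv paired t test: five replications of twofold cross-validation.

    Returns the t statistic (approximately t-distributed with 5 degrees
    of freedom under the null hypothesis) and its two-sided p-value.
    """
    rng = np.random.RandomState(seed)
    diffs = np.zeros((5, 2))  # error-rate differences p_i^(1), p_i^(2)
    for i in range(5):
        # One replication: a fresh random split into two halves.
        X1, X2, y1, y2 = train_test_split(
            X, y, test_size=0.5, random_state=rng.randint(2**31))
        for j, (Xtr, ytr, Xte, yte) in enumerate(
                [(X1, y1, X2, y2), (X2, y2, X1, y1)]):
            err_a = 1.0 - clf_a.fit(Xtr, ytr).score(Xte, yte)
            err_b = 1.0 - clf_b.fit(Xtr, ytr).score(Xte, yte)
            diffs[i, j] = err_a - err_b
    p_bar = diffs.mean(axis=1)                        # per-replication mean
    s2 = ((diffs - p_bar[:, None]) ** 2).sum(axis=1)  # per-replication variance
    t = diffs[0, 0] / np.sqrt(s2.mean())              # p_1^(1) / sqrt(mean s_i^2)
    return t, 2 * stats.t.sf(abs(t), df=5)

def mcnemar_test(y_true, pred_a, pred_b):
    """McNemar's test on a single train/test split, with continuity correction.

    Compares the counts of examples misclassified by exactly one of the
    two algorithms; the statistic is approximately chi-squared with 1 df.
    """
    a_wrong = pred_a != y_true
    b_wrong = pred_b != y_true
    n01 = np.sum(a_wrong & ~b_wrong)  # A wrong, B right
    n10 = np.sum(~a_wrong & b_wrong)  # A right, B wrong
    chi2 = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
    return chi2, stats.chi2.sf(chi2, df=1)
```

Note the trade-off the abstract describes: five_by_two_cv_t_test refits each classifier ten times and so captures training-set variation, while mcnemar_test needs only one training run per algorithm.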
Similar articles
- Combined 5 x 2 cv F test for comparing supervised classification learning algorithms. Neural Comput. 1999 Nov 15;11(8):1885-92. doi: 10.1162/089976699300016007. PMID: 10578036
- Weighted McNemar's test for the comparison of two screening tests in the presence of verification bias. Stat Med. 2022 Jul 20;41(16):3149-3163. doi: 10.1002/sim.9409. Epub 2022 Apr 15. PMID: 35428039
- Blocked 3×2 cross-validated t-test for comparing supervised classification learning algorithms. Neural Comput. 2014 Jan;26(1):208-35. doi: 10.1162/NECO_a_00532. Epub 2013 Oct 8. PMID: 24102129
- Statistical methods for assessing measurement error (reliability) in variables relevant to sports medicine. Sports Med. 1998 Oct;26(4):217-38. doi: 10.2165/00007256-199826040-00002. PMID: 9820922. Review.
- Take-Home Training in Laparoscopy. Dan Med J. 2017 Apr;64(4):B5335. PMID: 28385174. Review.
Cited by
- Wavelet radiomics features from multiphase CT images for screening hepatocellular carcinoma: analysis and comparison. Sci Rep. 2023 Nov 10;13(1):19559. doi: 10.1038/s41598-023-46695-8. PMID: 37950031. Free PMC article.
- Classification of pallidal oscillations with increasing parkinsonian severity. J Neurophysiol. 2015 Jul;114(1):209-18. doi: 10.1152/jn.00840.2014. Epub 2015 Apr 15. PMID: 25878156. Free PMC article.
- Predicting early-onset COPD risk in adults aged 20-50 using electronic health records and machine learning. PeerJ. 2024 Feb 23;12:e16950. doi: 10.7717/peerj.16950. eCollection 2024. PMID: 38410800. Free PMC article.
- Beyond hand-crafted features for pretherapeutic molecular status identification of pediatric low-grade gliomas. Sci Rep. 2024 Aug 17;14(1):19102. doi: 10.1038/s41598-024-69870-x. PMID: 39154039. Free PMC article.
- Learning Using Partially Available Privileged Information and Label Uncertainty: Application in Detection of Acute Respiratory Distress Syndrome. IEEE J Biomed Health Inform. 2021 Mar;25(3):784-796. doi: 10.1109/JBHI.2020.3008601. Epub 2021 Mar 5. PMID: 32750956. Free PMC article.