Effect of finite sample size on feature selection and classification: a simulation study

Ted W Way et al. Med Phys. 2010 Feb;37(2):907-20. doi: 10.1118/1.3284974.

Abstract

Purpose: The small number of samples available for training and testing is often the limiting factor in finding the most effective features and designing an optimal computer-aided diagnosis (CAD) system. Training on a limited set of samples introduces bias and variance in the performance of a CAD system relative to that trained with an infinite sample size. In this work, the authors conducted a simulation study to evaluate the performances of various combinations of classifiers and feature selection techniques and their dependence on the class distribution, dimensionality, and the training sample size. Understanding these relationships will facilitate the development of effective CAD systems under the constraint of limited available samples.

Methods: Three feature selection techniques, stepwise feature selection (SFS), sequential floating forward search (SFFS), and principal component analysis (PCA), and two commonly used classifiers, Fisher's linear discriminant analysis (LDA) and the support vector machine (SVM), were investigated. Samples were drawn from multidimensional multivariate Gaussian feature spaces with equal or unequal covariance matrices and unequal means, and from a Gaussian feature space with equal covariance matrices and unequal means estimated from a clinical data set. Classifier performance was quantified by the area under the receiver operating characteristic (ROC) curve, Az. The mean Az values obtained by the resubstitution and hold-out methods were evaluated for training sample sizes ranging from 15 to 100 per class. The number of simulated features available for selection was 50, 100, or 200.
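The simulation setup described above can be illustrated with a short sketch. The following Python code is not from the paper; it is a minimal sketch assuming NumPy and scikit-learn, with illustrative parameter choices (M = 100 features, 50 training samples per class, 10 retained principal components). It draws two multivariate normal classes with equal covariance matrices and unequal means, reduces the dimensionality with PCA fitted on the training set, trains an LDA and a radial-kernel SVM, and reports the resubstitution and hold-out estimates of the area under the ROC curve (Az).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

M = 100        # features available for selection (the paper uses 50, 100, 200)
n_train = 50   # training samples per class (the paper varies this from 15 to 100)
n_test = 1000  # large independent set approximates the hold-out performance

# Two multivariate normal classes: equal (identity) covariance, unequal means.
# Only the first 10 features carry class information (illustrative choice).
mean1 = np.zeros(M)
mean1[:10] = 0.3

def draw(n_per_class):
    x0 = rng.multivariate_normal(np.zeros(M), np.eye(M), n_per_class)
    x1 = rng.multivariate_normal(mean1, np.eye(M), n_per_class)
    y = np.r_[np.zeros(n_per_class), np.ones(n_per_class)]
    return np.vstack([x0, x1]), y

x_tr, y_tr = draw(n_train)
x_te, y_te = draw(n_test)

# Dimensionality reduction by PCA, fitted on the training set only.
pca = PCA(n_components=10).fit(x_tr)
z_tr, z_te = pca.transform(x_tr), pca.transform(x_te)

for name, clf in [("LDA", LinearDiscriminantAnalysis()),
                  ("SVM(rad)", SVC(kernel="rbf"))]:
    clf.fit(z_tr, y_tr)
    az_resub = roc_auc_score(y_tr, clf.decision_function(z_tr))  # optimistically biased
    az_hold = roc_auc_score(y_te, clf.decision_function(z_te))   # pessimistic at small n_train
    print(f"{name}: resubstitution Az = {az_resub:.3f}, hold-out Az = {az_hold:.3f}")
```

Repeating the draw-and-evaluate step over many independent trials and over a range of training sample sizes yields mean and standard-deviation curves of the resubstitution and hold-out Az analogous to those in the figures below; substituting a sequential feature selector for PCA, or unequal covariance matrices for the two classes, corresponds to the other conditions studied.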

Results: It was found that the relative performance of the different combinations of classifier and feature selection method depends on the feature space distributions, the dimensionality, and the available training sample sizes. The LDA and SVM with radial kernel performed similarly for most of the conditions evaluated in this study, although the SVM classifier showed a slightly higher hold-out performance than LDA for some conditions and vice versa for other conditions. PCA was comparable to or better than SFS and SFFS for LDA at small sample sizes, but inferior for SVM with polynomial kernel. For the class distributions simulated from clinical data, PCA did not show advantages over the other two feature selection methods. Under this condition, the SVM with radial kernel performed better than the LDA when few training samples were available, while LDA performed better when a large number of training samples were available.

Conclusions: None of the investigated feature selection-classifier combinations provided consistently superior performance under the studied conditions for different sample sizes and feature space distributions. In general, the SFFS method was comparable to the SFS method while PCA may have an advantage for Gaussian feature spaces with unequal covariance matrices. The performance of the SVM with radial kernel was better than, or comparable to, that of the SVM with polynomial kernel under most conditions studied.


Figures

Figure 1. Dependence of the LDA classifier performance Az on training sample size. The two class distributions were multivariate normal with equal covariance matrices and unequal means. The effect of increasing dimensionality of the feature space available for selection (M) is shown in each column. The comparison of the SFS, SFFS, and PCA methods for feature selection is shown in each row.
Figure 2. Dependence of the performance Az of the SVM classifier with radial kernel on training sample size. The two class distributions were multivariate normal with equal covariance matrices and unequal means. The effect of increasing dimensionality of the feature space available for selection (M) is shown in each column. The comparison of the SFS, SFFS, and PCA methods for feature selection is shown in each row.
Figure 3. Dependence of the performance Az of the SVM classifier with polynomial kernel on training sample size. The two class distributions were multivariate normal with equal covariance matrices and unequal means. The effect of increasing dimensionality of the feature space available for selection (M) is shown in each column. The comparison of the SFS, SFFS, and PCA methods for feature selection is shown in each row.
Figure 4. Standard deviation of the hold-out performance as a function of 1/Ntrain for the SFS, SFFS, and PCA feature selection methods and the LDA classifier. The number of features available for selection was M=100 for the equal covariance matrices (first row) and unequal covariance matrices (second row) conditions, and M=61 for the condition with simulated equal covariance matrices estimated from a clinical data set.
Figure 5. Dependence of the performance Az of the SVM classifier with radial kernel on training sample size. The two class distributions were multivariate normal with unequal covariance matrices and unequal means. The effect of increasing dimensionality of the feature space available for selection (M) is shown in each column. The comparison of the SFS, SFFS, and PCA methods for feature selection is shown in each row.
Figure 6. Comparison of the LDA, SVM(rad), and SVM(poly) classifiers with the same input features obtained from SFS. The two class distributions were multivariate normal with unequal covariance matrices and unequal means.
Figure 7. Performance of the SFS, SFFS, and PCA feature selection methods and the LDA, SVM(rad), and SVM(poly) classifiers for simulated multivariate normal class distributions with equal covariance matrices estimated from a clinical data set (M=61).

