Predicting sample size required for classification performance

Rosa L Figueroa et al. BMC Med Inform Decis Mak. 2012 Feb 15;12:8. doi: 10.1186/1472-6947-12-8.

Abstract

Background: Supervised learning methods need annotated data in order to generate efficient models. Annotated data, however, is a relatively scarce resource and can be expensive to obtain. For both passive and active learning methods, there is a need to estimate the size of the annotated sample required to reach a performance target.

Methods: We designed and implemented a method that fits an inverse power law model to points of a given learning curve created using a small annotated training set. Fitting is carried out using nonlinear weighted least squares optimization. The fitted model is then used to predict the classifier's performance and confidence interval for larger sample sizes. For evaluation, the nonlinear weighted curve fitting method was applied to a set of learning curves generated using clinical text and waveform classification tasks with active and passive sampling methods, and predictions were validated using standard goodness-of-fit measures. As a control, we used an un-weighted fitting method.
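The core of the method described above can be sketched in a few lines: fit an inverse power law to early learning-curve points with weighted nonlinear least squares, then extrapolate to larger sample sizes. This is a minimal illustration, not the authors' implementation; the model form y = a - b * x^(-c), the example data points, and the weighting scheme (up-weighting later, more stable points) are assumptions for demonstration.

```python
# Sketch: weighted inverse-power-law learning-curve fitting.
# Assumed model form: accuracy(x) = a - b * x**(-c), where a is the
# asymptotic performance. Data and weights below are illustrative only.
import numpy as np
from scipy.optimize import curve_fit

def inverse_power_law(x, a, b, c):
    # a: asymptote; b, c: control how fast the curve approaches it
    return a - b * x ** (-c)

# Hypothetical learning-curve points: (training-set size, accuracy)
sizes = np.array([20.0, 40.0, 60.0, 80.0, 100.0, 120.0])
acc = np.array([0.65, 0.72, 0.76, 0.78, 0.80, 0.81])

# Smaller sigma means larger weight in curve_fit, so later points
# (whose performance estimates are more stable) count more.
sigma = 1.0 / sizes

params, cov = curve_fit(
    inverse_power_law, sizes, acc,
    p0=[0.9, 1.0, 0.5], sigma=sigma, maxfev=10000,
)
a, b, c = params

# Extrapolate classifier performance to a larger annotation budget.
pred = inverse_power_law(500.0, a, b, c)
```

In practice one would also propagate the parameter covariance `cov` to a confidence interval on the prediction, as the paper does, and stop annotating once the predicted gain from more samples falls below the target.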

Results: A total of 568 models were fitted and the model predictions were compared with the observed performances. Depending on the data set and sampling method, it took between 80 and 560 annotated samples to achieve mean absolute error and root mean squared error below 0.01. Results also show that our weighted fitting method outperformed the baseline un-weighted method (p < 0.05).

Conclusions: This paper describes a simple and effective sample size prediction algorithm that conducts weighted fitting of learning curves. The algorithm outperformed an un-weighted algorithm described in previous literature. It can help researchers determine annotation sample size for supervised machine learning.


Figures

Figure 1. Generic learning curve.
Figure 2. Progression of online curve fitting for learning curve of the dataset D2-RAND.
Figure 3. Progression of confidence interval width and MAE for predicted values.
Figure 4. RMSE for predicted values on the three datasets.
Figure 5. Progression of confidence interval widths for the observed values (training set) and the predicted values.

