Joint use of over- and under-sampling techniques and cross-validation for the development and assessment of prediction models

Rok Blagus et al. BMC Bioinformatics. 2015 Nov 4;16:363. doi: 10.1186/s12859-015-0784-9.

Abstract

Background: Prediction models are used in clinical research to develop rules that can accurately predict patient outcomes from a set of patient characteristics. They are a valuable tool in the decision making of clinicians and health policy makers, as they make it possible to estimate the probability that a patient has or will develop a disease, will respond to a treatment, or will experience a disease recurrence. Interest in prediction models in the biomedical community has been growing in recent years. Often the data used to develop prediction models are class-imbalanced, as only a few patients experience the event of interest (and therefore belong to the minority class).

Results: Prediction models developed using class-imbalanced data tend to achieve sub-optimal predictive accuracy in the minority class. This problem can be diminished by using sampling techniques aimed at balancing the class distribution: undersampling, where only a fraction of the majority class samples is retained in the analysis, and oversampling, where new samples from the minority class are generated. Correctly assessing how the prediction model is likely to perform on independent data is of crucial importance; in the absence of an independent data set, cross-validation is normally used. While the importance of correct cross-validation is well documented in the biomedical literature, the challenges posed by the joint use of sampling techniques and cross-validation have not been addressed.
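The leakage at issue is easy to demonstrate. Below is a minimal pure-Python sketch (our own construction, not code from the paper): a 1-nearest-neighbour rule on synthetic 1-D data, with random oversampling applied either inside each CV training fold (the correct protocol) or to the whole dataset before splitting (the incorrect protocol). Because oversampling creates exact copies, the incorrect protocol lets the classifier memorise test samples whose replicas sit in the training fold, inflating the estimated accuracy.

```python
import random
from statistics import mean

def one_nn_predict(train_X, train_y, x):
    """Classify x by the label of its nearest training point (1-NN)."""
    best = min(range(len(train_X)), key=lambda i: abs(train_X[i] - x))
    return train_y[best]

def oversample(X, y):
    """Random oversampling: duplicate minority samples until classes balance."""
    minority = [i for i, lab in enumerate(y) if lab == 1]
    majority = [i for i, lab in enumerate(y) if lab == 0]
    if not minority or len(majority) <= len(minority):
        return list(X), list(y)
    rng = random.Random(1)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = list(range(len(y))) + extra
    return [X[i] for i in idx], [y[i] for i in idx]

def cv_accuracy(X, y, k=2, sample_inside=True, seed=0):
    """k-fold CV. If sample_inside, oversample the training fold only
    (correct); otherwise assume X, y were oversampled before splitting."""
    rng = random.Random(seed)
    idx = list(range(len(y)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    accs = []
    for f in range(k):
        test = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        tr_X, tr_y = [X[i] for i in train], [y[i] for i in train]
        if sample_inside:
            tr_X, tr_y = oversample(tr_X, tr_y)
        preds = [one_nn_predict(tr_X, tr_y, X[i]) for i in test]
        accs.append(mean(p == y[i] for p, i in zip(preds, test)))
    return mean(accs)

rng = random.Random(42)
# heavily overlapping classes, 90:10 imbalance: honest accuracy is modest
X = [rng.gauss(0.0, 1.0) for _ in range(90)] + [rng.gauss(0.5, 1.0) for _ in range(10)]
y = [0] * 90 + [1] * 10

correct = cv_accuracy(X, y, sample_inside=True)       # sampling inside CV
Xo, yo = oversample(X, y)
incorrect = cv_accuracy(Xo, yo, sample_inside=False)  # sampling before CV
```

With these synthetic data the "sampling before CV" estimate comes out higher than the honest one, because duplicated minority samples in the test fold are matched at distance zero by their own copies in the training fold.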

Conclusions: We show that care must be taken to ensure that cross-validation is performed correctly on sampled data, and that the risk of overestimating the predictive accuracy is greater when oversampling techniques are used. Examples based on the re-analysis of real datasets and on simulation studies are provided. We identify some results in the biomedical literature where cross-validation was performed incorrectly and where we expect that the performance of oversampling techniques was consequently heavily overestimated.


Figures

Fig. 1
Combination of sampling and CV methods used in the simulations and real data analyses. CV that includes sampling within each fold (first row) constitutes the correct approach, while sampling followed by CV (second row) is the incorrect approach. Samples included in the original dataset are indicated with upper-case letters, while their copies are indicated with lower-case letters
Fig. 2
Probability that at least one of the replicas of a sample included in the test fold is also included in the training fold, as a function of the proportion of minority class samples (p_min). The figure shows how the probability that a test sample has a replica in the training fold depends on the level of class imbalance (p_min) in a dataset with n=100 samples when using 2-fold CV
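The dependence shown in Fig. 2 can be reproduced with a short Monte Carlo sketch (our own construction; the function name and the random-oversampling-with-replacement scheme are illustrative assumptions, not the paper's code). Each augmented sample is tagged with the id of the original it copies, the data are split into folds at random, and we count how often a minority sample in the test fold has a replica of itself on the training side.

```python
import random

def replica_overlap_prob(n=100, p_min=0.1, k=2, trials=2000, seed=0):
    """Monte Carlo estimate of the probability that a minority sample in
    the test fold also has a replica of itself in the training folds,
    after random oversampling to class balance and a random k-fold split."""
    rng = random.Random(seed)
    n_min = max(1, round(n * p_min))
    n_maj = n - n_min
    hit = tot = 0
    for _ in range(trials):
        # tag every augmented sample with the id of the original it copies
        ids = [("maj", i) for i in range(n_maj)]
        ids += [("min", j) for j in range(n_min)]                             # originals
        ids += [("min", rng.randrange(n_min)) for _ in range(n_maj - n_min)]  # copies
        rng.shuffle(ids)
        cut = len(ids) // k
        test, train = ids[:cut], ids[cut:]
        train_min_ids = {s for s in train if s[0] == "min"}
        for s in test:
            if s[0] == "min":
                tot += 1
                hit += s in train_min_ids
    return hit / tot
```

With n=100 and 2-fold CV, the estimate approaches 1 as p_min shrinks: each minority original is then copied many times, so almost surely some replica lands on each side of the split, which is exactly the leakage mechanism the figure illustrates.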
Fig. 3
Cross-validated AUC for different sample sizes and classification rules obtained on simulated data. AUC obtained with different classification rules for simulated data with 10 variables (simulated independently from a Gaussian distribution with zero mean and unit variance) and 2 CV folds. There were n=100, 500, 1,000 and 10,000 samples
Fig. 4
Cross-validated AUC for different UCI datasets. Datasets are ordered by the AUC obtained with correct CV
Fig. 5
Cross-validated AUC for different gene expression microarray datasets. Datasets are ordered by the AUC obtained with correct CV

