SMOTE for high-dimensional class-imbalanced data

Rok Blagus et al. BMC Bioinformatics. 2013 Mar 22;14:106. doi: 10.1186/1471-2105-14-106.

Abstract

Background: Classification using class-imbalanced data is biased in favor of the majority class. The bias is even larger for high-dimensional data, where the number of variables greatly exceeds the number of samples. The problem can be attenuated by undersampling or oversampling, which produce class-balanced data. Generally undersampling is helpful, while random oversampling is not. Synthetic Minority Oversampling TEchnique (SMOTE) is a very popular oversampling method that was proposed to improve random oversampling, but its behavior on high-dimensional data has not been thoroughly investigated. In this paper we investigate the properties of SMOTE from a theoretical and empirical point of view, using simulated and real high-dimensional data.
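As background for the analysis that follows: SMOTE generates synthetic minority-class samples by interpolating, at a random point along the connecting segment, between each minority sample and one of its k nearest minority-class neighbors. Below is a minimal NumPy sketch of this generation step (the function name and parameters are illustrative, not the authors' code):

```python
import numpy as np

def smote(X_min, n_synthetic, k=5, seed=None):
    """Generate synthetic minority-class samples by interpolating between
    each minority sample and one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    # Pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)               # a sample is not its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]         # indices of the k nearest neighbors
    out = np.empty((n_synthetic, X_min.shape[1]))
    for i in range(n_synthetic):
        j = rng.integers(n)                   # pick a minority sample at random
        neighbor = X_min[rng.choice(nn[j])]   # one of its k nearest neighbors
        gap = rng.random()                    # interpolation factor in [0, 1)
        out[i] = X_min[j] + gap * (neighbor - X_min[j])
    return out
```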

Results: While SMOTE seems beneficial in most cases with low-dimensional data, it does not attenuate the bias towards the majority class for most classifiers when data are high-dimensional, and it is less effective than random undersampling. SMOTE is beneficial for k-NN classifiers on high-dimensional data if the number of variables is first reduced by some type of variable selection; we explain why, without variable selection, k-NN classification is instead biased towards the minority class. Furthermore, we show that on high-dimensional data SMOTE does not change the class-specific mean values, while it decreases the data variability and introduces correlation between samples. We explain how these findings impact class prediction for high-dimensional data.
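This property of SMOTE (unchanged class-specific means, reduced variability) is easy to check empirically. The following sketch uses the imbalanced-learn package's SMOTE implementation on simulated high-dimensional Gaussian data; the sample sizes and dimension are illustrative choices, not the paper's exact simulation settings:

```python
import numpy as np
from imblearn.over_sampling import SMOTE   # assumes the imbalanced-learn package

rng = np.random.default_rng(0)
p = 1000                                       # p >> n, as in the paper's setting
X = np.vstack([rng.normal(size=(10, p)),       # minority class (label 1), n = 10
               rng.normal(size=(90, p))])      # majority class (label 0), n = 90
y = np.array([1] * 10 + [0] * 90)

# Oversample the minority class up to the majority-class size
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)

before, after = X[y == 1], X_res[y_res == 1]
print("max shift in per-variable minority means:",
      np.abs(before.mean(axis=0) - after.mean(axis=0)).max())
print("mean per-variable minority variance:",
      before.var(axis=0).mean(), "->", after.var(axis=0).mean())
```

On data like these, the shift in per-variable minority means stays near zero while the average per-variable variance of the minority class drops, consistent with the Results above.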

Conclusions: In practice, in the high-dimensional setting only k-NN classifiers based on the Euclidean distance seem to benefit substantially from the use of SMOTE, provided that variable selection is performed before using SMOTE; the benefit is larger if more neighbors are used. SMOTE for k-NN without variable selection should not be used, because it strongly biases the classification towards the minority class.
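The recommended workflow (variable selection, then SMOTE, then Euclidean k-NN) can be assembled as in the sketch below, assuming scikit-learn and imbalanced-learn; k=40 selected variables and n_neighbors=5 are illustrative choices, not values from the paper. A convenient property of imbalanced-learn's Pipeline is that the sampler runs only during fitting, so SMOTE never touches test data:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline     # imblearn's Pipeline supports samplers

# Variable selection first, then SMOTE, then Euclidean k-NN,
# following the order the Conclusions recommend.
clf = Pipeline([
    ("select", SelectKBest(f_classif, k=40)),
    ("smote", SMOTE(k_neighbors=5, random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=5, metric="euclidean")),
])

# Usage on simulated class-imbalanced data (p >> n):
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1000))
y = np.array([1] * 10 + [0] * 90)
clf.fit(X, y)
print(clf.predict(X[:5]))
```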


Figures

Figure 1. Effect of SMOTE and the number of variables on the Euclidean distance between test samples and training set samples. Left panel: distribution of the Euclidean distance between test and training set samples (original or SMOTE); right panel: proportion of SMOTE samples selected as nearest neighbors of test samples.

Figure 2. Classification results using low-dimensional data. Predictive accuracy (overall (PA) and class-specific (PA1, PA2)) achieved with SMOTE (black symbols) or without any class-imbalance correction (NC, gray symbols) for 7 types of classifiers, for different training set sample sizes (40, 80 or 200 samples).

Figure 3. Null-case classification results for high-dimensional data. Class-specific predictive accuracies (PA1, PA2) achieved with SMOTE (blue symbols), without any class-imbalance correction (small gray symbols) and with cut-off adjustment (large gray symbols) for 7 types of classifiers, varying the proportion of Class 1 samples in the training set (k1).

Figure 4. Alternative-hypothesis classification results for high-dimensional data. Symbols as in Figure 3.

Figure 5. Summary of results obtained on the simulated data. Green and red shading denote good and poor performance of the classifiers, respectively. Upward and downward arrows and the symbol ≈ denote improved, deteriorated or similar performance of the classifier when comparing SMOTE or an adjusted classification threshold (CO) with the uncorrected analysis (NC).

Figure 6. Class-specific predictive accuracies (PA1, PA2), AUC and G-mean for experimental data. NC: no correction, original data used; CUT-OFF: results obtained by changing the classification threshold; UNDER: simple undersampling.

Figure 7. Class-specific predictive accuracies for Sotiriou's data, varying class imbalance. Left panels: prediction of ER, with ER- as the minority class. Right panels: prediction of grade, with grade 3 as the minority class. The minority-class sample size is fixed at nmin = 5 (upper panels) or nmin = 10 (lower panels), while it varies for the majority class.
