BMC Bioinformatics. 2013 Feb 23;14:64.
doi: 10.1186/1471-2105-14-64.

Improved shrunken centroid classifiers for high-dimensional class-imbalanced data

Rok Blagus et al. BMC Bioinformatics.

Abstract

Background: PAM, a nearest shrunken centroid (NSC) method, is a popular classification method for high-dimensional data. ALP and AHP are NSC algorithms that were proposed to improve upon PAM. The NSC methods base their classification rules on shrunken centroids; in practice, the amount of shrinkage is estimated by minimizing the overall cross-validated (CV) error rate.
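The shrinkage step can be illustrated with a simplified sketch. It assumes that each class centroid has already been expressed as standardized differences from the overall centroid, and that shrinkage is applied by soft-thresholding with a single threshold; the names `soft_threshold`, `d_k`, and `delta`, and the numeric values, are hypothetical and not taken from the paper, and the standardization itself is omitted.

```python
import math

def soft_threshold(differences, delta):
    """Move each value toward zero by delta; values within delta of zero become 0."""
    shrunken = []
    for value in differences:
        magnitude = max(abs(value) - delta, 0.0)
        # Keep the original sign for values that survive the threshold.
        shrunken.append(math.copysign(magnitude, value) if magnitude > 0 else 0.0)
    return shrunken

# Hypothetical standardized differences for one class over four variables.
d_k = [2.5, -0.3, 0.8, -1.7]
shrunk = soft_threshold(d_k, delta=1.0)

# Variables whose shrunken difference is non-zero still contribute to the rule.
active = [i for i, v in enumerate(shrunk) if v != 0.0]
```

A larger `delta` zeroes out more variables, which is why the choice of shrinkage also determines how many variables the classifier uses.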

Results: We show that when data are class-imbalanced, the three NSC classifiers are biased towards the majority class. The bias grows with the number of variables and with the degree of class imbalance, and it also grows as the differences between classes become smaller. To diminish the class-imbalance problem of the NSC classifiers, we propose to estimate the amount of shrinkage by maximizing the CV geometric mean of the class-specific predictive accuracies (g-means).
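The difference between the two selection criteria can be sketched as follows. This is a minimal illustration, not the authors' implementation: the cross-validated predictions, the candidate thresholds, and the function names are all hypothetical, chosen only to show how minimizing the overall error rate and maximizing g-means can pick different amounts of shrinkage on imbalanced data.

```python
def class_accuracies(y_true, y_pred, classes):
    """Predictive accuracy computed separately for each class."""
    accuracies = {}
    for c in classes:
        members = [i for i, y in enumerate(y_true) if y == c]
        correct = sum(1 for i in members if y_pred[i] == c)
        accuracies[c] = correct / len(members)
    return accuracies

def g_means(y_true, y_pred, classes):
    """Geometric mean of the class-specific predictive accuracies."""
    product = 1.0
    for accuracy in class_accuracies(y_true, y_pred, classes).values():
        product *= accuracy
    return product ** (1.0 / len(classes))

def error_rate(y_true, y_pred):
    """Overall fraction of misclassified samples."""
    return sum(1 for t, p in zip(y_true, y_pred) if t != p) / len(y_true)

# Hypothetical data: 8 majority-class (0) and 2 minority-class (1) samples.
y_true = [0] * 8 + [1] * 2
# Hypothetical CV predictions obtained at two candidate shrinkage thresholds.
cv_predictions = {
    0.5: [0, 0, 0, 0, 0, 0, 1, 1, 1, 0],  # some errors in both classes
    2.0: [0] * 10,                        # everything assigned to the majority class
}

best_by_error = min(cv_predictions, key=lambda t: error_rate(y_true, cv_predictions[t]))
best_by_gmeans = max(cv_predictions, key=lambda t: g_means(y_true, cv_predictions[t], [0, 1]))
```

In this toy setting the threshold that assigns every sample to the majority class has the lower overall error rate (0.2 versus 0.3) but a g-means of zero, so the two criteria select different thresholds.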

Conclusions: The results obtained on simulated and real high-dimensional class-imbalanced data show that our approach outperforms the currently used strategy, which minimizes the overall error rate, when NSC classifiers are biased towards the majority class. The number of variables included in the NSC classifiers is much smaller with our approach than with the original approach, a result supported by the experiments on both the simulated and the real data.

Figures

Figure 1. Probability of classification of a new sample in the minority class and the classification error as a function of the number of variables. The figure shows the probability of classification of a new sample in the minority class (left panel) and the classification error (right panel) as a function of the number of variables for the example presented in the main text.

Figure 2. Classification results under the alternative hypothesis for the NSC and GM-NSC classifiers. The figure shows class-specific predictive accuracies (PA_1 and PA_2) for different levels of class imbalance (k_1) in the training set. The differences between the classes were small (upper panel: μ_2 = 0.5) or moderate (lower panel: μ_2 = 1). See text for details.

Figure 3. Classification results on the Sotiriou data set. The figure shows PA for the ER+ class (PA_ER+) and PA for the ER- class (PA_ER-) for different numbers of ER+ samples in the training set (n_ER+). There were 10 ER- samples in each training set. See text for more details.
