Biom J. 2017 Sep;59(5):948-966.
doi: 10.1002/bimj.201600207. Epub 2017 Jun 19.

Multivariate binary classification of imbalanced datasets: A case study based on high-dimensional multiplex autoimmune assay data

Laura Schlieker et al. Biom J. 2017 Sep.

Abstract

Classifying a population by a specific trait is a major task in medicine, for example in diagnostic settings, where groups of patients with specific diseases are identified, and in predictive medicine, where patients are assigned to disease severity classes that might benefit from different treatments. When those subgroups are small, as in rare diseases, imbalance between the classes is the rule rather than the exception, and it makes statistical classification problematic: many observations are assigned to the majority class, so the error rate of the minority class is high while the error rate of the majority class stays low. This case study investigates class imbalance for Random Forests and Powered Partial Least Squares Discriminant Analysis (PPLS-DA) and evaluates the performance of these classifiers when they are combined with methods that compensate for imbalance (sampling methods and cost-sensitive learning approaches). We evaluate all approaches with a scoring system based on the classification results. The case study uses one high-dimensional multiplex autoimmune assay dataset that describes the immune response to antigens and comprises two classes of patients: Rheumatoid Arthritis (RA) and Systemic Lupus Erythematosus (SLE). Datasets with varying degrees of imbalance are created by successively reducing the number of RA patients. Our results indicate a possible benefit of cost-sensitive learning approaches for Random Forests. Although further research is needed to verify our findings on other datasets or in large-scale simulation studies, we claim that this work can raise practitioners' awareness of the class imbalance problem, and it stresses the importance of considering methods to compensate for class imbalance.

Keywords: Cost-sensitive learning; Imbalanced data; PPLS-DA; Random Forests; Sampling.
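
The imbalance effect described in the abstract can be illustrated with a minimal sketch. This is not the authors' procedure or data: the labels are synthetic, the sample counts are made up, and the helper names `reduce_class` and `per_class_error` are hypothetical. The sketch successively reduces the RA class and scores a trivial classifier that always predicts the majority class, assuming only the Python standard library.

```python
import random


def reduce_class(labels, target, keep, seed=0):
    """Return sorted indices keeping every non-target sample and
    `keep` randomly chosen samples of the `target` class."""
    rng = random.Random(seed)
    target_idx = [i for i, y in enumerate(labels) if y == target]
    other_idx = [i for i, y in enumerate(labels) if y != target]
    return sorted(other_idx + rng.sample(target_idx, keep))


def per_class_error(y_true, y_pred):
    """Error rate per class: the fraction of each class's samples
    that were assigned to a different class."""
    errors = {}
    for cls in set(y_true):
        preds = [p for t, p in zip(y_true, y_pred) if t == cls]
        errors[cls] = sum(p != cls for p in preds) / len(preds)
    return errors


# Synthetic, balanced starting labels (counts are illustrative only,
# not taken from the study).
labels = ["RA"] * 100 + ["SLE"] * 100

for keep in (100, 50, 10):
    idx = reduce_class(labels, "RA", keep)
    y_true = [labels[i] for i in idx]
    # Trivial baseline: always predict SLE, the majority class
    # once RA has been reduced.
    y_pred = ["SLE"] * len(y_true)
    overall = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)
    print(keep, round(overall, 3), per_class_error(y_true, y_pred))
```

In this toy setting the overall error rate falls from 0.5 to roughly 0.09 as the RA class shrinks, while the RA error rate stays at 1.0. This is why an overall error rate alone can hide poor minority-class performance, and why per-class error rates are worth reporting.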
