PLoS One. 2017 Aug 3;12(8):e0181853.

doi: 10.1371/journal.pone.0181853. eCollection 2017.

Detecting representative data and generating synthetic samples to improve learning accuracy with imbalanced data sets

Der-Chiang Li¹, Susan C Hu², Liang-Sian Lin³, Chun-Wu Yeh⁴

Affiliations

¹ Department of Industrial and Information Management, College of Management, National Cheng Kung University, Tainan City, Taiwan, R.O.C.
² Department of Public Health, College of Medicine, National Cheng Kung University, Tainan City, Taiwan, R.O.C.
³ Information and Communications Research Laboratories, Industrial Technology Research Institute, Hsinchu, Taiwan, R.O.C.
⁴ Department of Information Management, College of Information Technology, Kun Shan University, Yongkang Dist., Tainan City, Taiwan.

PMID: 28771522
PMCID: PMC5542532
DOI: 10.1371/journal.pone.0181853

Der-Chiang Li et al. PLoS One. 2017.

Abstract

Learning models struggle to achieve high classification performance with imbalanced data sets: when one class is much larger than the others, most machine learning and data mining classifiers are overly influenced by the larger classes and ignore the smaller ones. As a result, classification algorithms often perform poorly because they converge slowly on the smaller classes. To balance such data sets, this paper presents a strategy that reduces the size of the majority data and generates synthetic samples for the minority data. In the reduction step, we use the box-and-whisker plot approach to exclude outliers and the Mega-Trend-Diffusion method to find representative data in the majority data. To generate the synthetic samples, we propose a counterintuitive hypothesis to find the distribution shape of the minority data, and then produce samples according to this distribution. Four real data sets were used to examine the performance of the proposed approach. We used paired t-tests to compare the Accuracy, G-mean, and F-measure scores of the proposed data pre-processing (PPDP) method combined with the D3C method (PPDP+D3C) with those of one-sided selection (OSS), the well-known SMOTEBoost (SB) method, the normal distribution-based oversampling (NDO) approach, and the PPDP method alone. The results indicate that the proposed approach achieves better classification performance than the above-mentioned methods.
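As an illustration only, the sketch below shows two ingredients named in the abstract: a box-and-whisker rule for excluding outliers, and the Accuracy, G-mean, and F-measure scores. The abstract gives no formulas, so several details are assumptions: the conventional 1.5 × IQR whisker length, quartiles computed by linear interpolation, the minority class treated as the positive class, and the function names themselves. It does not cover the Mega-Trend-Diffusion step or the synthetic-sample generation, and it should not be read as the authors' implementation.

```python
import math


def quartile(sorted_vals, q):
    """Quantile by linear interpolation between order statistics (assumed rule)."""
    pos = (len(sorted_vals) - 1) * q
    lo = math.floor(pos)
    hi = math.ceil(pos)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def drop_boxplot_outliers(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is an assumed default."""
    s = sorted(values)
    q1, q3 = quartile(s, 0.25), quartile(s, 0.75)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


def scores(tp, fn, fp, tn):
    """Accuracy, G-mean and F-measure, with the minority class as positive."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    recall = tp / (tp + fn) if tp + fn else 0.0       # minority-class accuracy
    specificity = tn / (tn + fp) if tn + fp else 0.0  # majority-class accuracy
    precision = tp / (tp + fp) if tp + fp else 0.0
    g_mean = math.sqrt(recall * specificity)
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return accuracy, g_mean, f_measure


# A single feature of the majority class with one extreme value.
kept = drop_boxplot_outliers([10, 11, 12, 13, 14, 15, 100])

# A classifier that labels all 100 samples as majority (10 minority, 90 majority).
biased = scores(tp=0, fn=10, fp=0, tn=90)
```

With these hypothetical counts, the always-majority classifier scores 0.9 on Accuracy but 0 on both G-mean and F-measure, which is why the latter two are reported alongside Accuracy for imbalanced data.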

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Fig 2. The proposed procedure for learning imbalanced data sets.**

**Fig 3. The testing procedure for imbalanced data sets.**

Cited by

Exploring the Interplay of Dataset Size and Imbalance on CNN Performance in Healthcare: Using X-rays to Identify COVID-19 Patients.
Davidian M, Lahav A, Joshua BZ, Wand O, Lurie Y, Mark S. Diagnostics (Basel). 2024 Aug 8;14(16):1727. doi: 10.3390/diagnostics14161727. PMID: 39202215. Free PMC article.
Radiologist observations of computed tomography (CT) images predict treatment outcome in TB Portals, a real-world database of tuberculosis (TB) cases.
Rosenfeld G, Gabrielian A, Wang Q, Gu J, Hurt DE, Long A, Rosenthal A. PLoS One. 2021 Mar 17;16(3):e0247906. doi: 10.1371/journal.pone.0247906. eCollection 2021. PMID: 33730021. Free PMC article.
Prediction of Neurological Outcomes in Out-of-hospital Cardiac Arrest Survivors Immediately after Return of Spontaneous Circulation: Ensemble Technique with Four Machine Learning Models.
Heo JH, Kim T, Shin J, Suh GJ, Kim J, Jung YS, Park SM, Kim S; For SNU CARE investigators. J Korean Med Sci. 2021 Jul 19;36(28):e187. doi: 10.3346/jkms.2021.36.e187. PMID: 34282605. Free PMC article.
Cardiovascular Disease Prediction by Machine Learning Algorithms Based on Cytokines in Kazakhs of China.
Jiang Y, Zhang X, Ma R, Wang X, Liu J, Keerman M, Yan Y, Ma J, Song Y, Zhang J, He J, Guo S, Guo H. Clin Epidemiol. 2021 Jun 9;13:417-428. doi: 10.2147/CLEP.S313343. eCollection 2021. PMID: 34135637. Free PMC article.
Cellular frustration algorithms for anomaly detection applications.
Faria B, Vistulo de Abreu F. PLoS One. 2019 Jul 8;14(7):e0218930. doi: 10.1371/journal.pone.0218930. eCollection 2019. PMID: 31283758. Free PMC article.
