Sci Rep. 2024 Dec 28;14(1):31058.

doi: 10.1038/s41598-024-82253-6.

Clustering and classification for dry bean feature imbalanced data

Chou-Yuan Lee¹, Wei Wang², Jian-Qiong Huang³

Affiliations

¹ School of Big Data, Fuzhou University of International Studies and Trade, Fuzhou, 350202, China. lqy@fzfu.edu.cn.
² School of Software, Yunnan University, Kunming, 650000, China.
³ School of Big Data, Fuzhou University of International Studies and Trade, Fuzhou, 350202, China.

PMID: 39730714
PMCID: PMC11681048
DOI: 10.1038/s41598-024-82253-6

Chou-Yuan Lee et al. Sci Rep. 2024.

Abstract

Traditional machine learning methods such as decision tree (DT), random forest (RF), and support vector machine (SVM) often classify imbalanced data poorly. This paper proposes an algorithm, evaluated on a dry bean dataset and an obesity levels dataset, that balances the minority and majority classes and adds a clustering step. The aim is to improve the classification accuracy of traditional machine learning methods on imbalanced data, along with performance indicators such as precision, recall, F1-score, and area under the curve (AUC). The key idea is to combine two techniques: the borderline-synthetic minority oversampling technique (BLSMOTE), which generates new samples from minority class samples lying on the class boundary and so reduces the impact of noise on model building, and K-means clustering, which divides the data into groups according to similarities or common features. The results show that the proposed BLSMOTE + K-means + SVM algorithm outperforms the other traditional machine learning methods in classification accuracy and the other performance indicators. BLSMOTE + K-means + DT generates decision rules for the dry bean dataset and the obesity levels dataset, and BLSMOTE + K-means + RF ranks the importance of the explanatory variables. These experimental results can provide scientific evidence for decision-makers.
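The two preprocessing steps named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the toy data, the function names (`nearest`, `borderline_smote`, `kmeans`), and all parameter values are assumptions made for illustration, and the final classifier (DT, RF, or SVM) is left out. The oversampling follows the usual Borderline-SMOTE1 rule, in which a minority sample counts as borderline when at least half, but not all, of its nearest neighbours belong to other classes.

```python
import math
import random


def nearest(point, pool, k):
    """Indices of the k entries of pool closest to point (Euclidean distance)."""
    return sorted(range(len(pool)), key=lambda i: math.dist(point, pool[i]))[:k]


def borderline_smote(X, y, minority, k=3, n_new=6, seed=0):
    """Borderline-SMOTE1-style oversampling for a single minority class."""
    rng = random.Random(seed)
    min_idx = [i for i, label in enumerate(y) if label == minority]
    danger = []
    for i in min_idx:
        rest = [j for j in range(len(X)) if j != i]
        neigh = [rest[n] for n in nearest(X[i], [X[j] for j in rest], k)]
        n_other = sum(1 for j in neigh if y[j] != minority)
        # Borderline: at least half, but not all, neighbours are from other classes.
        # Samples whose neighbours are all foreign are treated as noise and skipped.
        if k / 2 <= n_other < k:
            danger.append(i)
    synthetic = []
    if not danger or len(min_idx) < 2:
        return synthetic
    for _ in range(n_new):
        i = rng.choice(danger)
        mates = [j for j in min_idx if j != i]
        j = mates[rng.choice(nearest(X[i], [X[m] for m in mates], k))]
        gap = rng.random()  # position along the segment from X[i] to X[j]
        synthetic.append([a + gap * (b - a) for a, b in zip(X[i], X[j])])
    return synthetic


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's K-means; returns the last assignments and the centres."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its closest centre.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: each centre moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical toy data: six majority samples (label 0), four minority (label 1).
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1], [1, 2],
     [2, 2], [3, 3], [3, 2.5], [2.5, 3]]
y = [0] * 6 + [1] * 4

synthetic = borderline_smote(X, y, minority=1, k=3, n_new=2)
X_bal = X + synthetic
y_bal = y + [1] * len(synthetic)
labels, centers = kmeans(X_bal, k=2)
print(len(X_bal), y_bal.count(0), y_bal.count(1))
```

In this toy set only the minority sample at (2, 2) sits on the class boundary, so every synthetic point is interpolated from it toward one of its minority neighbours. How the cluster assignments feed the downstream classifier is not stated in the abstract, so the sketch stops at computing them.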

Keywords: BLSMOTE; Decision tree; Imbalanced data; K-means; Random forest; Support vector machine.

Conflict of interest statement

Declarations. Competing interests: The authors declare no competing interests.

Figures

**Fig. 1**
The process of K-means heuristic approach.

**Fig. 2**
The category 1–7 sampling diagram of the target variable.

**Fig. 3**
The flow chart of BLSMOTE + K-means + machine learning approaches.

**Fig. 4**
The ROC curve and AUC value of BLSMOTE + K-means + SVM for dry bean dataset.

**Fig. 5**
The ROC curve and AUC value of BLSMOTE + K-means + SVM for obesity levels dataset.

**Fig. 6**
The dry bean data is divided into 7 clusters using feature (Area) through K-means.

**Fig. 7**
The clustering diagram of dry bean dataset.

**Fig. 8**
The obesity levels data are divided into 7 clusters using feature (Weight) through K-means.

**Fig. 9**
The clustering diagram of obesity levels dataset.

**Fig. 10**
The BLSMOTE + K-means + DT training set decision diagram of dry bean dataset.

**Fig. 11**
The BLSMOTE + K-means + DT training set decision diagram of obesity levels dataset.

**Fig. 12**
The average impurity reduction value of dry bean features of the BLSMOTE + K-means + RF.

**Fig. 13**
The BLSMOTE + K-means + RF feature importance ranking diagram of dry bean.

**Fig. 14**
The average impurity reduction value of obesity levels features of the BLSMOTE + K-means + RF.

**Fig. 15**
The BLSMOTE + K-means + RF feature importance ranking diagram of obesity levels dataset.
