Sci Rep. 2024 Dec 28;14(1):31058.
doi: 10.1038/s41598-024-82253-6.

Clustering and classification for dry bean feature imbalanced data

Chou-Yuan Lee et al. Sci Rep. 2024.

Abstract

Traditional machine learning methods such as decision tree (DT), random forest (RF), and support vector machine (SVM) show low classification performance on imbalanced data. This paper proposes an algorithm, evaluated on the dry bean dataset and the obesity levels dataset, that balances the minority and majority classes and adds a clustering step in order to improve classification accuracy and other performance indicators such as precision, recall, F1-score, and area under the curve (AUC). The key idea is to combine two techniques. The borderline-synthetic minority oversampling technique (BLSMOTE) generates new samples from the samples on the boundary of the minority class, which reduces the impact of noise on model building. K-means clustering divides the data into groups according to similarities or common features. The results show that the proposed algorithm, BLSMOTE + K-means + SVM, is superior to the traditional machine learning methods in classification accuracy and the other performance indicators. BLSMOTE + K-means + DT generates decision rules for the dry bean dataset and the obesity levels dataset, and BLSMOTE + K-means + RF ranks the importance of the explanatory variables. These experimental results can provide scientific evidence for decision-makers.

Keywords: BLSMOTE; Decision tree; Imbalanced data; K-means; Random forest; Support vector machine.
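
The borderline oversampling idea summarized in the abstract can be illustrated with a minimal sketch. This is a generic, textbook-style Borderline-SMOTE written in plain Python, not the authors' implementation: the function names (`danger_indices`, `borderline_smote`), the parameter defaults, and the toy data are all assumptions made for illustration. A minority sample is treated as "borderline" when at least half, but not all, of its k nearest neighbours belong to another class; new samples are then interpolated between a borderline sample and one of its nearest minority neighbours.

```python
import math
import random


def danger_indices(X, y, minority, k=3):
    """Indices of minority samples that sit on the class boundary."""
    danger = []
    for i, label in enumerate(y):
        if label != minority:
            continue
        others = [j for j in range(len(X)) if j != i]
        nearest = sorted(others, key=lambda j: math.dist(X[i], X[j]))[:k]
        n_major = sum(1 for j in nearest if y[j] != minority)
        # Borderline rule: at least half, but not all, of the k neighbours
        # belong to another class. All-majority neighbourhoods count as noise.
        if k / 2 <= n_major < k:
            danger.append(i)
    return danger


def borderline_smote(X, y, minority, k=3, n_new=4, seed=0):
    """Return n_new synthetic minority points built from borderline samples."""
    rng = random.Random(seed)
    min_idx = [i for i, label in enumerate(y) if label == minority]
    danger = danger_indices(X, y, minority, k)
    if not danger or len(min_idx) < 2:
        return []
    synthetic = []
    for _ in range(n_new):
        i = rng.choice(danger)
        mates = [j for j in min_idx if j != i]
        nearest = sorted(mates, key=lambda j: math.dist(X[i], X[j]))[:k]
        j = rng.choice(nearest)
        gap = rng.random()  # position along the segment from X[i] to X[j]
        synthetic.append([a + gap * (b - a) for a, b in zip(X[i], X[j])])
    return synthetic


# Hypothetical toy data: class 0 is the majority, class 1 the minority.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1], [0, 2], [1, 2],
     [3, 0.5], [3.5, 0.5], [5, 5], [5, 6], [6, 5]]
y = [0] * 8 + [1] * 5

borderline = danger_indices(X, y, minority=1)
new_points = borderline_smote(X, y, minority=1)
print(borderline)
print(new_points)
```

In this toy set only the two minority points next to the majority block are flagged as borderline, so every synthetic point starts from one of them. In practice one would more likely rely on a maintained implementation such as `BorderlineSMOTE` from imbalanced-learn, followed by scikit-learn's `KMeans` and a classifier; how the paper chains these steps is not specified in the abstract.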

Conflict of interest statement

Declarations. Competing interests: The authors declare no competing interests.

Figures

Fig. 1. The process of the K-means heuristic approach.
Fig. 2. The category 1–7 sampling diagram of the target variable.
Fig. 3. The flow chart of the BLSMOTE + K-means + machine learning approaches.
Fig. 4. The ROC curve and AUC value of BLSMOTE + K-means + SVM for the dry bean dataset.
Fig. 5. The ROC curve and AUC value of BLSMOTE + K-means + SVM for the obesity levels dataset.
Fig. 6. The dry bean data divided into 7 clusters by K-means using the feature Area.
Fig. 7. The clustering diagram of the dry bean dataset.
Fig. 8. The obesity levels data divided into 7 clusters by K-means using the feature Weight.
Fig. 9. The clustering diagram of the obesity levels dataset.
Fig. 10. The BLSMOTE + K-means + DT training set decision diagram for the dry bean dataset.
Fig. 11. The BLSMOTE + K-means + DT training set decision diagram for the obesity levels dataset.
Fig. 12. The average impurity reduction value of the dry bean features for BLSMOTE + K-means + RF.
Fig. 13. The BLSMOTE + K-means + RF feature importance ranking diagram for the dry bean dataset.
Fig. 14. The average impurity reduction value of the obesity levels features for BLSMOTE + K-means + RF.
Fig. 15. The BLSMOTE + K-means + RF feature importance ranking diagram for the obesity levels dataset.
