A hybrid approach for binary and multi-class classification of voice disorders using a pre-trained model and ensemble classifiers
- PMID: 40312383
- PMCID: PMC12044829
- DOI: 10.1186/s12911-025-02978-w
Abstract
Recent advances in artificial intelligence-based audio and speech processing have increasingly focused on the binary and multi-class classification of voice disorders. Despite progress, achieving high accuracy in multi-class classification remains challenging. This paper proposes a novel hybrid approach using a two-stage framework to enhance voice disorder classification performance and achieve state-of-the-art accuracy in multi-class classification. Our hybrid approach combines deep learning features with several powerful classifiers. In the first stage, high-level feature embeddings are extracted from voice-data spectrograms using a pre-trained VGGish model. In the second stage, these embeddings are used as input to four different classifiers: Support Vector Machine (SVM), Logistic Regression (LR), Multi-Layer Perceptron (MLP), and an Ensemble Classifier (EC). Experiments are conducted on a subset of the Saarbruecken Voice Database (SVD) for male, female, and combined speakers. For binary classification, VGGish-SVM achieved the highest accuracy for male speakers (82.45% for healthy vs. disordered; 75.45% for hyperfunctional dysphonia vs. vocal fold paresis), while VGGish-EC performed best for female speakers (71.54% for healthy vs. disordered; 68.42% for hyperfunctional dysphonia vs. vocal fold paresis). In multi-class classification, VGGish-SVM outperformed the other models, achieving mean accuracies of 77.81% for male speakers, 63.11% for female speakers, and 70.53% for combined genders. We conducted a comparative analysis against related works, including Mel-frequency cepstral coefficients (MFCCs), MFCC-glottal features, and features extracted using the wav2vec and HuBERT models with an SVM classifier. The results demonstrate that our hybrid approach consistently outperforms these models, especially in multi-class classification tasks. They also show the feasibility of a hybrid framework for voice disorder classification, offering a foundation for refining automated tools that, with further validation, could support clinical assessments.
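The following is a minimal sketch of the second stage of the pipeline described above, assuming 128-dimensional VGGish embeddings have already been extracted from the spectrograms; the synthetic arrays, the chosen hyperparameters, and the soft-voting composition of the ensemble are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch: second-stage classifiers (SVM, LR, MLP, Ensemble) on
# precomputed VGGish embeddings. The random arrays below are placeholders
# for real 128-dimensional VGGish features and disorder labels.
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 128))    # placeholder for VGGish embeddings (128-d)
y = rng.integers(0, 3, size=200)   # placeholder labels: 3 classes (e.g. healthy,
                                   # hyperfunctional dysphonia, vocal fold paresis)

# Each classifier is wrapped in a pipeline that standardizes the embeddings.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
lr  = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
mlp = make_pipeline(StandardScaler(), MLPClassifier(hidden_layer_sizes=(64,), max_iter=500))

# Ensemble classifier: soft voting over the three base models (an assumption
# about the EC's composition for illustration purposes).
ensemble = VotingClassifier(
    estimators=[("svm", svm), ("lr", lr), ("mlp", mlp)], voting="soft"
)

for name, clf in [("SVM", svm), ("LR", lr), ("MLP", mlp), ("Ensemble", ensemble)]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```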
Keywords: Ensemble classifier; Multi-class classification; VGGish; Voice disorders.
© 2025. The Author(s).
Conflict of interest statement
Declarations. Ethics approval and consent to participate: Not applicable. Consent for publication: Not applicable. Clinical trial number: Not applicable. Competing interests: The authors declare no competing interests.