2025 May 1;25(1):177.
doi: 10.1186/s12911-025-02978-w.

A hybrid approach for binary and multi-class classification of voice disorders using a pre-trained model and ensemble classifiers


Mehtab Ur Rahman et al. BMC Med Inform Decis Mak. .

Abstract

Recent advances in artificial intelligence-based audio and speech processing have increasingly focused on the binary and multi-class classification of voice disorders. Despite this progress, achieving high accuracy in multi-class classification remains challenging. This paper proposes a novel hybrid approach using a two-stage framework to enhance voice disorder classification performance and achieve state-of-the-art accuracy in multi-class classification. Our hybrid approach combines deep learning features with several powerful classifiers. In the first stage, high-level feature embeddings are extracted from voice-data spectrograms using a pre-trained VGGish model. In the second stage, these embeddings are fed to four different classifiers: a Support Vector Machine (SVM), Logistic Regression (LR), a Multi-Layer Perceptron (MLP), and an Ensemble Classifier (EC). Experiments are conducted on a subset of the Saarbruecken Voice Database (SVD) for male, female, and combined speakers. In binary classification, VGGish-SVM achieved the highest accuracy for male speakers (82.45% for healthy vs. disordered; 75.45% for hyperfunctional dysphonia vs. vocal fold paresis), while VGGish-EC performed best for female speakers (71.54% for healthy vs. disordered; 68.42% for hyperfunctional dysphonia vs. vocal fold paresis). In multi-class classification, VGGish-SVM outperformed the other models, achieving mean accuracies of 77.81% for male speakers, 63.11% for female speakers, and 70.53% for combined speakers. We conducted a comparative analysis against related work, including Mel-frequency cepstral coefficients (MFCCs), MFCC-glottal features, and features extracted by the wav2vec and HuBERT models with an SVM classifier. The results demonstrate that our hybrid approach consistently outperforms these models, especially in multi-class classification tasks.
The results show the feasibility of a hybrid framework for voice disorder classification, offering a foundation for automated tools that, with further validation, could support clinical assessment.
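The two-stage pipeline described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' code: stage 1 (VGGish embedding extraction) is stubbed with random 128-dimensional vectors, since the real pre-trained VGGish model maps each log-mel spectrogram frame of a recording to a 128-dimensional embedding; and the ensemble's composition (soft voting over the SVM, LR, and MLP) is an assumption, as the abstract does not specify the EC's members.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def extract_vggish_embeddings(n_samples: int) -> np.ndarray:
    """Placeholder for stage 1: in the paper, a pre-trained VGGish model
    turns voice-recording spectrograms into 128-dim embeddings."""
    return rng.normal(size=(n_samples, 128))

# Toy 3-class task mirroring the paper's labels: 0 = healthy,
# 1 = hyperfunctional dysphonia, 2 = vocal fold paresis.
X = extract_vggish_embeddings(300)
y = rng.integers(0, 3, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Stage 2: the four classifiers compared in the paper.
svm = SVC(probability=True, random_state=0)
lr = LogisticRegression(max_iter=1000)
mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
ec = VotingClassifier([("svm", svm), ("lr", lr), ("mlp", mlp)], voting="soft")

for name, clf in [("SVM", svm), ("LR", lr), ("MLP", mlp), ("EC", ec)]:
    clf.fit(X_tr, y_tr)
    print(f"VGGish-{name}: accuracy {clf.score(X_te, y_te):.2f}")
```

With real VGGish embeddings substituted for the stub, the same loop reproduces the paper's model comparison on a held-out split; accuracies on the random toy data are, of course, near chance.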

Keywords: Ensemble classifier; Multi-class classification; VGGish; Voice disorders.


Conflict of interest statement

Declarations. Ethics approval and consent to participate: Not applicable. Consent for publication: Not applicable. Clinical trial number: Not applicable. Competing interests: The authors declare no competing interests.

Figures

Fig. 1 The proposed voice disorder classification system
Fig. 2 VGGish model architecture
Fig. 3 Ensemble classifier
Fig. 4 Normalized confusion matrix for healthy vs. disordered. Predicted classes are on the horizontal axis and true classes on the vertical axis. Class labels: 0 = healthy, 1 = disordered
Fig. 5 Normalized confusion matrix for hyperfunctional dysphonia vs. vocal fold paresis. Predicted classes are on the horizontal axis and true classes on the vertical axis. Class labels: 0 = hyperfunctional dysphonia, 1 = vocal fold paresis
Fig. 6 Normalized confusion matrix for multi-class classification. Predicted classes are on the horizontal axis and true classes on the vertical axis. Class labels: 0 = healthy, 1 = hyperfunctional dysphonia, 2 = vocal fold paresis
