Sci Rep. 2025 Apr 7;15(1):11824. doi: 10.1038/s41598-025-95734-z.

Speech emotion recognition with light weight deep neural ensemble model using hand crafted features


Jaher Hassan Chowdhury et al.

Abstract

Automatic emotion detection has become crucial in domains such as healthcare, neuroscience, smart home technologies, and human-computer interaction (HCI). Speech Emotion Recognition (SER) has attracted considerable attention because of its potential to improve conversational robotics and HCI systems. Despite its promise, SER research faces challenges such as data scarcity, the subjective nature of emotions, and complex feature extraction methods. In this paper, we investigate whether a lightweight deep neural ensemble model (CNN and CNN_Bi-LSTM) using well-known hand-crafted features such as ZCR, RMSE, Chroma STFT, and MFCC can outperform models that rely on automatic feature extraction (e.g., spectrogram-based methods) on benchmark datasets. The paper also focuses on the effectiveness of carefully fine-tuning the neural models with learning rate (LR) schedulers and regularization techniques. Our proposed ensemble model is validated on five publicly available datasets: RAVDESS, TESS, SAVEE, CREMA-D, and EmoDB. Accuracy, AUC-ROC, AUC-PRC, and F1-score were used for performance evaluation, and LIME (Local Interpretable Model-agnostic Explanations) was used to interpret the predictions of the proposed ensemble model. Results indicate that our ensemble model consistently outperforms the individual models, as well as several comparison models, including spectrogram-based ones, across these datasets and evaluation metrics.
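
As a rough illustration of the hand-crafted feature pipeline named in the abstract (ZCR, RMSE, Chroma STFT, MFCC), the sketch below shows how such per-utterance features are commonly extracted with the librosa library. The time-averaging aggregation, the choice of 40 MFCC coefficients, and the function name extract_features are assumptions for illustration, not necessarily the authors' exact configuration.

    import numpy as np
    import librosa

    def extract_features(path, sr=22050, n_mfcc=40):
        # Load the utterance as a mono waveform at a fixed sampling rate.
        y, _ = librosa.load(path, sr=sr)

        # Frame-level hand-crafted features named in the abstract.
        zcr = librosa.feature.zero_crossing_rate(y)             # shape (1, T)
        rmse = librosa.feature.rms(y=y)                          # shape (1, T)
        chroma = librosa.feature.chroma_stft(y=y, sr=sr)         # shape (12, T)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape (n_mfcc, T)

        # Average each feature over time frames and concatenate into one
        # fixed-length vector per utterance (1 + 1 + 12 + n_mfcc values).
        return np.concatenate([
            zcr.mean(axis=1),
            rmse.mean(axis=1),
            chroma.mean(axis=1),
            mfcc.mean(axis=1),
        ])
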

Keywords: Audio signal processing; Averaging ensemble; Bi-directional LSTM; Convolutional neural network; Speech emotion recognition.
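
The "averaging ensemble" keyword refers to soft-voting over the two base models (CNN and CNN_Bi-LSTM). A minimal sketch of such an average, assuming both models emit per-class probabilities and are weighted equally, is shown below; the equal weighting is an assumption, not necessarily the authors' setting.

    import numpy as np

    def average_ensemble(prob_cnn, prob_cnn_bilstm, weights=(0.5, 0.5)):
        # prob_cnn, prob_cnn_bilstm: arrays of shape (n_samples, n_classes)
        # holding each base model's predicted class probabilities.
        w1, w2 = weights
        avg_prob = w1 * np.asarray(prob_cnn) + w2 * np.asarray(prob_cnn_bilstm)
        # Predicted emotion = class with the highest averaged probability.
        return avg_prob.argmax(axis=1)
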


Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

Fig. 1. Overview of the proposed approach.
Fig. 2. Data augmentation and feature scaling process.
Fig. 3. Training and validation loss while training the 1D CNN model on the SAVEE dataset.
Fig. 4. LIME explanations of model predictions across different datasets.
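
Fig. 4 shows LIME explanations of the model's predictions. A minimal sketch of how LIME is typically applied to fixed-length feature vectors with the lime package follows; X_train, X_test, predict_proba_fn, and the feature/class names are placeholders, and the paper's exact LIME configuration may differ.

    from lime.lime_tabular import LimeTabularExplainer

    # X_train: 2-D array of per-utterance feature vectors (hypothetical here);
    # predict_proba_fn: callable mapping such vectors to class probabilities,
    # e.g. the averaged ensemble output wrapped as a function.
    explainer = LimeTabularExplainer(
        X_train,
        feature_names=feature_names,   # e.g. "zcr", "rmse", "chroma_1", ..., "mfcc_40"
        class_names=class_names,       # the dataset's emotion labels
        mode="classification",
    )

    # Explain one test utterance: which features pushed the prediction up or down.
    explanation = explainer.explain_instance(X_test[0], predict_proba_fn, num_features=10)
    print(explanation.as_list())
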
