Sci Rep. 2024 Jun 7;14(1):13126. doi: 10.1038/s41598-024-63776-4.

An enhanced speech emotion recognition using vision transformer


Samson Akinpelu et al. Sci Rep. 2024.

Abstract

In human-computer interaction systems, speech emotion recognition (SER) plays a crucial role because it enables computers to understand and react to users' emotions. In the past, SER has placed significant emphasis on acoustic properties extracted from speech signals. Recent developments in deep learning and computer vision, however, have made it possible to enhance SER performance with visual signals. This work proposes a novel method for improving speech emotion recognition using a lightweight Vision Transformer (ViT) model. We leverage the ViT model's ability to capture spatial dependencies and high-level features in images, which are adequate indicators of emotional states, from the mel spectrogram input fed into the model. To determine the efficiency of our proposed approach, we conduct comprehensive experiments on two benchmark speech emotion datasets, the Toronto Emotional Speech Set (TESS) and the Berlin Emotional Database (EMODB), as well as their combination (TESS-EMODB). The results demonstrate a considerable improvement in speech emotion recognition accuracy and attest to the approach's generalizability: it achieves 98% accuracy on TESS, 91% on EMODB, and 93% on the combined TESS-EMODB dataset. The comparative experiments show that the non-overlapping patch-based feature extraction method substantially improves recognition performance. Our research indicates the potential of integrating vision transformer models into SER systems, outperforming other state-of-the-art techniques and opening up fresh opportunities for real-world applications that require accurate emotion recognition from speech.
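As a minimal illustration of the pipeline the abstract describes, the sketch below computes a log-mel spectrogram from a speech clip and cuts it into non-overlapping patches via a linear (convolutional) projection, the standard ViT front end. It assumes librosa and PyTorch; all hyperparameters (sample rate, n_mels=128, patch_size=16, embed_dim=192) are illustrative placeholders, not the paper's reported configuration.

```python
# Hedged sketch of the mel-spectrogram -> ViT patch-embedding pipeline.
# Hyperparameters below are illustrative assumptions, not the paper's values.
import librosa
import numpy as np
import torch
import torch.nn as nn

def log_mel_spectrogram(wav_path: str, sr: int = 16000, n_mels: int = 128) -> np.ndarray:
    """Load a speech clip and return a log-scaled mel spectrogram (n_mels x frames)."""
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)

class PatchEmbedding(nn.Module):
    """Non-overlapping patch embedding: a conv whose stride equals its kernel
    size tiles the spectrogram 'image' into patches and projects each patch
    to one token vector for the transformer encoder."""
    def __init__(self, patch_size: int = 16, embed_dim: int = 192):
        super().__init__()
        self.proj = nn.Conv2d(1, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, n_mels, frames) -> (batch, num_patches, embed_dim)
        x = self.proj(x)                     # (batch, embed_dim, H', W')
        return x.flatten(2).transpose(1, 2)  # one token per patch

# Usage with a stand-in tensor (a real mel spectrogram would be cropped or
# resized so both sides are multiples of the patch size):
spec = torch.randn(1, 1, 128, 128)
tokens = PatchEmbedding()(spec)
print(tokens.shape)  # torch.Size([1, 64, 192])
```

Setting the stride equal to the kernel size is what makes the patches non-overlapping: each 16x16 region of the spectrogram contributes exactly one token, matching the patch-based feature extraction the abstract credits for the accuracy gains.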

Keywords: CNN; Deep learning; Human–computer interaction; Mel spectrogram; Speech emotion recognition; Vision transformer.


Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1. Traditional speech emotion recognition framework.
Figure 2. Proposed Vision Transformer architectural framework.
Figure 3. Mel spectrogram of selected emotions.
Figure 4. TESS dataset emotion distribution.
Figure 5. EMODB dataset emotion distribution.
Figure 6. Loss curves of the proposed model on the three benchmark datasets: (a) TESS, (b) EMODB, and (c) TESS-EMODB.
Figure 7. Summary of the classification report for F1-score, recall, and precision.
Figure 8. Confusion matrices for TESS, EMODB, and TESS-EMODB.
Figure 9. Sample emotion recognition outputs of the proposed model on the three datasets: (i) TESS, (ii) EMODB, and (iii) TESS-EMODB.
