Sci Rep. 2025 Mar 10;15(1):8254. doi: 10.1038/s41598-025-92640-2.

A multi-dilated convolution network for speech emotion recognition

Samaneh Madanian et al. Sci Rep.

Abstract

Speech emotion recognition (SER) is an important application in affective computing and artificial intelligence. Recently, there has been significant interest in deep neural networks that operate on speech spectrograms. Because the two-dimensional spectrogram representation captures rich speech characteristics, convolutional neural networks (CNNs) and advanced image recognition models are increasingly leveraged to learn deep patterns from spectrograms and perform SER effectively. Accordingly, in this study we propose a novel SER model based on learning from utterance-level spectrograms. First, we use the Spatial Pyramid Pooling (SPP) strategy to remove the fixed input-size constraint associated with CNN-based image recognition. Then, the SPP layer is deployed to extract both a global-level prominent feature vector and multi-local-level feature vectors, followed by an attention model that weights the feature vectors. Finally, we apply the ArcFace layer, typically used for face recognition, to the SER task, thereby improving SER performance. Our model achieved an unweighted accuracy of 67.9% on the IEMOCAP dataset and 77.6% on the EMODB dataset.
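
As a rough illustration of two components named above, the sketch below shows a generic spatial pyramid pooling layer and an additive-angular-margin (ArcFace) classification head in PyTorch. It is not the authors' implementation; the pyramid levels (1, 2, 4), the scale of 30.0, and the margin of 0.5 are assumed defaults rather than values reported in the paper.

    # Illustrative sketch only (PyTorch assumed); not the authors' code.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SpatialPyramidPooling(nn.Module):
        """Pool a variable-size feature map into a fixed-length vector."""
        def __init__(self, levels=(1, 2, 4)):   # pyramid levels are an assumption
            super().__init__()
            self.levels = levels

        def forward(self, x):                   # x: (batch, channels, H, W), any H/W
            pooled = [F.adaptive_max_pool2d(x, k).flatten(start_dim=1)
                      for k in self.levels]
            return torch.cat(pooled, dim=1)     # (batch, channels * sum(k * k))

    class ArcFaceHead(nn.Module):
        """Additive angular margin (ArcFace) logits for the emotion classes."""
        def __init__(self, in_features, num_classes, scale=30.0, margin=0.5):
            super().__init__()
            self.weight = nn.Parameter(torch.randn(num_classes, in_features))
            self.scale, self.margin = scale, margin

        def forward(self, embeddings, labels):
            # Cosine similarity between L2-normalised embeddings and class weights.
            cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
            theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
            # Add the angular margin only to the target-class angle.
            one_hot = F.one_hot(labels, cosine.size(1)).bool()
            logits = torch.where(one_hot, torch.cos(theta + self.margin), cosine)
            return self.scale * logits          # pass to a cross-entropy loss

Because the SPP output length depends only on the channel count and the pyramid levels, spectrograms of different durations can share one fully connected classifier, which is the size-constraint removal the abstract refers to.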

Keywords: Convolution neural network; Deep learning; Emotion recognition; Loss layer; Spectrogram; Speech emotion recognition.

Conflict of interest statement

Competing interest: The authors declare no competing interest.

Figures

Figure 1. The gridding effect of the dilated CNN structure, adapted from.
Figure 2. The multi-dilated CNN, SPP, and ArcFace SER framework.
Figure 3. The traditional convolution layer versus the dilated convolution layer, adapted from.
Figure 4. Multi-dilated CNN blocks (see the sketch after this list).
Figure 5. Global pooling versus SPP.
Figure 6. Performance comparison of an 8-layer plain CNN with different SPP schema: training (red line) / validation (blue line).
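
The multi-dilated CNN blocks of Figure 4 suggest parallel convolution branches with different dilation rates. The sketch below is one generic way to build such a block in PyTorch; the branch count, the 3x3 kernels, and the dilation rates (1, 2, 4) are assumptions for illustration, not the configuration reported in the paper.

    # Illustrative sketch only; configuration is assumed, not taken from the paper.
    import torch
    import torch.nn as nn

    class MultiDilatedBlock(nn.Module):
        """Parallel 3x3 convolutions with different dilation rates over one input."""
        def __init__(self, in_ch, out_ch, dilations=(1, 2, 4)):
            super().__init__()
            self.branches = nn.ModuleList(
                nn.Sequential(
                    # padding = dilation keeps every branch at the input's spatial size.
                    nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=d, dilation=d),
                    nn.BatchNorm2d(out_ch),
                    nn.ReLU(inplace=True),
                )
                for d in dilations
            )

        def forward(self, x):                   # x: (batch, in_ch, H, W)
            # Concatenating branches mixes receptive fields of several sizes,
            # a common way to mitigate the gridding effect shown in Figure 1.
            return torch.cat([branch(x) for branch in self.branches], dim=1)

A stack of such blocks followed by SPP and an ArcFace head would correspond, loosely, to the framework pictured in Figure 2.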

