Sensors (Basel). 2020 Sep 28;20(19):5559. doi: 10.3390/s20195559.

Fusing Visual Attention CNN and Bag of Visual Words for Cross-Corpus Speech Emotion Recognition


Minji Seo et al. Sensors (Basel). 2020.

Abstract

Speech emotion recognition (SER) classifies emotions using low-level features or a spectrogram of an utterance. When SER methods are trained and tested on different datasets, their performance degrades. Cross-corpus SER research therefore addresses recognizing speech emotion across different corpora and languages, and recent work has focused on improving generalization. To improve cross-corpus SER performance, we pretrained on log-mel spectrograms of the source dataset using our visual attention convolutional neural network (VACNN), a 2D CNN base model with channel- and spatial-wise visual attention modules. When training on the target dataset, we extracted a bag-of-visual-words (BOVW) feature vector to assist the fine-tuned model. Because visual words represent local features in an image, the BOVW helps the VACNN learn both global and local features of the log-mel spectrogram by constructing a frequency histogram of visual words. The proposed method achieves an overall accuracy of 83.33%, 86.92%, and 75.00% on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), the Berlin Database of Emotional Speech (EmoDB), and the Surrey Audio-Visual Expressed Emotion (SAVEE) database, respectively. Experimental results on RAVDESS, EmoDB, and SAVEE demonstrate improvements of 7.73%, 15.12%, and 2.34% over existing state-of-the-art cross-corpus SER approaches.
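The abstract describes the VACNN as a 2D CNN with channel- and spatial-wise visual attention applied to log-mel spectrograms. The authors' code is not shown here; the following is a minimal sketch of one such attention-augmented block in the spirit of that description (CBAM-style pooling, and the module names, reduction ratio, and kernel size are illustrative assumptions, not the paper's specification).

```python
# Sketch of channel- then spatial-wise attention on a 2D CNN feature map.
# Names, reduction ratio, and kernel size are assumptions, not the authors' code.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        # Pool over the (frequency, time) axes, then weight each channel.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        w = torch.sigmoid(avg + mx).unsqueeze(-1).unsqueeze(-1)
        return x * w

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        # Aggregate across channels, then weight each (frequency, time) location.
        avg = x.mean(dim=1, keepdim=True)
        mx = x.amax(dim=1, keepdim=True)
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * w

class VACNNBlock(nn.Module):
    """Conv block followed by channel- and spatial-wise visual attention."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.ca = ChannelAttention(out_ch)
        self.sa = SpatialAttention()

    def forward(self, x):
        x = self.conv(x)
        return self.sa(self.ca(x))

# Example: a batch of 1-channel log-mel spectrograms (64 mel bins x 128 frames).
x = torch.randn(4, 1, 64, 128)
print(VACNNBlock(1, 32)(x).shape)  # torch.Size([4, 32, 64, 128])
```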

Keywords: bag of visual words; convolutional neural network; cross-corpus; log-mel spectrograms; speech emotion recognition; visual attention.
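For the BOVW feature named above, the abstract specifies a frequency histogram of visual words computed from local features of the log-mel spectrogram. Below is a minimal sketch of that idea under assumed details: raw fixed-size spectrogram patches as local descriptors and a k-means codebook, neither of which is stated in the abstract.

```python
# Sketch of a bag-of-visual-words histogram over spectrogram patches.
# Patch size, stride, and codebook size are assumptions for illustration.
import numpy as np
from sklearn.cluster import KMeans

def extract_patches(spec: np.ndarray, patch: int = 8, stride: int = 4) -> np.ndarray:
    """Slide a patch x patch window over a log-mel spectrogram (mels x frames)."""
    rows = []
    for i in range(0, spec.shape[0] - patch + 1, stride):
        for j in range(0, spec.shape[1] - patch + 1, stride):
            rows.append(spec[i:i + patch, j:j + patch].ravel())
    return np.asarray(rows)

def bovw_histogram(spec: np.ndarray, codebook: KMeans) -> np.ndarray:
    """Frequency histogram of visual-word assignments for one spectrogram."""
    words = codebook.predict(extract_patches(spec))
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()  # normalize so utterance length cancels out

# Fit a small codebook on patches pooled from (synthetic) training spectrograms.
rng = np.random.default_rng(0)
train_specs = [rng.standard_normal((64, 128)) for _ in range(10)]
codebook = KMeans(n_clusters=32, n_init=10, random_state=0)
codebook.fit(np.vstack([extract_patches(s) for s in train_specs]))

print(bovw_histogram(train_specs[0], codebook).shape)  # (32,)
```

The normalized histogram is a fixed-length vector regardless of utterance duration, which is what lets it complement the fine-tuned CNN as a global descriptor.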


Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1. Overall architecture of our pretraining and fine-tuning models for SER.
Figure 2. Architecture of the proposed pretrained VACNN model.
Figure 3. Architecture of the proposed fine-tuned model.

