Sensors (Basel). 2020 Sep 28;20(19):5559. doi: 10.3390/s20195559.

Fusing Visual Attention CNN and Bag of Visual Words for Cross-Corpus Speech Emotion Recognition


Minji Seo et al. Sensors (Basel). 2020.

Abstract

Speech emotion recognition (SER) classifies emotions using low-level features or a spectrogram of an utterance. When SER methods are trained and tested on different datasets, their performance degrades. Cross-corpus SER research therefore addresses recognizing speech emotion across different corpora and languages, and recent work has focused on improving generalization. To improve cross-corpus SER performance, we pretrained on log-mel spectrograms of the source dataset using our visual attention convolutional neural network (VACNN), a 2D CNN base model with channel- and spatial-wise visual attention modules. When training on the target dataset, we extracted a bag-of-visual-words (BOVW) feature vector to assist the fine-tuned model. Because visual words represent local features in an image, the BOVW helps the VACNN learn both global and local features of the log-mel spectrogram by constructing a frequency histogram of visual words. The proposed method achieves an overall accuracy of 83.33%, 86.92%, and 75.00% on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), the Berlin Database of Emotional Speech (EmoDB), and the Surrey Audio-Visual Expressed Emotion (SAVEE) database, respectively. Experimental results on RAVDESS, EmoDB, and SAVEE demonstrate improvements of 7.73%, 15.12%, and 2.34% over existing state-of-the-art cross-corpus SER approaches.
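The abstract describes the VACNN as a 2D CNN with channel- and spatial-wise visual attention applied to log-mel spectrograms. The authors' code is not shown here; the following is a minimal sketch of one such attention-augmented block in the spirit of that description (CBAM-style pooling, and the module names, reduction ratio, and kernel size are illustrative assumptions, not the paper's specification).

```python
# Sketch of channel- then spatial-wise attention on a 2D CNN feature map.
# Names, reduction ratio, and kernel size are assumptions, not the authors' code.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        # Pool over the (frequency, time) axes, then weight each channel.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        w = torch.sigmoid(avg + mx).unsqueeze(-1).unsqueeze(-1)
        return x * w

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        # Aggregate across channels, then weight each (frequency, time) location.
        avg = x.mean(dim=1, keepdim=True)
        mx = x.amax(dim=1, keepdim=True)
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * w

class VACNNBlock(nn.Module):
    """Conv block followed by channel- and spatial-wise visual attention."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.ca = ChannelAttention(out_ch)
        self.sa = SpatialAttention()

    def forward(self, x):
        x = self.conv(x)
        return self.sa(self.ca(x))

# Example: a batch of 1-channel log-mel spectrograms (64 mel bins x 128 frames).
x = torch.randn(4, 1, 64, 128)
print(VACNNBlock(1, 32)(x).shape)  # torch.Size([4, 32, 64, 128])
```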

Keywords: bag of visual words; convolutional neural network; cross-corpus; log-mel spectrograms; speech emotion recognition; visual attention.
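For the BOVW feature named above, the abstract specifies a frequency histogram of visual words computed from local features of the log-mel spectrogram. Below is a minimal sketch of that idea under assumed details: raw fixed-size spectrogram patches as local descriptors and a k-means codebook, neither of which is stated in the abstract.

```python
# Sketch of a bag-of-visual-words histogram over spectrogram patches.
# Patch size, stride, and codebook size are assumptions for illustration.
import numpy as np
from sklearn.cluster import KMeans

def extract_patches(spec: np.ndarray, patch: int = 8, stride: int = 4) -> np.ndarray:
    """Slide a patch x patch window over a log-mel spectrogram (mels x frames)."""
    rows = []
    for i in range(0, spec.shape[0] - patch + 1, stride):
        for j in range(0, spec.shape[1] - patch + 1, stride):
            rows.append(spec[i:i + patch, j:j + patch].ravel())
    return np.asarray(rows)

def bovw_histogram(spec: np.ndarray, codebook: KMeans) -> np.ndarray:
    """Frequency histogram of visual-word assignments for one spectrogram."""
    words = codebook.predict(extract_patches(spec))
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()  # normalize so utterance length cancels out

# Fit a small codebook on patches pooled from (synthetic) training spectrograms.
rng = np.random.default_rng(0)
train_specs = [rng.standard_normal((64, 128)) for _ in range(10)]
codebook = KMeans(n_clusters=32, n_init=10, random_state=0)
codebook.fit(np.vstack([extract_patches(s) for s in train_specs]))

print(bovw_histogram(train_specs[0], codebook).shape)  # (32,)
```

The normalized histogram is a fixed-length vector regardless of utterance duration, which is what lets it complement the fine-tuned CNN as a global descriptor.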


Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1. Overall architecture of our pretraining and fine-tuning models for SER.
Figure 2. Architecture of the proposed pretrained VACNN model.
Figure 3. Architecture of the proposed fine-tuned model.

