Sensors (Basel). 2017 Jul 24;17(7):1694. doi: 10.3390/s17071694.

Emotion Recognition from Chinese Speech for Smart Affective Services Using a Combination of SVM and DBN

Lianzhang Zhu et al. Sensors (Basel). 2017.

Abstract

Accurate emotion recognition from speech is important for applications like smart health care, smart entertainment, and other smart services. High-accuracy emotion recognition from Chinese speech is challenging due to the complexities of the Chinese language. In this paper, we explore how to improve the accuracy of speech emotion recognition, including speech signal feature extraction and emotion classification methods. Five types of features are extracted from a speech sample: Mel-frequency cepstral coefficients (MFCC), pitch, formant, short-term zero-crossing rate, and short-term energy. By comparing statistical features with deep features extracted by a Deep Belief Network (DBN), we attempt to find the features that best identify the emotional status of speech. We propose a novel classification method that combines a DBN and a support vector machine (SVM) instead of using only one of them. In addition, a conjugate gradient method is applied to train the DBN in order to speed up the training process. Gender-dependent experiments are conducted using an emotional speech database created by the Chinese Academy of Sciences. The results show that DBN features can reflect emotion status better than hand-crafted features, and our new classification approach achieves an accuracy of 95.8%, which is higher than using either the DBN or the SVM separately. Results also show that a DBN can work very well on small training databases if it is properly designed.
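Of the five hand-crafted features listed in the abstract, the two simplest, short-term energy and short-term zero-crossing rate, can be computed on framed audio with NumPy alone. The sketch below is illustrative, not the paper's implementation; the 25 ms frame length, 10 ms hop, and 16 kHz sampling rate are common defaults assumed here, not values taken from the paper.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    # Split a 1-D signal into overlapping frames
    # (400 samples = 25 ms, 160 samples = 10 ms at 16 kHz).
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def short_term_energy(frames):
    # Mean squared amplitude per frame.
    return np.mean(frames ** 2, axis=1)

def short_term_zcr(frames):
    # Fraction of adjacent-sample sign changes per frame.
    signs = np.sign(frames)
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

# Synthetic test signal: 1 s of a 220 Hz tone at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
x = 0.5 * np.sin(2 * np.pi * 220 * t)

frames = frame_signal(x)
energy = short_term_energy(frames)   # ~0.125 for a 0.5-amplitude sine
zcr = short_term_zcr(frames)         # ~2 * 220 / 16000 sign changes per sample
```

Per-frame statistics (mean, variance, extrema) of such trajectories are the kind of statistical features the paper compares against DBN-learned deep features.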

Keywords: Deep Belief Networks; speech emotion recognition; speech features; support vector machine.


Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1. Process of extracting speech features.
Figure 2. Process of extracting Mel-Frequency Cepstral Coefficients (MFCC).
Figure 3. Structure of a Deep Belief Network (DBN).
Figure 4. Structure of the combined support vector machine (SVM) and DBN. Speech features are converted into deep features by a pre-trained DBN; the deep features are the feature vectors output by the last hidden layer of the DBN. These feature vectors act as the input of the SVM and are used to train it. The output of the SVM classifier is the emotion status corresponding to the input speech sample.
Figure 5. Dataset structure.
Figure 6. DBN training phase.
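The combined classifier described in the abstract (deep features from a pre-trained DBN fed into an SVM) can be sketched with scikit-learn. This is a hedged approximation, not the paper's method: scikit-learn's `BernoulliRBM` is a single RBM layer standing in for the paper's multi-layer, conjugate-gradient-trained DBN, and the synthetic two-class data, layer size, and hyperparameters are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy stand-in for per-utterance statistical speech features:
# two synthetic "emotion" classes with shifted means.
X = np.vstack([rng.normal(0.0, 1.0, (100, 20)),
               rng.normal(1.5, 1.0, (100, 20))])
y = np.array([0] * 100 + [1] * 100)

# Scale to [0, 1] (BernoulliRBM expects that range), learn hidden-layer
# features with the RBM, then classify them with an RBF-kernel SVM.
model = Pipeline([
    ("scale", MinMaxScaler()),
    ("rbm", BernoulliRBM(n_components=32, learning_rate=0.05,
                         n_iter=20, random_state=0)),
    ("svm", SVC(kernel="rbf", C=1.0)),
])
model.fit(X, y)
```

In the paper's pipeline the SVM replaces the softmax layer that would normally sit on top of the DBN; the intuition is that the DBN provides a discriminative representation while the SVM supplies a maximum-margin decision boundary, which is why the combination outperforms either model alone on their data.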

