Impact of Feature Selection Algorithm on Speech Emotion Recognition Using Deep Convolutional Neural Network

Misbah Farooq et al. Sensors (Basel). 2020 Oct 23;20(21):6008. doi: 10.3390/s20216008.

Abstract

Speech emotion recognition (SER) plays a significant role in human-machine interaction. Recognizing emotion from speech and classifying it precisely is a challenging task because a machine cannot understand the context of an utterance. For accurate emotion classification, emotionally relevant features must be extracted from the speech data. Traditionally, handcrafted features have been used for emotion classification from speech signals; however, they are not efficient enough to depict the emotional state of the speaker accurately. In this study, the benefits of a deep convolutional neural network (DCNN) for SER are explored. For this purpose, a pretrained network is used to extract features from state-of-the-art speech emotional datasets. Subsequently, a correlation-based feature selection technique is applied to the extracted features to select the most appropriate and discriminative features for SER. For the classification of emotions, we utilize support vector machines, random forests, the k-nearest neighbors algorithm, and neural network classifiers. Experiments are performed for speaker-dependent and speaker-independent SER using four publicly available datasets: the Berlin Dataset of Emotional Speech (Emo-DB), Surrey Audio Visual Expressed Emotion (SAVEE), Interactive Emotional Dyadic Motion Capture (IEMOCAP), and the Ryerson Audio Visual Dataset of Emotional Speech and Song (RAVDESS). In speaker-dependent experiments, our proposed method achieves an accuracy of 95.10% for Emo-DB, 82.10% for SAVEE, 83.80% for IEMOCAP, and 81.30% for RAVDESS. Moreover, our method outperforms existing handcrafted-feature-based SER approaches in the speaker-independent setting.
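
The sketch below illustrates the pipeline the abstract describes: pre-extracted DCNN features are reduced by a correlation-based selection step and then fed to the four classifiers. It assumes the DCNN features are already available as a matrix X with emotion labels y; the ranking by feature-label correlation is a simple stand-in for the paper's correlation-based feature selection, and the function and variable names (select_features_by_correlation, X, y, k) are illustrative, not from the paper.

    # Minimal pipeline sketch: correlation-based feature selection + four classifiers.
    # Assumes DCNN features have already been extracted from speech utterances.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import accuracy_score

    def select_features_by_correlation(X, y, k=500):
        """Keep the k features most correlated (in absolute value) with the labels."""
        corrs = np.array([abs(np.corrcoef(X[:, i], y)[0, 1]) for i in range(X.shape[1])])
        corrs = np.nan_to_num(corrs)              # guard against constant features
        top = np.argsort(corrs)[::-1][:k]
        return X[:, top], top

    # Placeholder data standing in for DCNN features of speech utterances
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 2048))              # 500 utterances, 2048-dim DCNN features
    y = rng.integers(0, 7, size=500)              # 7 emotion classes, as in Emo-DB

    X_sel, _ = select_features_by_correlation(X, y, k=500)
    X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y, test_size=0.2, random_state=0)

    classifiers = {
        "SVM": SVC(kernel="rbf"),
        "Random Forest": RandomForestClassifier(n_estimators=200),
        "k-NN": KNeighborsClassifier(n_neighbors=5),
        "Neural Network": MLPClassifier(hidden_layer_sizes=(128,), max_iter=500),
    }
    for name, clf in classifiers.items():
        clf.fit(X_tr, y_tr)
        print(name, accuracy_score(y_te, clf.predict(X_te)))

With real DCNN features and a speaker-dependent or speaker-independent split in place of the random placeholder data, this loop reproduces the kind of per-classifier accuracy comparison reported in the results.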

Keywords: correlation-based feature selection; deep convolutional neural network; speech emotion recognition.


Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1. The framework of our proposed methodology.
Figure 2. Confusion matrix of the Emo-DB dataset for speaker-dependent SER.
Figure 3. Confusion matrix of the SAVEE dataset for speaker-dependent SER.
Figure 4. Confusion matrix of the RAVDESS dataset for speaker-dependent SER.
Figure 5. Confusion matrix of the IEMOCAP dataset for speaker-dependent SER.
Figure 6. Confusion matrix of the Emo-DB dataset for speaker-independent SER.
Figure 7. Confusion matrix of the SAVEE dataset for speaker-independent SER.
Figure 8. Confusion matrix of the RAVDESS dataset for speaker-independent SER.
Figure 9. Confusion matrix of the IEMOCAP dataset for speaker-independent SER.

