Effect on speech emotion classification of a feature selection approach using a convolutional neural network

Ammar Amjad et al. PeerJ Comput Sci. 2021 Nov 3;7:e766. doi: 10.7717/peerj-cs.766. eCollection 2021.

Abstract

Speech emotion recognition (SER) is a challenging task because it is not clear which features are effective for classification. Emotion-related features are typically extracted from speech signals, and handcrafted features are the most common choice for identifying emotion from audio. However, such features are not sufficient to correctly identify the speaker's emotional state. This work investigates the advantages of a deep convolutional neural network (DCNN): a pretrained network is used to extract features from speech emotion databases, and a feature selection (FS) approach is then applied to find the most discriminative features for SER. For classification, we compare random forest (RF), decision tree (DT), support vector machine (SVM), multilayer perceptron (MLP), and k-nearest neighbors (KNN) classifiers on seven emotions. All experiments are performed on four publicly available databases. With feature selection, our method obtains speaker-dependent (SD) accuracies of 92.02%, 88.77%, 93.61%, and 77.23% on Emo-DB, SAVEE, RAVDESS, and IEMOCAP, respectively. Furthermore, compared with current handcrafted-feature-based SER methods, the proposed method achieves the best results for speaker-independent (SI) SER. On Emo-DB, all classifiers attain an accuracy above 80% with or without feature selection.
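The abstract describes a three-stage pipeline: pretrained-DCNN feature extraction, feature selection, and classical classification. Below is a minimal sketch of that flow; the paper's actual FS algorithm and classifier hyperparameters are not given here, so SelectKBest with an ANOVA F-test and an RBF-kernel SVM are assumed stand-ins.

```python
# Minimal sketch of the described pipeline: DCNN features -> feature
# selection -> classical classifier. X stands in for features already
# extracted by a pretrained CNN (one row per utterance); y stands in for
# the seven emotion labels. FS method and SVM settings are assumptions,
# not the paper's reported configuration.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4096))   # placeholder for pooled DCNN features
y = rng.integers(0, 7, size=500)   # placeholder for 7 emotion classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=512)),  # keep the k most discriminative features
    ("svm", SVC(kernel="rbf", C=10.0)),
])
clf.fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))
```

The same pipeline accepts any of the compared classifiers (RF, DT, MLP, KNN) by swapping the final step.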

Keywords: Convolutional neural network; Data augmentation; Feature extraction; Feature selection; Mel-spectrogram; Speech emotion recognition.
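The keywords point to mel-spectrogram inputs and data augmentation. A minimal sketch of producing such inputs follows, using librosa; the parameter values (n_mels, hop length, noise level) are illustrative assumptions, since the abstract does not state the paper's actual settings.

```python
# Illustrative log-mel spectrogram extraction plus a simple noise-injection
# augmentation. All parameter values are assumptions, not the paper's
# reported settings.
import librosa
import numpy as np

def mel_spectrogram(path, sr=16000, n_mels=128, hop_length=512):
    y, sr = librosa.load(path, sr=sr)
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                       hop_length=hop_length)
    return librosa.power_to_db(S, ref=np.max)  # log-mel in dB

def add_noise(y, noise_level=0.005, seed=0):
    # One common augmentation: additive Gaussian noise on the waveform.
    rng = np.random.default_rng(seed)
    return y + noise_level * rng.normal(size=y.shape)
```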


Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Figure 1. The structure of our proposed model for audio emotion recognition.
Figure 2. The general architecture of AlexNet. Convolutional-layer parameters are denoted "Conv-[kernel size]-[stride size]-[number of channels]"; max-pooling-layer parameters are denoted "Maxpool-[kernel size]-[stride size]". (See the feature-extraction sketch after this list.)
Figure 3. Confusion matrix obtained by the SVM on the Emo-DB database for the SD experiment.
Figure 4. Confusion matrix obtained by the SVM on the SAVEE database for the SD experiment.
Figure 5. Confusion matrix obtained by the SVM on the RAVDESS database for the SD experiment.
Figure 6. Confusion matrix obtained by the MLP on the IEMOCAP database for the SD experiment.
Figure 7. Confusion matrix obtained by the SVM on the RAVDESS database for the SI experiment.
Figure 8. Confusion matrix obtained by the MLP on the RAVDESS database for the SI experiment.
Figure 9. Confusion matrix obtained by the SVM on the IEMOCAP database for the SI experiment.
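Figure 2 identifies AlexNet as the pretrained backbone. Below is a minimal sketch of using torchvision's ImageNet-pretrained AlexNet as a fixed feature extractor; feeding spectrograms as 3x224x224 images and tapping the penultimate fully connected layer (4096-d) are assumptions, since the abstract does not state these details.

```python
# Sketch: pretrained AlexNet as a fixed feature extractor for spectrogram
# "images". The input size and the tapped layer are assumptions, not the
# paper's stated setup.
import torch
import torchvision.models as models
from torchvision.models import AlexNet_Weights

model = models.alexnet(weights=AlexNet_Weights.IMAGENET1K_V1)
model.eval()
# Drop the final classification layer so the head outputs 4096-d features.
model.classifier = torch.nn.Sequential(*list(model.classifier.children())[:-1])

with torch.no_grad():
    batch = torch.randn(8, 3, 224, 224)  # stand-in spectrogram batch
    feats = model(batch)                 # shape: (8, 4096)
print(feats.shape)
```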
