PLoS One. 2023 Nov 21;18(11):e0291500. doi: 10.1371/journal.pone.0291500. eCollection 2023.

Speech emotion recognition using machine learning techniques: Feature extraction and comparison of convolutional neural network and random forest


Mohammad Mahdi Rezapour Mashhadi et al. PLoS One.

Abstract

Speech is a direct and rich way of transmitting information and emotions from one point to another. In this study, we aimed to classify different emotions in speech using various audio features and machine learning models. We extracted several types of audio features: Mel-frequency cepstral coefficients, chromagram, Mel-scale spectrogram, spectral contrast, Tonnetz representation, and zero-crossing rate. We used a limited speech emotion recognition (SER) dataset and augmented it with additional audio recordings. In contrast to many previous studies, we combined all audio files before conducting our analysis. We compared the performance of two models: a one-dimensional convolutional neural network (conv1D) and a random forest (RF) with RF-based feature selection. Our results showed that RF with feature selection achieved higher average accuracy (69%) than conv1D, with the highest precision for fear (72%) and the highest recall for calm (84%). Our study demonstrates the effectiveness of RF with feature selection for speech emotion classification on a limited dataset. For both models, anger was most often misclassified as happy, disgust as sad and neutral, and fear as sad. This may be due to the similarity of some acoustic features across these emotions, such as pitch, intensity, and tempo.
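As a rough illustration of the pipeline the abstract describes — computing frame-level audio features, then letting a random forest rank and select them — the sketch below implements one of the listed features (zero-crossing rate) in plain NumPy and applies scikit-learn's SelectFromModel for RF-based feature selection on synthetic data. All function names, frame sizes, and the toy feature matrix are illustrative assumptions, not the authors' code or data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

def zero_crossing_rate(signal, frame_length=2048, hop_length=512):
    # Fraction of adjacent-sample sign changes within each frame.
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    rates = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop_length : i * hop_length + frame_length]
        rates[i] = np.mean(np.abs(np.diff(np.sign(frame))) > 0)
    return rates

# Synthetic 1-second tone: a 440 Hz sine crosses zero ~880 times per second,
# so the per-sample rate at 22050 Hz should be roughly 880 / 22050 ≈ 0.04.
sr = 22050
t = np.arange(sr) / sr
zcr = zero_crossing_rate(np.sin(2 * np.pi * 440 * t))

# RF-based feature selection on a toy feature matrix:
# 200 "clips" x 10 features, where only the first two carry class signal.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 10))
X[:, 0] += 2.0 * y   # informative feature
X[:, 1] -= 1.5 * y   # informative feature
selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))
selector.fit(X, y)
selected = np.flatnonzero(selector.get_support())
```

In a full SER pipeline, each audio clip would yield a vector of such features (MFCCs, chromagram, Mel spectrogram statistics, and so on), and the forest's impurity-based importances would prune the uninformative dimensions before classification.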


Conflict of interest statement

There are no competing interests.

Figures

Fig 1. Audio waveplot and spectrogram of fear and anger emotions.
Fig 2. Methodological steps for audio classification.
Fig 3. Comparison of conv1D and random forest models for audio emotion classification.
