A Comparison of Machine Learning Algorithms and Feature Sets for Automatic Vocal Emotion Recognition in Speech

Cem Doğdu et al. Sensors (Basel). 2022 Oct 6;22(19):7561. doi: 10.3390/s22197561.

Abstract

Vocal emotion recognition (VER) in natural speech, often referred to as speech emotion recognition (SER), remains challenging for both humans and computers. Applied fields including clinical diagnosis and intervention, social interaction research and Human-Computer Interaction (HCI) increasingly benefit from efficient VER algorithms. Several feature sets have been used with machine-learning (ML) algorithms for discrete emotion classification, but there is no consensus on which low-level descriptors and classifiers are optimal. We therefore compared the performance of ML algorithms across several feature sets. Concretely, seven ML algorithms were compared on the Berlin Database of Emotional Speech: Multilayer Perceptron Neural Network (MLP), J48 Decision Tree (DT), Support Vector Machine with Sequential Minimal Optimization (SMO), Random Forest (RF), k-Nearest Neighbor (KNN), Simple Logistic Regression (LOG) and Multinomial Logistic Regression (MLR), each with 10-fold cross-validation and four openSMILE feature sets (IS-09, emobase, GeMAPS and eGeMAPS). Results indicated that SMO, MLP and LOG performed better (reaching accuracies of up to 87.85%, 84.00% and 83.74%, respectively) than RF, DT, MLR and KNN (with minimum accuracies of 73.46%, 53.08%, 70.65% and 58.69%, respectively). Overall, the emobase feature set performed best. We discuss the implications of these findings for applications in diagnosis, intervention and HCI.
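The abstract names openSMILE for feature extraction but not the classification toolkit. As a rough illustration of the described pipeline, here is a minimal Python sketch assuming the opensmile and scikit-learn packages, a local copy of the Berlin Database of Emotional Speech (EmoDB) wav files, and scikit-learn's SVC as a stand-in for the SMO-trained SVM; the directory layout, file-name parsing and all parameters are illustrative assumptions, not the authors' setup.

```python
# Illustrative sketch only: approximates the abstract's pipeline
# (openSMILE functionals + classifier + 10-fold cross-validation).
# Paths and the EmoDB file-naming parse below are assumptions.
from pathlib import Path

import opensmile
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# emobase functionals: one fixed-length feature vector per utterance.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.emobase,
    feature_level=opensmile.FeatureLevel.Functionals,
)

# Standard EmoDB naming, e.g. "03a01Fa.wav": the 6th character encodes the emotion.
emotion_codes = {"W": "anger", "L": "boredom", "E": "disgust", "A": "fear",
                 "F": "happiness", "T": "sadness", "N": "neutral"}

files = sorted(Path("emodb/wav").glob("*.wav"))   # assumed local dataset copy
labels = [emotion_codes[f.stem[5]] for f in files]
X = pd.concat([smile.process_file(str(f)) for f in files])

# Linear-kernel SVM as a stand-in for the SMO-trained SVM; scaling is an assumption.
clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
scores = cross_val_score(clf, X.values, labels,
                         cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0))
print(f"mean 10-fold accuracy: {scores.mean():.4f}")
```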

Keywords: emotional speech database; feature set; machine learning; speech; vocal emotion recognition.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
Computations for the prediction performance evaluations.
Figure 2
Classification performance measures among feature sets. Precision, Recall, AUPRC and AUC values are weighted averages, with weights given by the number of instances of each class in the database. Data bars represent values between 0 and 1; the length of each bar is determined by the value in its cell.
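Assuming the standard support-weighted averaging the caption describes, the weighted precision over C classes, where class c has n_c instances, would be

```latex
\mathrm{Precision}_{\mathrm{weighted}} = \frac{\sum_{c=1}^{C} n_c \,\mathrm{Precision}_c}{\sum_{c=1}^{C} n_c}
```

and analogously for Recall, AUPRC and AUC.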
Figure 3
F-measures for each emotion. Color coding indicates performance, with dark green indicating the best, dark red the poorest, and yellow intermediate classification performance, as shown in the color bar.
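The per-emotion F-measure is presumably the standard F1 score, the harmonic mean of precision and recall:

```latex
F_1 = \frac{2\,\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```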
Figure 4
Confusion matrices of the predictions with (a) emobase/MLP, (b) IS-09/SMO, (c) GeMAPS/MLR, (d) eGeMAPS/LOG, (e) IS-09/RF, (f) eGeMAPS/KNN. The x-axis represents the ground-truth labels and the y-axis the predicted labels. Note: cells give percentages, which determine the color map, together with absolute counts in parentheses to transparently indicate the different base frequencies of the predicted emotions. Percentages and counts are omitted for empty cells to enhance readability.
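Continuing the earlier Python sketch (reusing its clf, X and labels; the normalization and axis layout below are guesses from the caption, not the authors' stated procedure), pooled out-of-fold predictions would yield one such matrix per feature-set/classifier pair:

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Out-of-fold predictions from the same 10-fold scheme as before.
y_pred = cross_val_predict(clf, X.values, labels,
                           cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0))

emotions = sorted(set(labels))
cm = confusion_matrix(labels, y_pred, labels=emotions).T  # rows: predicted, columns: ground truth
cm_pct = 100 * cm / cm.sum(axis=0, keepdims=True)         # percent of each ground-truth class (assumed)
print(np.round(cm_pct, 1))
```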
