A Comparison of Machine Learning Algorithms and Feature Sets for Automatic Vocal Emotion Recognition in Speech

Cem Doğdu et al. Sensors (Basel). 2022 Oct 6;22(19):7561. doi: 10.3390/s22197561.

Abstract

Vocal emotion recognition (VER) in natural speech, often referred to as speech emotion recognition (SER), remains challenging for both humans and computers. Applied fields including clinical diagnosis and intervention, social interaction research and Human-Computer Interaction (HCI) increasingly benefit from efficient VER algorithms. Several feature sets have been used with machine-learning (ML) algorithms for discrete emotion classification, but there is no consensus on which low-level descriptors and classifiers are optimal. We therefore compared the performance of ML algorithms across several feature sets. Concretely, seven ML algorithms were compared on the Berlin Database of Emotional Speech: Multilayer Perceptron Neural Network (MLP), J48 Decision Tree (DT), Support Vector Machine with Sequential Minimal Optimization (SMO), Random Forest (RF), k-Nearest Neighbor (KNN), Simple Logistic Regression (LOG) and Multinomial Logistic Regression (MLR), each with 10-fold cross-validation and four openSMILE feature sets (IS-09, emobase, GeMAPS and eGeMAPS). Results indicated that SMO, MLP and LOG performed better (reaching accuracies of up to 87.85%, 84.00% and 83.74%, respectively) than RF, DT, MLR and KNN (with minimum accuracies of 73.46%, 53.08%, 70.65% and 58.69%, respectively). Overall, the emobase feature set performed best. We discuss the implications of these findings for applications in diagnosis, intervention and HCI.
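The abstract names openSMILE for feature extraction but not the classification toolkit. As a rough illustration of the described pipeline, here is a minimal Python sketch assuming the opensmile and scikit-learn packages, a local copy of the Berlin Database of Emotional Speech (EmoDB) wav files, and scikit-learn's SVC as a stand-in for the SMO-trained SVM; the directory layout, file-name parsing and all parameters are illustrative assumptions, not the authors' setup.

```python
# Illustrative sketch only: approximates the abstract's pipeline
# (openSMILE functionals + classifier + 10-fold cross-validation).
# Paths and the EmoDB file-naming parse below are assumptions.
from pathlib import Path

import opensmile
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# emobase functionals: one fixed-length feature vector per utterance.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.emobase,
    feature_level=opensmile.FeatureLevel.Functionals,
)

# Standard EmoDB naming, e.g. "03a01Fa.wav": the 6th character encodes the emotion.
emotion_codes = {"W": "anger", "L": "boredom", "E": "disgust", "A": "fear",
                 "F": "happiness", "T": "sadness", "N": "neutral"}

files = sorted(Path("emodb/wav").glob("*.wav"))   # assumed local dataset copy
labels = [emotion_codes[f.stem[5]] for f in files]
X = pd.concat([smile.process_file(str(f)) for f in files])

# Linear-kernel SVM as a stand-in for the SMO-trained SVM; scaling is an assumption.
clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
scores = cross_val_score(clf, X.values, labels,
                         cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0))
print(f"mean 10-fold accuracy: {scores.mean():.4f}")
```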

Keywords: emotional speech database; feature set; machine learning; speech; vocal emotion recognition.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
Computations for the prediction performance evaluations.
Figure 2
Classification performance measures among feature sets. Precision, Recall, AUPRC and AUC values are weighted averages, with weights given by the number of instances of each class in the database. Data bars represent values between 0 and 1; the length of each bar is determined by the value in its cell.
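Assuming the standard support-weighted averaging the caption describes, the weighted precision over C classes, where class c has n_c instances, would be

```latex
\mathrm{Precision}_{\mathrm{weighted}} = \frac{\sum_{c=1}^{C} n_c \,\mathrm{Precision}_c}{\sum_{c=1}^{C} n_c}
```

and analogously for Recall, AUPRC and AUC.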
Figure 3
F-measures for each emotion. Color coding indicates performance, with dark green indicating the best, dark red the poorest, and yellow intermediate classification performance, as shown in the color bar.
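The per-emotion F-measure is presumably the standard F1 score, the harmonic mean of precision and recall:

```latex
F_1 = \frac{2\,\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```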
Figure 4
Confusion matrices of the predictions with (a) emobase/MLP, (b) IS-09/SMO, (c) GeMAPS/MLR, (d) eGeMAPS/LOG, (e) IS-09/RF, (f) eGeMAPS/KNN. The x-axis represents the ground-truth labels and the y-axis the predicted labels. Note: cells give percentages, which determine the color map, together with absolute counts in parentheses to transparently indicate the different base frequencies of the predicted emotions. Percentages and counts are omitted for empty cells to enhance readability.
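Continuing the earlier Python sketch (reusing its clf, X and labels; the normalization and axis layout below are guesses from the caption, not the authors' stated procedure), pooled out-of-fold predictions would yield one such matrix per feature-set/classifier pair:

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Out-of-fold predictions from the same 10-fold scheme as before.
y_pred = cross_val_predict(clf, X.values, labels,
                           cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0))

emotions = sorted(set(labels))
cm = confusion_matrix(labels, y_pred, labels=emotions).T  # rows: predicted, columns: ground truth
cm_pct = 100 * cm / cm.sum(axis=0, keepdims=True)         # percent of each ground-truth class (assumed)
print(np.round(cm_pct, 1))
```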
