A Cantonese Audio-Visual Emotional Speech (CAVES) dataset

Chee Seng Chong et al.

Behav Res Methods. 2024 Aug;56(5):5264-5278. doi: 10.3758/s13428-023-02270-7. Epub 2023 Nov 28.

Abstract

We present a Cantonese emotional speech dataset suitable for research investigating the auditory and visual expression of emotion in tonal languages. This unique dataset consists of auditory and visual recordings of ten native speakers of Cantonese, each uttering 50 sentences in the six basic emotions (angry, happy, sad, surprise, fear, and disgust) plus neutral. The visual recordings have a full HD resolution of 1920 × 1080 pixels and were captured at 50 fps. The key features of the dataset are outlined, along with the factors considered in compiling it. A validation study of the recorded emotion expressions was conducted in which 15 native Cantonese perceivers completed a forced-choice emotion identification task. The variability across speakers and sentences was examined by testing the degree of concordance between the intended and the perceived emotion. We compared these results with those of other emotion perception and evaluation studies that have tested spoken emotions in languages other than Cantonese. The dataset is freely available for research purposes.
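
To make the concordance analysis concrete, the following minimal sketch (not the authors' analysis code; all column names and data are illustrative assumptions) shows how a forced-choice response table could be scored: it builds a confusion matrix of intended versus perceived emotion and computes percent-correct recognition per intended emotion.

import pandas as pd

# Hypothetical long-format response table: one row per trial, recording the
# emotion the speaker intended and the label the perceiver selected in the
# forced-choice task. Column names and values are illustrative only.
responses = pd.DataFrame({
    "intended":  ["angry", "angry", "happy", "sad", "happy", "fear"],
    "perceived": ["angry", "sad", "happy", "sad", "surprise", "fear"],
})

# Confusion matrix: rows = intended emotion, columns = perceived emotion.
confusion = pd.crosstab(responses["intended"], responses["perceived"])
print(confusion)

# Percent-correct (intended-perceived concordance) per emotion: the proportion
# of trials in which the perceived label matched the intended one, times 100.
correct = responses["intended"] == responses["perceived"]
accuracy = correct.groupby(responses["intended"]).mean().mul(100)
print(accuracy.round(1))

Per-speaker or per-sentence recognition scores (as in Figs. 4 and 5) would follow the same pattern, grouping by a speaker or sentence identifier instead of the intended emotion.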

Keywords: Auditory and visual expressions; Cantonese dataset; Dataset evaluation; Emotional speech.

Figures

Fig. 1 The setup in the recording booth, showing the camera, screen, microphone, lighting, and participant's seat

Fig. 2 A single frame extracted from a video clip to illustrate the extent to which the video was cropped

Fig. 3 Percent accuracy scores for all emotion types by presentation condition (model-based standard errors)

Fig. 4 Mean percent correct recognition score for each speaker in the CAVES dataset. Note: female speakers were given identifiers starting with 'F' followed by a number from 1 to 5 to denote each individual speaker; similarly, male speakers were given identifiers starting with 'M'.

Fig. 5 Mean percent correct recognition scores for all 50 sentences across the six emotion types
