Behav Res Methods. 2024 Sep;56(6):5693-5708. doi: 10.3758/s13428-023-02300-4. Epub 2023 Dec 13.

Ecologically valid speech collection in behavioral research: The Ghent Semi-spontaneous Speech Paradigm (GSSP)

Jonas Van Der Donckt et al. Behav Res Methods. 2024 Sep.

Abstract

This paper introduces the Ghent Semi-spontaneous Speech Paradigm (GSSP), a new method for collecting unscripted speech data for affective-behavioral research in both experimental and real-world settings through the description of peer-rated pictures with a consistent affective load. The GSSP was designed to meet five criteria: (1) allow flexible speech recording durations, (2) provide a straightforward and non-interfering task, (3) allow for experimental control, (4) favor spontaneous speech for its prosodic richness, and (5) require minimal human interference to enable scalability. The validity of the GSSP was evaluated through an online task, in which this paradigm was implemented alongside a fixed-text read-aloud task. The results indicate that participants were able to describe images for an adequate duration, and acoustic analysis showed that most features trended in line with the targeted speech styles (i.e., unscripted spontaneous speech versus scripted read-aloud speech). A speech style classification model using acoustic features achieved a balanced accuracy of 83% on within-dataset validation, indicating separability between the GSSP and the read-aloud speech task. Furthermore, when this model was validated on an external dataset containing interview and read-aloud speech, a balanced accuracy of 70% was obtained, indicating an acoustic correspondence between GSSP speech and spontaneous interviewee speech. The GSSP is of special interest to behavioral and speech researchers looking to capture spontaneous speech, both in longitudinal ambulatory behavioral studies and in laboratory studies. To facilitate future research on speech styles, acoustics, and affective states, the task implementation code, the collected dataset, and the analysis notebooks are available.
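To make the classification step concrete, below is a minimal Python sketch of a speech-style classifier evaluated with balanced accuracy, the metric reported above. It is an illustration only, not the authors' pipeline: the feature matrix, labels, and model choice (logistic regression on standardized features) are placeholders for the study's actual acoustic features and classifier.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import balanced_accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Placeholder data: 200 utterances x 88 acoustic features
    # (88 matches the eGeMAPS descriptor count; any feature set works).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 88))
    y = rng.choice(["gssp", "read_aloud"], size=200)

    # Stratified split, feature standardization, linear classifier.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    clf.fit(X_tr, y_tr)

    # Balanced accuracy averages per-class recall, so it is robust to
    # class imbalance between the two speech styles.
    print(balanced_accuracy_score(y_te, clf.predict(X_te)))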

Keywords: Acoustics; Behavioral research; Experimental research; Machine learning; Psycholinguistics; Speech; Speech collection; Speech styles.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Flowchart of the web application experiment. Note. This results in 7 Marloes, 15 Radboud, and 15 PiSCES utterances per participant
Fig. 2
Trial flow chart of the web app speech collection task, with the pages translated to English. First, an empty page (a) is displayed with an enabled start button and a disabled stop button. When the participant clicks the start button (b), audio recording begins, the stop button is enabled, and the stimulus is presented as an image (or as text for the read-aloud task). After completing the stimulus speech collection task, the participant clicks the stop button, which redirects to (c), where they report their experienced arousal and valence values
Fig. 3
Audio data processing flowchart
Fig. 4
VAD slicing with a 0.25-s margin around the first and last voiced segments. Note. The first voiced region occurs approximately 2 seconds after the participant pressed the “start” button. Slicing ensures that each participant's first and last voiced segments start and end at the same relative time, enabling fair comparisons of fixed-duration excerpts measured from the beginning or end of the VAD slice
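The slicing rule in this caption is simple enough to sketch. The following Python function is a hypothetical implementation: it assumes a frame-level boolean VAD output (the VAD backend is not specified in the caption) and trims the waveform to the first/last voiced frame plus a 0.25-s margin.

    import numpy as np

    def vad_slice(audio: np.ndarray, sr: int, voiced: np.ndarray,
                  hop_s: float, margin_s: float = 0.25) -> np.ndarray:
        """Trim `audio` to [first voiced frame - margin, last voiced frame + margin].

        `voiced` is a per-frame boolean VAD decision; `hop_s` is the frame hop
        in seconds. Sample indices are clipped to the recording boundaries.
        """
        frames = np.flatnonzero(voiced)
        if frames.size == 0:
            return audio  # no voiced frames detected; keep the recording as-is
        start = max(0, int((frames[0] * hop_s - margin_s) * sr))
        end = min(len(audio), int((frames[-1] * hop_s + margin_s) * sr))
        return audio[start:end]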
Fig. 5
Distribution plot of the VAD-sliced utterance durations. The vertical dashed lines on the left indicate the voiced duration threshold (15 seconds) and the lines on the right represent the instructed image description duration (30 seconds)
Fig. 6
Box plot of temporal features, grouped by collection task (row 1) and speech style (row 2)
Fig. 7
Box plot of frequency-related features, grouped by task (row 1) and speech style (row 2)
Fig. 8
Box plot of amplitude-related features, grouped by task (row 1) and speech style (row 2)
Fig. 9
Picture delta box plot of a subset of openSMILE features for both the PiSCES (column 1) and Radboud (column 2) image sets. The deltas are calculated by subtracting each value from the participant’s mean for the same image set
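The delta computation described in this caption amounts to centering each feature within participant and image set. Below is a hypothetical pandas sketch; the column names are assumptions, not taken from the paper's code.

    import pandas as pd

    def picture_deltas(df: pd.DataFrame, feature_cols: list[str]) -> pd.DataFrame:
        """Center features per participant and image set.

        Sign convention here is value minus participant mean; flip the
        subtraction if the reverse (mean minus value) is intended.
        """
        out = df.copy()
        means = out.groupby(["participant", "image_set"])[feature_cols].transform("mean")
        out[feature_cols] = out[feature_cols] - means
        return out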
Fig. 10
Two-dimensional t-SNE projection of ECAPA-TDNN utterance embeddings. (a) Hue determined by speaker ID. (b) Hue determined by speech style. Note. Each marker represents one speech utterance and, as illustrated in (a), each cluster of markers represents utterances by one speaker. When each dot is colored by its speech (trial) style (b), the individual speech styles generally cluster together within each speaker's utterances. This hints at a separability of speech styles based on the acoustic properties used by speaker identification techniques
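A hypothetical sketch of the embedding-plus-projection pipeline behind this figure, using a pretrained SpeechBrain ECAPA-TDNN speaker encoder and scikit-learn's t-SNE; the checkpoint name, file layout, and API calls are assumptions about common tooling, and the paper's exact pipeline may differ.

    import glob

    import torch
    import torchaudio
    from sklearn.manifold import TSNE
    from speechbrain.pretrained import EncoderClassifier

    # Pretrained ECAPA-TDNN speaker-verification encoder.
    encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

    def embed(paths):
        """Return one ECAPA-TDNN embedding per (mono) utterance file."""
        embeddings = []
        with torch.no_grad():  # inference only
            for path in paths:
                wav, sr = torchaudio.load(path)
                wav = torchaudio.functional.resample(wav, sr, 16000)  # encoder expects 16 kHz
                embeddings.append(encoder.encode_batch(wav).squeeze())
        return torch.stack(embeddings)

    wav_paths = sorted(glob.glob("utterances/*.wav"))  # hypothetical file layout
    # t-SNE's default perplexity (30) requires more than 30 utterances.
    xy = TSNE(n_components=2).fit_transform(embed(wav_paths).numpy())
    # Plot `xy`, coloring markers by speaker ID (panel a) or speech style (panel b).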
