Effect of auditory cues to lexical stress on the visual perception of gestural timing
- PMID: 40307538
- PMCID: PMC12331874
- DOI: 10.3758/s13414-025-03072-z
Abstract
Speech is often accompanied by gestures. Since beat gestures (simple, nonreferential up-and-down hand movements) frequently co-occur with prosodic prominence, they can indicate stress in a word and hence influence spoken-word recognition. However, little is known about the reverse influence of auditory speech on visual perception. The current study investigated whether lexical stress affects the perceived timing of hand beats. We used videos in which a disyllabic word, embedded in a carrier sentence (Experiment 1) or presented in isolation (Experiment 2), was coupled with an up-and-down hand beat, while varying their degree of asynchrony. Results from Experiment 1, a novel beat timing estimation task, revealed that gestures were estimated to occur closer in time to the pitch peak of a stressed syllable than they actually did, reducing the perceived temporal distance between gesture and stress by around 60%. Using a forced-choice task, Experiment 2 further demonstrated that listeners tended to perceive a gesture falling midway between two syllables as occurring on the syllable carrying the stronger stress cues, and this auditory effect was greatest when gestural timing was most ambiguous. Our findings suggest that f0 and intensity are the driving forces behind the temporal attraction effect of stress on perceived gestural timing. This study provides new evidence for auditory influences on visual perception, supporting bidirectionality in the audiovisual interaction between speech-related signals that occur in everyday face-to-face communication.
Keywords: Audiovisual synchrony; Beat gestures; Psycholinguistics; Speech perception; Temporal processing.
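To make the size of the reported temporal attraction effect concrete, the sketch below expresses it as a simple proportional shift of the perceived gesture-to-stress distance. This is a hypothetical illustration, not the authors' analysis code: the function name, the example asynchronies, and the assumption that the shift is linear in the actual asynchrony are inventions for demonstration; only the roughly 60% reduction comes from the abstract.

```python
# Hypothetical illustration of the reported ~60% temporal attraction effect.
# Only the 0.6 reduction figure comes from the abstract; all other numbers
# and names are invented for demonstration.

def perceived_asynchrony(actual_asynchrony_ms: float, attraction: float = 0.6) -> float:
    """Perceived gesture-to-stress distance after attraction toward the pitch peak.

    attraction: proportion of the actual temporal distance that perception
    "closes" (Experiment 1 reports roughly 0.6).
    """
    return actual_asynchrony_ms * (1.0 - attraction)

# Example: a beat whose apex trails the pitch peak by 300 ms would be
# estimated to occur only ~120 ms away from it.
for actual in (100.0, 200.0, 300.0):
    print(f"actual: {actual:6.1f} ms -> perceived: {perceived_asynchrony(actual):6.1f} ms")
```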
© 2025. The Author(s).
Conflict of interest statement
Declarations.
Ethics approval: Approvals were obtained from the Ethics Committee of the Faculty of Social Sciences at Radboud University (project code for Experiment 1: ECSW-LT-2024-1-15-36673; for Experiment 2: ECSW-LT-2024-4-9-33816). The procedures used in this study adhere to the tenets of the Declaration of Helsinki.
Consent to participate: All participants involved in the study were above 16 years of age and gave informed consent prior to their participation in the experiments.
Consent for publication: All participants gave consent for their anonymous experimental data to be published.
Conflicts of interest: The authors have no competing interests to declare that are relevant to the content of this article.