Atten Percept Psychophys. 2025 Oct;87(7):2207-2222.
doi: 10.3758/s13414-025-03072-z. Epub 2025 Apr 30.

Effect of auditory cues to lexical stress on the visual perception of gestural timing

Chengjia Ye et al. Atten Percept Psychophys. 2025 Oct.

Abstract

Speech is often accompanied by gestures. Because beat gestures (simple, nonreferential up-and-down hand movements) frequently co-occur with prosodic prominence, they can indicate stress in a word and hence influence spoken-word recognition. However, little is known about the reverse influence of auditory speech on visual perception. The current study investigated whether lexical stress affects the perceived timing of hand beats. We used videos in which a disyllabic word, embedded in a carrier sentence (Experiment 1) or presented in isolation (Experiment 2), was coupled with an up-and-down hand beat, and we varied the degree of asynchrony between the word and the beat. Results from Experiment 1, a novel beat timing estimation task, revealed that gestures were estimated to occur closer in time to the pitch peak in a stressed syllable than they actually did, reducing the perceived temporal distance between gestures and stress by around 60%. Using a forced-choice task, Experiment 2 further demonstrated that when a gesture fell midway between two syllables, listeners tended to perceive it on the syllable receiving the stronger cues to stress, and this auditory effect was greatest when gestural timing was most ambiguous. Our findings suggest that f0 and intensity are the driving force behind the temporal attraction effect of stress on perceived gestural timing. This study provides new evidence for auditory influences on visual perception, supporting bidirectionality in the audiovisual interaction between speech-related signals that occur in everyday face-to-face communication.

Keywords: Audiovisual synchrony; Beat gestures; Psycholinguistics; Speech perception; Temporal processing.

Conflict of interest statement

Declarations. Ethics approval: Approvals were obtained from the Ethics committee of the Faculty of Social Sciences at Radboud University (project code for Experiment 1: ECSW-LT-2024–1–15–36,673, for Experiment 2: ECSW-LT-2024–4–9–33,816). The procedures used in this study adhere to the tenets of the Declaration of Helsinki. Consent to participate: All participants involved in the study were above 16 years of age and gave informed consent prior to their participation in the experiments. Consent for publication: All participants gave consent for their anonymous experimental data to be published. Conflicts of interest: The authors have no competing interests to declare that are relevant to the content of this article.

Figures

Fig. 1
Five equally spaced frames extracted from the visual stimuli. The third frame was at the end of the carrier sentence, at which point the speaker raised his right hand to the highest point, marking the end of the preparation phase of a beat gesture. The fourth frame was the gestural apex (the lowest point of the hand beat), functioning as the kinematic landmark in the video. In the last frame, the hand was back in the rest position.
Fig. 2
The four phases in each trial. (a) The preparation phase with trial information (written in Dutch in the experiment) that was ended by pressing the space bar. (b) The video phase, during which the audiovisual video with a particular SOA between video and audio was played once; it ended automatically after the offset of the video. (c) The silence phase that lasted 500 ms, at the beginning of which a red fixation cross was shown at the center. (d) The audio replay phase during which the audio in (b) was played again with the fixation cross remaining on the screen; participants needed to press the space bar to indicate the time they had perceived the beat apex, giving a beat timing estimate. A trial ended after the audio offset. The preparation phase of a new trial (if there was one) was then presented on the screen. (Color figure online)
Fig. 3
Distribution of 2,994 measurements of time difference over all SOA steps. The solid orange dots depict the means at each SOA step across the two stress patterns presented in pairs. Words with stress on the first syllable are marked in dark blue and appear on the left, while words with stress on the second syllable are marked in light green and appear on the right. A response with a time difference of y = 0 means that a participant pressed the space bar during the audio replay at exactly the time at which the gestural apex had been presented in the preceding video, indicating no attraction. The more distant a response is from the y = 0 line, the stronger the attraction effect. Responses falling above the y = 0 line reflect a forward attraction, so that the gestural apex was perceived later than its actual timing; those below reflect a backward attraction, so that the apex was perceived earlier than its actual timing. The downward slope of the orange line reflects the overall magnitude and direction of the attraction effect. The dashed purple line y = −x in the background shows hypothetical complete attraction for reference; it corresponds to space-bar presses at the time point of the acoustic pitch peak instead of the beat apex. (Color figure online)
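The link between the slope in this plot and the size of the attraction effect can be illustrated with a minimal Python sketch. It rests on two assumptions that are not stated in the caption: that x is the SOA expressed as apex time minus pitch-peak time, and that y is response time minus apex time (a sign convention inferred from the reference line y = −x). The data points are invented for illustration and are not the study's measurements.

```python
def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def attraction_proportion(soas_ms, time_diffs_ms):
    """Proportion of the gesture-to-peak distance removed in perception.

    A slope of 0 means no attraction (responses at the true apex);
    a slope of -1 means complete attraction (responses at the pitch peak).
    """
    return -ols_slope(soas_ms, time_diffs_ms)


# Hypothetical data (assumption): SOA = apex time minus pitch-peak time, in ms.
hypothetical_soas = [-200, -100, 0, 100, 200]
# Hypothetical data (assumption): response time minus apex time, in ms.
hypothetical_diffs = [120, 60, 0, -60, -120]

proportion = attraction_proportion(hypothetical_soas, hypothetical_diffs)
print(f"Attraction proportion: {proportion:.2f}")
```

Under these assumptions, a slope of −0.6 would correspond to responses that close about 60% of the gap between the apex and the pitch peak. Whether the reported figure of around 60% was derived from a regression slope in this way is not stated here.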
Fig. 4
The seven steps of co-varying f0 (a; in Hz) and intensity (b; in dB) cues to lexical stress. Step 1 (purple) was the duration-controlled trochaic word VOORnaam and Step 7 (yellow) was the iambic word voorNAAM. Steps 2–6 were interpolated intermediate levels of equal distance. Note that Steps 2 and 6 were not used in the experiment. Panel (c) illustrates the oscillograms of Steps 1 and 7, with the grey dashed line x = 319 ms showing the syllable boundary between the two syllables. The scales of time (the x-axes) of all three panels are the same. (Color figure online)
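The equal-distance interpolation between the two endpoint steps can be sketched in Python as follows. This is a simplified illustration under stated assumptions: it interpolates a single value per step rather than a full contour over time, it interpolates linearly on the Hz scale (the caption only says the steps were of equal distance), and the endpoint values are hypothetical placeholders, not the stimulus values.

```python
def interpolate_steps(start, end, n_steps=7):
    """Return n_steps equally spaced values from start to end, inclusive."""
    return [start + (end - start) * i / (n_steps - 1) for i in range(n_steps)]


# Hypothetical endpoint values in Hz (assumption, not taken from the stimuli).
HYPOTHETICAL_F0_STEP1 = 220.0  # stand-in for the trochaic endpoint (Step 1)
HYPOTHETICAL_F0_STEP7 = 160.0  # stand-in for the iambic endpoint (Step 7)

all_steps = interpolate_steps(HYPOTHETICAL_F0_STEP1, HYPOTHETICAL_F0_STEP7)

# Steps 2 and 6 were generated but not used in the experiment.
used_steps = {
    number: value
    for number, value in enumerate(all_steps, start=1)
    if number not in (2, 6)
}

for number, value in used_steps.items():
    print(f"Step {number}: {value:.1f} Hz")
```

The same function could be applied to intensity values in dB, so that both cues co-vary across steps.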
Fig. 5
The line plot of the percentage of responses with beat apex perceived on the second syllable naam over nine visual steps of the timing of beat apex. Vertical differences between colored lines indicate the auditory attraction effect. All visual steps were between the pitch peaks in the two syllables of the disyllabic word voornaam in the audio. The distance between two adjacent steps was 34 ms. Visual Step 1 (V1), the pitch peak of voor, was at 230 ms in the audio and Step 9 (V9), the pitch peak of naam, was at 499 ms. The horizontal grey dashed line (y = 50) splits the plot into an upper and a lower part. Responses above this line were biased towards perceiving the beat apex on the second syllable, whereas those below were biased towards the first syllable. The vertical grey dashed line (x = 3.68) between Visual Steps 3 (V3) and 4 (V4) indicates the boundary between the two syllables, which was at 319 ms in the audio; it was 22 ms after V3 and 12 ms before V4. (Color figure online)
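The timing values in this caption can be checked with a short Python sketch. It assumes that the nine visual steps were evenly spaced between the two pitch peaks; the caption reports the spacing as 34 ms, which matches this assumption after rounding.

```python
FIRST_PEAK_MS = 230   # V1: pitch peak of "voor"
SECOND_PEAK_MS = 499  # V9: pitch peak of "naam"
BOUNDARY_MS = 319     # syllable boundary in the audio
N_STEPS = 9

# Assumption: steps are evenly spaced between the two pitch peaks.
spacing_ms = (SECOND_PEAK_MS - FIRST_PEAK_MS) / (N_STEPS - 1)
steps_ms = [FIRST_PEAK_MS + i * spacing_ms for i in range(N_STEPS)]

# Distance from the syllable boundary to its neighbouring visual steps.
after_v3_ms = BOUNDARY_MS - steps_ms[2]   # V3 is the third step
before_v4_ms = steps_ms[3] - BOUNDARY_MS  # V4 is the fourth step

print(f"Spacing: {spacing_ms:.3f} ms")
print(f"Boundary is {after_v3_ms:.2f} ms after V3")
print(f"Boundary is {before_v4_ms:.2f} ms before V4")
```

With even spacing, the exact step size is 33.625 ms, and the boundary falls 21.75 ms after V3 and 11.875 ms before V4, consistent with the rounded values of 34, 22, and 12 ms given in the caption.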
