Review

Timing in audiovisual speech perception: A mini review and new psychophysical data

Jonathan H Venezia et al. Atten Percept Psychophys. 2016 Feb;78(2):583-601. doi: 10.3758/s13414-015-1026-y.

Abstract

Recent influential models of audiovisual speech perception suggest that visual speech aids perception by generating predictions about the identity of upcoming speech sounds. These models place stock in the assumption that visual speech leads auditory speech in time. However, it is unclear whether and to what extent temporally-leading visual speech information contributes to perception. Previous studies exploring audiovisual-speech timing have relied upon psychophysical procedures that require artificial manipulation of cross-modal alignment or stimulus duration. We introduce a classification procedure that tracks perceptually relevant visual speech information in time without requiring such manipulations. Participants were shown videos of a McGurk syllable (auditory /apa/ + visual /aka/ = perceptual /ata/) and asked to perform phoneme identification (/apa/ yes-no). The mouth region of the visual stimulus was overlaid with a dynamic transparency mask that obscured visual speech in some frames but not others, varying randomly across trials. Variability in participants' responses (~35% identification of /apa/ compared to ~5% in the absence of the masker) served as the basis for classification analysis. The outcome was a high-resolution spatiotemporal map of perceptually relevant visual features. We produced these maps for McGurk stimuli at different audiovisual temporal offsets (natural timing, 50-ms visual lead, and 100-ms visual lead). Briefly, temporally-leading (~130 ms) visual information did influence auditory perception. Moreover, several visual features influenced perception of a single speech sound, with the relative influence of each feature depending on both its temporal relation to the auditory signal and its informational content.
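To make the logic of the classification analysis concrete, below is a minimal sketch in Python of a reverse-correlation (classification-image) estimate: per-frame, per-pixel mask transparency is compared between trials that did and did not yield fusion. The array shapes, variable names, and the simple difference-of-means estimator are illustrative assumptions, not the authors' actual analysis pipeline.

```python
# Minimal sketch of a classification-image (reverse-correlation) analysis.
# This is NOT the paper's exact pipeline; shapes, names, and the
# difference-of-means estimator are illustrative assumptions.
import numpy as np

def classification_image(masks, responses):
    """Estimate per-pixel, per-frame influence on perception.

    masks     : (n_trials, n_frames, height, width) array of alpha values
                (assumed layout; 1 = mouth fully visible, 0 = fully masked).
    responses : (n_trials,) boolean array, True when the trial produced
                McGurk fusion (/ata/ percept), False otherwise.

    Returns a (n_frames, height, width) map: positive values mark pixels
    whose visibility co-occurred with fusion, negative values the reverse.
    """
    fused = masks[responses].mean(axis=0)        # mean mask on fusion trials
    not_fused = masks[~responses].mean(axis=0)   # mean mask on non-fusion trials
    return fused - not_fused

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    masks = rng.random((500, 50, 32, 32))        # fake masked-AV stimuli
    # Fake observer: fusion is more likely when frame 20 is visible.
    responses = masks[:, 20].mean(axis=(1, 2)) + 0.1 * rng.standard_normal(500) > 0.5
    ci = classification_image(masks, responses)
    print(ci.shape, ci[20].mean() > ci[5].mean())   # frame 20 should dominate
```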

Keywords: Audiovisual speech; Classification image; McGurk; Multisensory integration; Prediction; Speech kinematics; Timing.


Figures

Figure 1. Twenty-five frames from an example Masked-AV stimulus
Masker alpha (transparency) values were spatiotemporally correlated such that only certain frames would be revealed on a given trial. These frames are outlined in red on the example stimulus shown here. Upon close inspection, one can see that the mouth is visible in these frames but not in others. There was a smooth, natural transition between transparency and opacity when movies were presented in real time.
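As a rough illustration of how a spatiotemporally correlated transparency mask of this kind could be generated, the sketch below smooths Gaussian white noise over space and time so that alpha values change gradually across frames rather than flickering. The filter widths, frame counts, rescaling, and reliance on SciPy are assumptions for illustration, not the parameters used to build the actual stimuli.

```python
# Sketch: spatiotemporally correlated alpha mask via smoothed noise.
# Parameter values are illustrative assumptions.
import numpy as np
from scipy.ndimage import gaussian_filter

def make_alpha_mask(n_frames=50, height=64, width=64,
                    sigma_t=2.0, sigma_xy=8.0, seed=None):
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((n_frames, height, width))
    # Smoothing over (time, y, x) makes alpha change gradually from frame
    # to frame, so the mouth fades in and out instead of flickering.
    smooth = gaussian_filter(noise, sigma=(sigma_t, sigma_xy, sigma_xy))
    # Rescale to [0, 1]: 1 = fully transparent (mouth visible), 0 = opaque.
    return (smooth - smooth.min()) / (smooth.max() - smooth.min())

alpha = make_alpha_mask(seed=1)
print(alpha.shape, float(alpha.min()), float(alpha.max()))
```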
Figure 2. McGurk visual stimulus parameters, calculated following Chandrasekaran et al. (2009)
Pictured are curves showing the visual interlip distance (blue, top) and lip velocity (red, middle). These curves describe the temporal evolution of the visual AKA signal used to construct our McGurk stimuli. Also pictured is the auditory APA waveform used in the SYNC (synchronized) McGurk stimulus. Several features are marked by numbers on the graphs: (1) corresponds to the onset of lip closure during the initial vowel production; (2) corresponds to the point at which the lips were half-way closed at the offset of the initial vowel production; (3) corresponds to the onset of consonant-related sound energy (3 dB up from the trough in the acoustic intensity contour); (4) corresponds to the offset of the formant transitions in the acoustic consonant cluster. The time between (2) and (3) is the so-called 'time to voice.' The edges of the purple shaded region correspond to (1) and (2). The edges of the green shaded region correspond to (3) and (4). The yellow shaded region shows the time to voice. As shown in the upper panel, visual information related to /k/ may be spread across all three regions.
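The kinematic measures described above can be computed from an interlip-distance trace roughly as follows: lip velocity is the time derivative of interlip distance, and time to voice is the interval between lip half-closure (landmark 2) and the onset of consonant-related sound energy (landmark 3). The synthetic trace, the 30-fps frame rate, and the placeholder consonant-onset time in this sketch are assumptions, not the measured stimulus values.

```python
# Sketch of lip velocity and time to voice from an interlip-distance trace.
# The trace, frame rate, and consonant-onset time are placeholders.
import numpy as np

fps = 30.0                                   # assumed video frame rate
t = np.arange(0, 1.5, 1.0 / fps)             # seconds
# Synthetic interlip distance (cm): open vowel, closure, reopening.
interlip = 1.0 - 0.9 * np.exp(-((t - 0.7) ** 2) / (2 * 0.08 ** 2))

lip_velocity = np.gradient(interlip, 1.0 / fps)   # cm/s

# Landmark 2: first time the lips are half-way closed while still closing.
half_closed = 0.5 * (interlip.max() + interlip.min())
closing = np.where((interlip <= half_closed) & (lip_velocity < 0))[0]
t_half_closure = t[closing[0]]

# Landmark 3: onset of consonant-related sound energy (assumed value here).
t_consonant_onset = 0.95                     # seconds, placeholder

time_to_voice = t_consonant_onset - t_half_closure
print(f"time to voice ≈ {time_to_voice * 1000:.0f} ms")
```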
Figure 3. Audiovisual asynchrony in the SYNC McGurk stimulus, calculated following Schwartz and Savariaux (2014)
(Top) Pictured is a trace of the acoustic waveform (black) with the extracted envelope (red). Also pictured are curves showing the visual interlip distance (blue) and the acoustic envelope converted to a decibel scale (green). The time axis is zoomed in on the portion of the stimulus containing the offset of the initial vowel through the onset of the following consonant cluster. Red markers indicate the offset of the initial vowel in the visual (circle, 0.15 cm↓) and auditory (square, 3 dB↓) signals, and the onset of the consonant in the visual (circle, 0.15 cm↑) and auditory (square, 3 dB↑) signals. Audiovisual asynchrony at vowel offset and consonant onset can be calculated by taking the difference (in time) between corresponding markers on the visual and auditory signals. (Bottom) The analysis was repeated on the congruent audiovisual AKA stimulus from which the visual signal for the McGurk stimulus was drawn.
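A minimal sketch of the asynchrony criterion described in this caption, assuming synthetic signals: find where the interlip distance first rises 0.15 cm above its trough (visual consonant onset) and where the acoustic envelope first rises 3 dB above its trough (auditory consonant onset), then take the time difference. The Gaussian-shaped placeholder signals and sample rates are illustrative only, not the published method's implementation.

```python
# Sketch of consonant-onset asynchrony from threshold crossings.
# Signals and sample rates below are synthetic placeholders.
import numpy as np

def onset_after_trough(signal, times, delta):
    """Time at which `signal` first exceeds (trough + delta) after its trough."""
    i_trough = int(np.argmin(signal))
    after = np.where(signal[i_trough:] >= signal[i_trough] + delta)[0]
    return times[i_trough + after[0]]

fps = 30.0
t_video = np.arange(0, 1.5, 1.0 / fps)
interlip = 1.0 - 0.9 * np.exp(-((t_video - 0.7) ** 2) / (2 * 0.08 ** 2))   # cm

fs = 1000.0
t_audio = np.arange(0, 1.5, 1.0 / fs)
envelope_db = -30.0 * np.exp(-((t_audio - 0.75) ** 2) / (2 * 0.05 ** 2))   # dB re: peak

t_visual_onset = onset_after_trough(interlip, t_video, 0.15)   # 0.15 cm criterion
t_audio_onset = onset_after_trough(envelope_db, t_audio, 3.0)  # 3 dB criterion

print(f"consonant-onset asynchrony ≈ {(t_audio_onset - t_visual_onset) * 1000:.0f} ms")
```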
Figure 4. Results: Group classification movie for SYNC
Fifty example frames from the classification movie for the SYNC McGurk stimulus are displayed. Warm colors mark pixels that contributed significantly to fusion: when these pixels were transparent, fusion was reliably observed. Cool colors mark pixels that showed the opposite effect: when these pixels were transparent, fusion was reliably blocked. Only pixels that survived multiple-comparison correction at FDR q < 0.05 are assigned a color.
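As an illustration of this kind of multiple-comparison correction, the sketch below applies the standard Benjamini-Hochberg FDR procedure at q = 0.05 to an array of per-pixel p-values. The random p-values are demo data, and this is a generic FDR implementation rather than the authors' specific statistical pipeline.

```python
# Generic Benjamini-Hochberg FDR thresholding; demo p-values only.
import numpy as np

def fdr_mask(p_values, q=0.05):
    """Boolean mask of p-values surviving Benjamini-Hochberg FDR at level q."""
    p = np.asarray(p_values).ravel()
    order = np.argsort(p)
    ranked = p[order]
    m = p.size
    below = ranked <= q * np.arange(1, m + 1) / m      # BH step-up criterion
    mask = np.zeros(m, dtype=bool)
    if below.any():
        k = int(np.max(np.where(below)[0]))            # largest rank meeting the criterion
        mask[order[:k + 1]] = True                     # all smaller p-values are significant
    return mask.reshape(np.shape(p_values))

rng = np.random.default_rng(0)
p = np.concatenate([rng.uniform(0, 0.001, 50),   # pixels with a real effect
                    rng.uniform(0, 1, 950)])     # null pixels
print(int(fdr_mask(p).sum()), "of 1000 pixels survive FDR q < 0.05")
```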
Figure 5. Results: Group classification time-courses for each McGurk stimulus
The group-mean classification coefficients are shown for each frame in the SYNC (top), V-Lead50 (middle), and V-Lead100 (bottom) McGurk stimuli. Significant frames are labeled with stars. These frames contributed reliably to McGurk fusion. Values close to zero (dotted lines) did not reliably influence perception. The waveform of the auditory signal (gray) for each stimulus is plotted beneath the classification time course (blue).
Figure 6. Classification time-courses for the SYNC, V-Lead50, and V-Lead100 McGurk stimuli (blue) are plotted along with the lip velocity function (red)
The figure is zoomed in on the time period containing frames that contributed significantly to fusion (marked as red circles). Classification time-courses have been normalized (max = 1). The onset of the yellow shaded period corresponds to lip closure following the initial vowel, and the offset corresponds to the onset of consonant-related sound energy (3 dB up from the trough in the acoustic envelope). We have labeled this the 'pre-burst' visual /k/. Shaded in green is the period containing the auditory consonant /p/, from the onset of sound energy to the onset of the vowel steady state. The green shaded region is shifted appropriately to account for auditory lags in V-Lead50 and V-Lead100. A region on the lip velocity curve is shaded pink. This region corresponds to the 'post-burst' visual /k/, as estimated from the classification time-courses. Changes in the oral aperture are labeled (black) on the lip velocity function. The 'release' point marks the time at which interlip distance (not pictured) increased by 0.15 cm from the trough at oral closure (note: the release of the tongue from the velum during the production of /k/ may have occurred at a different time point).
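A brief sketch, on placeholder data, of two operations mentioned in this caption: normalizing a classification time-course to a peak of 1, and marking the lip 'release' as the first frame after oral closure at which interlip distance exceeds its trough by 0.15 cm. All values below are invented for illustration.

```python
# Sketch: peak-normalize a classification time-course and find the lip release.
# Arrays are invented placeholder data.
import numpy as np

coeffs = np.array([0.02, 0.05, 0.30, 0.55, 0.40, 0.10])   # per-frame classification weights
coeffs_norm = coeffs / np.abs(coeffs).max()                # peak now equals 1

interlip = np.array([0.90, 0.40, 0.08, 0.07, 0.15, 0.30, 0.60])  # cm; trough = oral closure
i_trough = int(np.argmin(interlip))
after = np.where(interlip[i_trough:] >= interlip[i_trough] + 0.15)[0]
release_frame = i_trough + int(after[0])
print(coeffs_norm.max(), "release at frame", release_frame)
```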
