Review

Prediction and constraint in audiovisual speech perception

Jonathan E Peelle et al. Cortex. 2015 Jul;68:169-81. doi: 10.1016/j.cortex.2015.03.006. Epub 2015 Mar 20.

Abstract

During face-to-face conversational speech, listeners must efficiently process a rapid and complex stream of multisensory information. Visual speech can serve as a critical complement to auditory information because it provides cues to both the timing of the incoming acoustic signal (the amplitude envelope, influencing attention and perceptual sensitivity) and its content (place and manner of articulation, constraining lexical selection). Here we review behavioral and neurophysiological evidence regarding listeners' use of visual speech information. Multisensory integration of audiovisual speech cues improves recognition accuracy, particularly for speech in noise. Even when speech is intelligible based solely on auditory information, adding visual information may reduce the cognitive demands placed on listeners by increasing the precision of prediction. Electrophysiological studies demonstrate that oscillatory cortical entrainment to speech in auditory cortex is enhanced when visual speech is present, increasing sensitivity to important acoustic cues. Neuroimaging studies also suggest increased activity in auditory cortex when congruent visual information is available, but additionally emphasize the role of heteromodal regions of the posterior superior temporal sulcus in integrative processing. We interpret these findings in a framework of temporally focused lexical competition in which visual speech information affects auditory processing through an early integration mechanism that increases sensitivity to acoustic information, and a late integration stage that incorporates specific information about a speaker's articulators to constrain the number of possible candidates in a spoken utterance. Ultimately, it is words compatible with both auditory and visual information that most strongly determine successful speech perception during everyday listening. Thus, audiovisual speech perception is accomplished through multiple stages of integration, supported by distinct neuroanatomical mechanisms.

Keywords: Audiovisual speech; Multisensory integration; Predictive coding; Predictive timing; Speech perception.


Figures

Figure 1
Illustration of lexical neighborhoods based on auditory only, visual only, and combined audiovisual speech information (intersection density), after Tye-Murray et al. (2007b). Auditory competitors differ from a target word by a single phoneme; visual competitors differ from a target word by a single viseme.
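As a rough sketch of how such neighborhoods can be computed, the Python example below contrasts phoneme-based and viseme-based neighbors for a small toy lexicon; the transcriptions and viseme groupings are simplified illustrations chosen for this example, not the materials or coding scheme used by Tye-Murray et al.

# Illustrative sketch: auditory, visual, and audiovisual lexical neighborhoods.
# Words are coded as phoneme sequences; a viseme map collapses phonemes that
# look alike on the face. All transcriptions are hypothetical simplifications.

LEXICON = {
    "cat": ("K", "AE", "T"),
    "bat": ("B", "AE", "T"),
    "cad": ("K", "AE", "D"),
    "ban": ("B", "AE", "N"),
}

# Hypothetical viseme classes: visually confusable phonemes share a label
VISEME = {"K": "velar", "B": "bilabial", "T": "alveolar",
          "D": "alveolar", "N": "alveolar", "AE": "open"}

def differs_by_one(a, b):
    """True if two equal-length sequences differ in exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

def neighbors(target, code=lambda p: p):
    """Words whose (re)coded transcription differs from the target by one segment."""
    t = tuple(code(p) for p in LEXICON[target])
    return {w for w, phones in LEXICON.items()
            if w != target and differs_by_one(t, tuple(code(p) for p in phones))}

auditory = neighbors("cat")                                    # one phoneme apart
visual = neighbors("cat", code=lambda p: VISEME.get(p, p))     # one viseme apart
audiovisual = auditory & visual                                # intersection density
print(auditory, visual, audiovisual)

Under these toy assumptions, "cad" is an auditory-only neighbor of "cat" (T and D share a viseme), "ban" is a visual-only neighbor, and only "bat" falls in the audiovisual intersection.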
Figure 2
Models of audiovisual speech perception. (a) A late integration view holds that multimodal integration occurs at a stage after modality-specific inputs have been processed. (b) An early integration view posits that integration happens concurrently with perception. Thus, visual information impacts the processing of auditory cues directly (there is no pure “auditory only” representation). (c) Hybrid models allow for integration at multiple levels.
Figure 3
Types of neural response indicating multimodal integration. (a) Responses to audiovisual speech can be categorized as equivalent to auditory-only, reduced, or enhanced. Various criteria have been used to decide whether non-equivalent responses reflect integration. (b) One frequent approach is to ask whether the response to audiovisual speech is larger than would be expected from adding the auditory-only and visual-only responses together (characterizing enhanced audiovisual responses as additive, subadditive, or superadditive). (c) A danger of examining only auditory and audiovisual responses is that an apparent audiovisual enhancement may simply reflect a preferential response to visual stimulation. (d) For cases in which enhanced responses are observed, criteria for classifying a response as multisensory include that it be larger than the strongest unimodal response (i.e., greater than max(A,V)), or that it be larger than the combined unimodal responses (larger than A+V).
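A minimal numerical sketch of these criteria (the response values are arbitrary and purely illustrative, not data from the studies reviewed) might look like the following:

# Minimal sketch of the integration criteria described above.
# a, v, av are mean response amplitudes (arbitrary units) to auditory-only,
# visual-only, and audiovisual speech, respectively.

def classify_av_response(a: float, v: float, av: float, tol: float = 1e-9) -> dict:
    """Label an audiovisual response relative to the unimodal responses."""
    summed = a + v
    if abs(av - summed) <= tol:
        additivity = "additive"        # AV roughly equals A + V
    elif av > summed:
        additivity = "superadditive"   # AV > A + V
    else:
        additivity = "subadditive"     # AV < A + V
    return {
        "additivity": additivity,
        "exceeds_max_criterion": av > max(a, v),   # AV > max(A, V)
        "exceeds_sum_criterion": av > summed,      # AV > A + V
    }

# Example: an enhanced AV response that beats max(A, V) but not A + V
print(classify_av_response(a=1.0, v=0.4, av=1.2))
# -> {'additivity': 'subadditive', 'exceeds_max_criterion': True, 'exceeds_sum_criterion': False}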
Figure 4
Neural oscillations aid perceptual sensitivity. (a) Because oscillatory activity reflects time-varying excitability, stimuli arriving at some oscillatory phases are processed more efficiently than others. (b) Phase-based sensitivity can be examined experimentally by providing current stimulation at different phases of an oscillation (modified from Volgushev et al., 1998). The phase of low-frequency oscillations affects behavior in numerous paradigms: (c) reaction times (modified from Lakatos et al., 2008); (d) human observers' accuracy in a gap detection task (modified from Henry & Obleser, 2012). (e) Low-frequency oscillations in human cortex show phase-locked responses to speech that are enhanced when speech is intelligible (modified from Peelle et al., 2013).
Figure 5
Multistage integration during audiovisual speech perception. (a) Visual information (from nonspecific thalamic inputs or posterior STS) resets the phase of low-frequency oscillations in auditory cortex, increasing perceptual sensitivity. As a result, acoustic cues are more salient, reducing confusability. (b) In a complementary fashion, visual speech gestures (e.g., place and manner of articulation) constrain the possible lexical candidates. In auditory-only speech (top), lexical candidates are based purely on auditory information. When visual information is available (bottom), it can act to constrain lexical identity. For example, an open mouth at the end of the word rules out the phonological neighbor “cap”, reducing the amount of lexical competition.

