PLoS One. 2021 Feb 19;16(2):e0246986. doi: 10.1371/journal.pone.0246986. eCollection 2021.

Bayesian binding and fusion models explain illusion and enhancement effects in audiovisual speech perception

Alma Lindborg et al. PLoS One. 2021.

Abstract

Speech is perceived with both the ears and the eyes. Adding congruent visual speech improves the perception of a faint auditory speech stimulus, whereas adding incongruent visual speech can alter the perception of the utterance. The latter phenomenon is exemplified by the McGurk illusion, where an auditory stimulus such as "ba" dubbed onto a visual stimulus such as "ga" produces the illusion of hearing "da". Bayesian models of multisensory perception suggest that both the enhancement and the illusion can be described as a two-step process of binding (informed by prior knowledge) and fusion (informed by the reliability of each sensory cue). However, to date no study has accounted for how binding and fusion each contribute to audiovisual speech perception. In this study, we expose subjects to both congruent and incongruent audiovisual speech, manipulating the binding and fusion stages simultaneously by varying both temporal offset (binding) and auditory and visual signal-to-noise ratio (fusion). We fit two Bayesian models to the behavioural data and show that both can account for the enhancement effect in congruent audiovisual speech as well as the McGurk illusion. This modelling approach allows us to disentangle the effects of binding and fusion on behavioural responses. Moreover, we find that these models have greater predictive power than a forced-fusion model. This study provides a systematic and quantitative approach to measuring audiovisual integration in the perception of the McGurk illusion as well as congruent audiovisual speech, which we hope will inform future work on audiovisual speech perception.
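For reference, the fusion stage described in the abstract is commonly formalised as reliability-weighted averaging of the unisensory estimates, which is also the forced-fusion limit against which the binding models are compared. A standard formulation from the cue-combination literature (a sketch, not taken verbatim from this paper) is:

\[
\hat{s}_{AV} = w_A \hat{s}_A + w_V \hat{s}_V,
\qquad
w_A = \frac{1/\sigma_A^{2}}{1/\sigma_A^{2} + 1/\sigma_V^{2}},
\quad
w_V = 1 - w_A,
\]

so the noisier (higher-variance) cue receives the smaller weight; lowering the auditory SNR increases \(\sigma_A^{2}\) and shifts the combined estimate toward the visual cue.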


Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1
Fig 1. The Joint Prior model of audiovisual speech perception.
Upper row: Example plots of prior, likelihood and posterior distributions. The horizontal axis represents the auditory dimension and the vertical axis the visual dimension. The prior is a Gaussian ridge along the A = V diagonal, and the likelihood is a Gaussian (here depicted with greater variance in the visual dimension). The posterior distribution is also Gaussian, pulled in the direction of the A = V diagonal. Lower row: the marginal distributions of the prior, likelihood and posterior in the auditory dimension. Response boundaries (vertical lines) are applied to the marginal posterior, and response probabilities are estimated as the probability mass (yellow area) delimited by the response boundaries.
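A minimal numerical sketch of the computation illustrated in Fig 1, assuming a one-dimensional internal feature axis per modality and grid-based normalisation; all parameter values, observations and response boundaries below are hypothetical, chosen only to mirror the figure:

```python
import numpy as np

# Grid over the internal feature space (e.g. a "ba"-"da"-"ga" continuum).
s = np.linspace(-4, 4, 201)
A, V = np.meshgrid(s, s, indexing="ij")        # axis 0: auditory, axis 1: visual

sigma_prior = 0.5                              # width of the ridge prior around A = V (hypothetical)
prior = np.exp(-(A - V) ** 2 / (2 * sigma_prior ** 2))   # Gaussian ridge along the diagonal

x_A, x_V = -1.0, 1.0                           # noisy auditory / visual observations (hypothetical)
sigma_A, sigma_V = 0.6, 1.2                    # sensory noise; larger sigma_V = less reliable vision
likelihood = (np.exp(-(A - x_A) ** 2 / (2 * sigma_A ** 2)) *
              np.exp(-(V - x_V) ** 2 / (2 * sigma_V ** 2)))

posterior = prior * likelihood
posterior /= posterior.sum()                   # normalise over the 2-D grid

marginal_A = posterior.sum(axis=1)             # marginalise out the visual dimension

# Response probabilities = probability mass between response boundaries on the auditory axis.
b_lo, b_hi = -0.5, 0.5                         # hypothetical category boundaries ("ba" | "da" | "ga")
p_ba = marginal_A[s < b_lo].sum()
p_da = marginal_A[(s >= b_lo) & (s <= b_hi)].sum()
p_ga = marginal_A[s > b_hi].sum()
print(p_ba, p_da, p_ga)
```

Because the ridge prior pulls the posterior toward A = V, the marginal auditory estimate is drawn toward the visual evidence, more strongly when the auditory cue is noisy, which is how the same machinery yields both enhancement and the McGurk illusion.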
Fig 2
Fig 2. Prior structures.
Illustration of the prior structure of each model compared in the study. A full derivation of the Joint Prior model of audiovisual speech perception is available in the supporting information.
Fig 3
Fig 3. Behavioural responses and model predictions.
Mean behavioural responses (dark bars) and model predictions (light bars) to visual-only (top row), auditory-only (left column) and audiovisual stimuli (central panels) for 16 participants. Error bars represent the standard error of the mean. Visual stimuli are divided into G (left compartment) and B (right compartment) and are presented with descending SNR (left: high SNR to right: low SNR within each compartment). Auditory stimuli are divided into B (top compartment) and G (bottom compartment) and are presented with descending SNR (top: high SNR to bottom: low SNR within each compartment). Each audiovisual stimulus is a combination of the auditory and visual stimulus on the corresponding row and column, presented either in synchrony (blue bars) or out of sync (red bars). The model predictions displayed are cross-validation predictions from the Reduced Joint Prior model.
Fig 4
Fig 4. Modelling results.
A) Prior parameters for synchronous and asynchronous stimuli: binding parameter (0 = full binding, infinite = no binding) for the Full Joint Prior model, and probability of separate causes (0 = full binding, 1 = no binding) for the Full BCI model. B) Auditory and C) visual precision parameters of the Reduced Joint Prior and BCI for clear to noisy stimuli (left to right). The images depict the first author. Error bars represent SEM. D) Improvement in test error over baseline (the Maximum likelihood model) for the Reduced and Full Bayesian model implementations. Error bars represent SEM. E) Auditory weight in the Reduced Joint Prior model, plotted by SNR and SOA.
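For context on panel A, Bayesian causal inference (BCI) models of this kind typically weight a fused estimate by the posterior probability of a common cause. A generic model-averaging read-out from the BCI literature (a sketch; the exact read-out fitted here may differ) is:

\[
p(C{=}1 \mid x_A, x_V) =
\frac{p(x_A, x_V \mid C{=}1)\,(1 - p_{\mathrm{sep}})}
     {p(x_A, x_V \mid C{=}1)\,(1 - p_{\mathrm{sep}}) + p(x_A, x_V \mid C{=}2)\,p_{\mathrm{sep}}},
\]
\[
\hat{s}_A^{AV} = p(C{=}1 \mid x_A, x_V)\,\hat{s}_{\mathrm{fused}}
 + \bigl(1 - p(C{=}1 \mid x_A, x_V)\bigr)\,\hat{s}_A,
\]

where \(p_{\mathrm{sep}}\) is the prior probability of separate causes plotted in panel A (0 = full binding, 1 = no binding).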
