PLoS One. 2021 Feb 19;16(2):e0246986. doi: 10.1371/journal.pone.0246986. eCollection 2021.

Bayesian binding and fusion models explain illusion and enhancement effects in audiovisual speech perception

Alma Lindborg et al. PLoS One. 2021.

Abstract

Speech is perceived with both the ears and the eyes. Adding congruent visual speech improves the perception of a faint auditory speech stimulus, whereas adding incongruent visual speech can alter the perception of the utterance. The latter phenomenon is exemplified by the McGurk illusion, where an auditory stimulus such as "ba" dubbed onto a visual stimulus such as "ga" produces the illusion of hearing "da". Bayesian models of multisensory perception suggest that both the enhancement and the illusion can be described as a two-step process of binding (informed by prior knowledge) and fusion (informed by the reliability of each sensory cue). However, to date no study has accounted for how binding and fusion each contribute to audiovisual speech perception. In this study, we expose subjects to both congruent and incongruent audiovisual speech, manipulating the binding and fusion stages simultaneously by varying both temporal offset (binding) and auditory and visual signal-to-noise ratio (fusion). We fit two Bayesian models to the behavioural data and show that both can account for the enhancement effect in congruent audiovisual speech as well as the McGurk illusion. This modelling approach allows us to disentangle the effects of binding and fusion on behavioural responses. Moreover, we find that these models have greater predictive power than a forced-fusion model. This study provides a systematic and quantitative approach to measuring audiovisual integration in the perception of the McGurk illusion as well as congruent audiovisual speech, which we hope will inform future work on audiovisual speech perception.
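For reference, the fusion stage described in the abstract is commonly formalised as reliability-weighted averaging of the unisensory estimates, which is also the forced-fusion limit against which the binding models are compared. A standard formulation from the cue-combination literature (a sketch, not taken verbatim from this paper) is:

\[
\hat{s}_{AV} = w_A \hat{s}_A + w_V \hat{s}_V,
\qquad
w_A = \frac{1/\sigma_A^{2}}{1/\sigma_A^{2} + 1/\sigma_V^{2}},
\quad
w_V = 1 - w_A,
\]

so the noisier (higher-variance) cue receives the smaller weight; lowering the auditory SNR increases \(\sigma_A^{2}\) and shifts the combined estimate toward the visual cue.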


Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1
Fig 1. The Joint Prior model of audiovisual speech perception.
Upper row: Example plots of prior, likelihood and posterior distributions. The horizontal axis represents the auditory dimension and the vertical axis the visual dimension. The prior is a Gaussian ridge along the A = V diagonal, and the likelihood is a Gaussian (here depicted with greater variance in the visual dimension). The posterior distribution is also Gaussian, pulled in the direction of the A = V diagonal. Lower row: the marginal distributions of the prior, likelihood and posterior in the auditory dimension. Response boundaries (vertical lines) are applied to the marginal posterior, and response probabilities are estimated as the probability mass (yellow area) delimited by the response boundaries.
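A minimal numerical sketch of the computation illustrated in Fig 1, assuming a one-dimensional internal feature axis per modality and grid-based normalisation; all parameter values, observations and response boundaries below are hypothetical, chosen only to mirror the figure:

```python
import numpy as np

# Grid over the internal feature space (e.g. a "ba"-"da"-"ga" continuum).
s = np.linspace(-4, 4, 201)
A, V = np.meshgrid(s, s, indexing="ij")        # axis 0: auditory, axis 1: visual

sigma_prior = 0.5                              # width of the ridge prior around A = V (hypothetical)
prior = np.exp(-(A - V) ** 2 / (2 * sigma_prior ** 2))   # Gaussian ridge along the diagonal

x_A, x_V = -1.0, 1.0                           # noisy auditory / visual observations (hypothetical)
sigma_A, sigma_V = 0.6, 1.2                    # sensory noise; larger sigma_V = less reliable vision
likelihood = (np.exp(-(A - x_A) ** 2 / (2 * sigma_A ** 2)) *
              np.exp(-(V - x_V) ** 2 / (2 * sigma_V ** 2)))

posterior = prior * likelihood
posterior /= posterior.sum()                   # normalise over the 2-D grid

marginal_A = posterior.sum(axis=1)             # marginalise out the visual dimension

# Response probabilities = probability mass between response boundaries on the auditory axis.
b_lo, b_hi = -0.5, 0.5                         # hypothetical category boundaries ("ba" | "da" | "ga")
p_ba = marginal_A[s < b_lo].sum()
p_da = marginal_A[(s >= b_lo) & (s <= b_hi)].sum()
p_ga = marginal_A[s > b_hi].sum()
print(p_ba, p_da, p_ga)
```

Because the ridge prior pulls the posterior toward A = V, the marginal auditory estimate is drawn toward the visual evidence, more strongly when the auditory cue is noisy, which is how the same machinery yields both enhancement and the McGurk illusion.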
Fig 2
Fig 2. Prior structures.
Illustration of the prior structure of each model compared in the study. A full derivation of the Joint Prior model of audiovisual speech perception is available in the supporting information.
Fig 3
Fig 3. Behavioural responses and model predictions.
Mean behavioural responses (dark bars) and model predictions (light bars) to visual-only (top row), auditory-only (left column) and audiovisual stimuli (central panels) for 16 participants. Error bars represent the standard error of the mean. Visual stimuli are divided into G (left compartment) and B (right compartment) and are presented with descending SNR (left: high SNR to right: low SNR within each compartment). Auditory stimuli are divided into B (top compartment) and G (bottom compartment) and are presented with descending SNR (top: high SNR to bottom: low SNR within each compartment). Each audiovisual stimulus is a combination of the auditory and visual stimulus on the corresponding row and column, presented either in synchrony (blue bars) or out of sync (red bars). The model predictions displayed are cross-validation predictions from the Reduced Joint Prior model.
Fig 4
Fig 4. Modelling results.
A) Prior parameters for synchronous and asynchronous stimuli: binding parameter (0 = full binding, infinite = no binding) for the Full Joint Prior model, and probability of separate causes (0 = full binding, 1 = no binding) for the Full BCI model. B) Auditory and C) visual precision parameters of the Reduced Joint Prior and BCI for clear to noisy stimuli (left to right). The images depict the first author. Error bars represent SEM. D) Improvement in test error over baseline (the Maximum likelihood model) for the Reduced and Full Bayesian model implementations. Error bars represent SEM. E) Auditory weight in the Reduced Joint Prior model, plotted by SNR and SOA.
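For context on panel A, Bayesian causal inference (BCI) models of this kind typically weight a fused estimate by the posterior probability of a common cause. A generic model-averaging read-out from the BCI literature (a sketch; the exact read-out fitted here may differ) is:

\[
p(C{=}1 \mid x_A, x_V) =
\frac{p(x_A, x_V \mid C{=}1)\,(1 - p_{\mathrm{sep}})}
     {p(x_A, x_V \mid C{=}1)\,(1 - p_{\mathrm{sep}}) + p(x_A, x_V \mid C{=}2)\,p_{\mathrm{sep}}},
\]
\[
\hat{s}_A^{AV} = p(C{=}1 \mid x_A, x_V)\,\hat{s}_{\mathrm{fused}}
 + \bigl(1 - p(C{=}1 \mid x_A, x_V)\bigr)\,\hat{s}_A,
\]

where \(p_{\mathrm{sep}}\) is the prior probability of separate causes plotted in panel A (0 = full binding, 1 = no binding).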
