Eye Can Hear Clearly Now: Inverse Effectiveness in Natural Audiovisual Speech Processing Relies on Long-Term Crossmodal Temporal Integration

Michael J Crosse et al. J Neurosci. 2016 Sep 21;36(38):9888-9895. doi: 10.1523/JNEUROSCI.1396-16.2016.

Abstract

Speech comprehension is improved by viewing a speaker's face, especially in adverse hearing conditions, a principle known as inverse effectiveness. However, the neural mechanisms that help to optimize how we integrate auditory and visual speech in such suboptimal conversational environments are not yet fully understood. Using human EEG recordings, we examined how visual speech enhances the cortical representation of auditory speech at a signal-to-noise ratio that maximized the perceptual benefit conferred by multisensory processing relative to unisensory processing. We found that the influence of visual input on the neural tracking of the audio speech signal was significantly greater in noisy than in quiet listening conditions, consistent with the principle of inverse effectiveness. Although envelope tracking during audio-only speech was greatly reduced by background noise at an early processing stage, it was markedly restored by the addition of visual speech input. In background noise, multisensory integration occurred at much lower frequencies and was shown to predict the multisensory gain in behavioral performance at a time lag of ∼250 ms. Critically, we demonstrated that inverse effectiveness, in the context of natural audiovisual (AV) speech processing, relies on crossmodal integration over long temporal windows. Our findings suggest that disparate integration mechanisms contribute to the efficient processing of AV speech in background noise.

Significance statement: The behavioral benefit of seeing a speaker's face during conversation is especially pronounced in challenging listening environments. However, the neural mechanisms underlying this phenomenon, known as inverse effectiveness, have not yet been established. Here, we examine this in the human brain using natural speech-in-noise stimuli that were designed specifically to maximize the behavioral benefit of audiovisual (AV) speech. We find that this benefit arises from our ability to integrate multimodal information over longer periods of time. Our data also suggest that the addition of visual speech restores early tracking of the acoustic speech signal during excessive background noise. These findings support and extend current mechanistic perspectives on AV speech perception.

Keywords: EEG; envelope tracking; multisensory integration; speech intelligibility; speech-in-noise; stimulus reconstruction.
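
The keywords "envelope tracking" and "stimulus reconstruction" refer to a backward-modeling approach in which a linear decoder maps multichannel EEG at a range of time lags onto the acoustic speech envelope, and reconstruction accuracy is the correlation between the reconstructed and actual envelopes. The sketch below illustrates that general approach with ridge-regularized least squares; the sampling rate, channel count, lag range, regularization value, and placeholder data are illustrative assumptions, not the authors' exact pipeline.

import numpy as np

def lag_matrix(eeg, lags):
    """Stack EEG channels at a set of positive time lags (in samples) into one design matrix."""
    n_times, n_chans = eeg.shape
    X = np.zeros((n_times, n_chans * len(lags)))
    for i, lag in enumerate(lags):
        shifted = np.roll(eeg, -lag, axis=0)   # EEG at time t + lag is used to reconstruct the stimulus at time t
        if lag > 0:
            shifted[-lag:] = 0                 # discard samples that wrapped around
        X[:, i * n_chans:(i + 1) * n_chans] = shifted
    return X

def fit_decoder(eeg, envelope, lags, lam=1e2):
    """Ridge-regularized backward model mapping lagged EEG to the speech envelope."""
    X = lag_matrix(eeg, lags)
    XtX = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(XtX, X.T @ envelope)

def reconstruction_accuracy(eeg, envelope, weights, lags):
    """Pearson correlation between the reconstructed and actual envelopes."""
    pred = lag_matrix(eeg, lags) @ weights
    return np.corrcoef(pred, envelope)[0, 1]

# Illustrative use: lags spanning 0-500 ms at an assumed 64 Hz sampling rate.
fs = 64
lags = np.arange(0, int(0.5 * fs) + 1)
eeg = np.random.randn(60 * fs, 128)       # one 60 s trial, 128 channels (placeholder data)
envelope = np.random.randn(60 * fs)       # acoustic speech envelope (placeholder data)
w = fit_decoder(eeg, envelope, lags)
r = reconstruction_accuracy(eeg, envelope, w, lags)
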

Figures

Figure 1.
Audio stimuli and behavioral measures. A, Spectrograms of a 4 s segment of speech-in-quiet (left) and speech-in-noise (−9 dB; right). B, Subjectively rated intelligibility for speech-in-noise reported after each 60 s trial. White bar represents the sum of the unisensory scores. Error bars indicate SEM across subjects. Brackets indicate pairwise statistical comparisons (*p < 0.05; **p < 0.01; ***p < 0.001). C, Detection accuracy (left) of target words represented as F1 scores. The dashed black trace represents the statistical facilitation predicted by the unisensory scores. Multisensory gain (right) is represented as a percentage of unisensory performance.
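
As context for panel C, the F1 score is the harmonic mean of precision and recall over target-word detections, and the right-hand panel expresses the audiovisual benefit relative to unisensory performance. The sketch below shows one plausible way to compute both quantities; the exact gain formula is not reproduced in this caption, so the definition relative to the better unisensory score is an assumption.

def f1_score(hits, false_alarms, misses):
    """Harmonic mean of precision and recall for target-word detection."""
    precision = hits / (hits + false_alarms)
    recall = hits / (hits + misses)
    return 2 * precision * recall / (precision + recall)

def multisensory_gain(av_score, a_score, v_score):
    """AV benefit as a percentage of the better unisensory score (assumed definition)."""
    best_unisensory = max(a_score, v_score)
    return 100 * (av_score - best_unisensory) / best_unisensory
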
Figure 2.
Stimulus reconstruction and relationship with behavior. A, Reconstruction accuracy (left) obtained using decoders that integrated EEG across a 500 ms window. The dashed black trace represents the unisensory additive model. The shaded area indicates the 95th percentile of chance-level reconstruction accuracy (permutation test). Multisensory gain (right) represented as a percentage of unisensory performance. Error bars indicate SEM across subjects. Brackets indicate pairwise statistical comparisons (**p < 0.01; ***p < 0.001). B, Reconstruction accuracy obtained using single-lag decoders at every lag between 0 and 500 ms. The markers running along the bottom of each plot indicate the time lags at which MSI_EEG is significant (p < 0.05, Holm–Bonferroni corrected). C, Correlation coefficient (top) and corresponding p-value (bottom) between MSI_EEG and MSI_Behav at individual time lags for speech-in-noise. The shaded area indicates the lags at which the correlation is significant or trending toward significance (220–250 ms; p < 0.05). D, Correlation corresponding to the shaded area in C, with MSI_EEG and MSI_Behav represented in their original units (left) and as percentage gain (right).
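
The chance level shown as a shaded area in panel A is typically estimated with a permutation procedure: the pairing between EEG and envelope is destroyed (for example, by circularly shifting the envelope), the decoder is refit, and the 95th percentile of the resulting null distribution is taken. The sketch below, which reuses fit_decoder and reconstruction_accuracy from the earlier sketch, illustrates one such scheme; the shift-based permutation and the number of permutations are assumptions, not necessarily the authors' procedure.

import numpy as np

def chance_level(eeg_train, env_train, eeg_test, env_test, lags, n_perm=1000, lam=1e2, seed=0):
    """95th percentile of reconstruction accuracy for decoders trained on misaligned data
    (circular-shift permutation; an assumed scheme, reusing the helpers defined above)."""
    rng = np.random.default_rng(seed)
    null = np.empty(n_perm)
    for p in range(n_perm):
        shift = rng.integers(1, len(env_train))   # random circular shift breaks the EEG-envelope pairing
        w = fit_decoder(eeg_train, np.roll(env_train, shift), lags, lam)
        null[p] = reconstruction_accuracy(eeg_test, env_test, w, lags)
    return np.percentile(null, 95)
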
Figure 3.
AV speech integration at multiple timescales. A, Reconstruction accuracy for AV (blue) and A+V (green) at each frequency band. The shaded area indicates the 5th to 95th percentile of chance-level reconstruction accuracy (permutation test). Error bars indicate SEM across subjects. B, Multisensory enhancement at each frequency band. The markers indicate frequency bands at which there was a significant multisensory interaction effect (p < 0.05, Holm–Bonferroni corrected). C, Average rate of different linguistic units derived from the audio files of the speech stimuli using phoneme-alignment software. The brackets indicate mean ± SD.
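
Panel A implies that the signals were filtered into separate frequency bands before decoding. A zero-phase band-pass filter, as sketched below, is a common way to do this; the filter order and the example band edges are illustrative assumptions rather than the bands used in the figure.

from scipy.signal import butter, filtfilt

def bandpass(x, low_hz, high_hz, fs, order=3):
    """Zero-phase Butterworth band-pass filter applied along the time axis."""
    b, a = butter(order, [low_hz, high_hz], btype='bandpass', fs=fs)
    return filtfilt(b, a, x, axis=0)

# Example bands (Hz) spanning prosodic to syllabic/phonemic timescales (illustrative edges).
bands = [(0.5, 2), (2, 4), (4, 8), (8, 16)]
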
Figure 4.
AV temporal integration. A, Model performance by decoder temporal window size. Error bars indicate SEM across participants. B, Multisensory gain by decoder temporal window size. Markers indicate window sizes at which there was significant inverse effectiveness (i.e., −9 dB > quiet; *p < 0.05; **p < 0.01). C, Inverse effectiveness by decoder temporal window size.
