Eye Can Hear Clearly Now: Inverse Effectiveness in Natural Audiovisual Speech Processing Relies on Long-Term Crossmodal Temporal Integration

Michael J Crosse et al. J Neurosci. 2016 Sep 21;36(38):9888-9895. doi: 10.1523/JNEUROSCI.1396-16.2016.

Abstract

Speech comprehension is improved by viewing a speaker's face, especially in adverse hearing conditions, a principle known as inverse effectiveness. However, the neural mechanisms that help to optimize how we integrate auditory and visual speech in such suboptimal conversational environments are not yet fully understood. Using human EEG recordings, we examined how visual speech enhances the cortical representation of auditory speech at a signal-to-noise ratio that maximized the perceptual benefit conferred by multisensory processing relative to unisensory processing. We found that the influence of visual input on the neural tracking of the audio speech signal was significantly greater in noisy than in quiet listening conditions, consistent with the principle of inverse effectiveness. Although envelope tracking during audio-only speech was greatly reduced by background noise at an early processing stage, it was markedly restored by the addition of visual speech input. In background noise, multisensory integration occurred at much lower frequencies and was shown to predict the multisensory gain in behavioral performance at a time lag of ∼250 ms. Critically, we demonstrated that inverse effectiveness, in the context of natural audiovisual (AV) speech processing, relies on crossmodal integration over long temporal windows. Our findings suggest that disparate integration mechanisms contribute to the efficient processing of AV speech in background noise.

Significance statement: The behavioral benefit of seeing a speaker's face during conversation is especially pronounced in challenging listening environments. However, the neural mechanisms underlying this phenomenon, known as inverse effectiveness, have not yet been established. Here, we examine this in the human brain using natural speech-in-noise stimuli that were designed specifically to maximize the behavioral benefit of audiovisual (AV) speech. We find that this benefit arises from our ability to integrate multimodal information over longer periods of time. Our data also suggest that the addition of visual speech restores early tracking of the acoustic speech signal during excessive background noise. These findings support and extend current mechanistic perspectives on AV speech perception.

Keywords: EEG; envelope tracking; multisensory integration; speech intelligibility; speech-in-noise; stimulus reconstruction.
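
The keywords "envelope tracking" and "stimulus reconstruction" refer to a backward-modeling approach in which a linear decoder maps multichannel EEG at a range of time lags onto the acoustic speech envelope, and reconstruction accuracy is the correlation between the reconstructed and actual envelopes. The sketch below illustrates that general approach with ridge-regularized least squares; the sampling rate, channel count, lag range, regularization value, and placeholder data are illustrative assumptions, not the authors' exact pipeline.

import numpy as np

def lag_matrix(eeg, lags):
    """Stack EEG channels at a set of positive time lags (in samples) into one design matrix."""
    n_times, n_chans = eeg.shape
    X = np.zeros((n_times, n_chans * len(lags)))
    for i, lag in enumerate(lags):
        shifted = np.roll(eeg, -lag, axis=0)   # EEG at time t + lag is used to reconstruct the stimulus at time t
        if lag > 0:
            shifted[-lag:] = 0                 # discard samples that wrapped around
        X[:, i * n_chans:(i + 1) * n_chans] = shifted
    return X

def fit_decoder(eeg, envelope, lags, lam=1e2):
    """Ridge-regularized backward model mapping lagged EEG to the speech envelope."""
    X = lag_matrix(eeg, lags)
    XtX = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(XtX, X.T @ envelope)

def reconstruction_accuracy(eeg, envelope, weights, lags):
    """Pearson correlation between the reconstructed and actual envelopes."""
    pred = lag_matrix(eeg, lags) @ weights
    return np.corrcoef(pred, envelope)[0, 1]

# Illustrative use: lags spanning 0-500 ms at an assumed 64 Hz sampling rate.
fs = 64
lags = np.arange(0, int(0.5 * fs) + 1)
eeg = np.random.randn(60 * fs, 128)       # one 60 s trial, 128 channels (placeholder data)
envelope = np.random.randn(60 * fs)       # acoustic speech envelope (placeholder data)
w = fit_decoder(eeg, envelope, lags)
r = reconstruction_accuracy(eeg, envelope, w, lags)
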

Figures

Figure 1.
Audio stimuli and behavioral measures. A, Spectrograms of a 4 s segment of speech-in-quiet (left) and speech-in-noise (−9 dB; right). B, Subjectively rated intelligibility for speech-in-noise reported after each 60 s trial. White bar represents the sum of the unisensory scores. Error bars indicate SEM across subjects. Brackets indicate pairwise statistical comparisons (*p < 0.05; **p < 0.01; ***p < 0.001). C, Detection accuracy (left) of target words represented as F1 scores. The dashed black trace represents the statistical facilitation predicted by the unisensory scores. Multisensory gain (right) is represented as a percentage of unisensory performance.
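
As context for panel C, the F1 score is the harmonic mean of precision and recall over target-word detections, and the right-hand panel expresses the audiovisual benefit relative to unisensory performance. The sketch below shows one plausible way to compute both quantities; the exact gain formula is not reproduced in this caption, so the definition relative to the better unisensory score is an assumption.

def f1_score(hits, false_alarms, misses):
    """Harmonic mean of precision and recall for target-word detection."""
    precision = hits / (hits + false_alarms)
    recall = hits / (hits + misses)
    return 2 * precision * recall / (precision + recall)

def multisensory_gain(av_score, a_score, v_score):
    """AV benefit as a percentage of the better unisensory score (assumed definition)."""
    best_unisensory = max(a_score, v_score)
    return 100 * (av_score - best_unisensory) / best_unisensory
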
Figure 2.
Stimulus reconstruction and relationship with behavior. A, Reconstruction accuracy (left) obtained using decoders that integrated EEG across a 500 ms window. The dashed black trace represents the unisensory additive model. The shaded area indicates the 95th percentile of chance-level reconstruction accuracy (permutation test). Multisensory gain (right) represented as a percentage of unisensory performance. Error bars indicate SEM across subjects. Brackets indicate pairwise statistical comparisons (**p < 0.01; ***p < 0.001). B, Reconstruction accuracy obtained using single-lag decoders at every lag between 0 and 500 ms. The markers running along the bottom of each plot indicate the time lags at which MSI_EEG is significant (p < 0.05, Holm–Bonferroni corrected). C, Correlation coefficient (top) and corresponding p-value (bottom) between MSI_EEG and MSI_Behav at individual time lags for speech-in-noise. The shaded area indicates the lags at which the correlation is significant or trending toward significance (220–250 ms; p < 0.05). D, Correlation corresponding to the shaded area in C, with MSI_EEG and MSI_Behav represented in their original units (left) and as percentage gain (right).
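
The chance level shown as a shaded area in panel A is typically estimated with a permutation procedure: the pairing between EEG and envelope is destroyed (for example, by circularly shifting the envelope), the decoder is refit, and the 95th percentile of the resulting null distribution is taken. The sketch below, which reuses fit_decoder and reconstruction_accuracy from the earlier sketch, illustrates one such scheme; the shift-based permutation and the number of permutations are assumptions, not necessarily the authors' procedure.

import numpy as np

def chance_level(eeg_train, env_train, eeg_test, env_test, lags, n_perm=1000, lam=1e2, seed=0):
    """95th percentile of reconstruction accuracy for decoders trained on misaligned data
    (circular-shift permutation; an assumed scheme, reusing the helpers defined above)."""
    rng = np.random.default_rng(seed)
    null = np.empty(n_perm)
    for p in range(n_perm):
        shift = rng.integers(1, len(env_train))   # random circular shift breaks the EEG-envelope pairing
        w = fit_decoder(eeg_train, np.roll(env_train, shift), lags, lam)
        null[p] = reconstruction_accuracy(eeg_test, env_test, w, lags)
    return np.percentile(null, 95)
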
Figure 3.
AV speech integration at multiple timescales. A, Reconstruction accuracy for AV (blue) and A+V (green) at each frequency band. The shaded area indicates the 5th to 95th percentile of chance-level reconstruction accuracy (permutation test). Error bars indicate SEM across subjects. B, Multisensory enhancement at each frequency band. The markers indicate frequency bands at which there was a significant multisensory interaction effect (p < 0.05, Holm–Bonferroni corrected). C, Average rate of different linguistic units derived from the audio files of the speech stimuli using phoneme-alignment software. The brackets indicate mean ± SD.
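
Panel A implies that the signals were filtered into separate frequency bands before decoding. A zero-phase band-pass filter, as sketched below, is a common way to do this; the filter order and the example band edges are illustrative assumptions rather than the bands used in the figure.

from scipy.signal import butter, filtfilt

def bandpass(x, low_hz, high_hz, fs, order=3):
    """Zero-phase Butterworth band-pass filter applied along the time axis."""
    b, a = butter(order, [low_hz, high_hz], btype='bandpass', fs=fs)
    return filtfilt(b, a, x, axis=0)

# Example bands (Hz) spanning prosodic to syllabic/phonemic timescales (illustrative edges).
bands = [(0.5, 2), (2, 4), (4, 8), (8, 16)]
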
Figure 4.
AV temporal integration. A, Model performance by decoder temporal window size. Error bars indicate SEM across participants. B, Multisensory gain by decoder temporal window size. Markers indicate window sizes at which there was significant inverse effectiveness (i.e., −9 dB > quiet; *p < 0.05; **p < 0.01). C, Inverse effectiveness by decoder temporal window size.
