Neuroimage. 2023 Jul 1;274:120143. doi: 10.1016/j.neuroimage.2023.120143. Epub 2023 Apr 29.

The integration of continuous audio and visual speech in a cocktail-party environment depends on attention


Farhin Ahmed et al. Neuroimage. 2023.

Abstract

In noisy environments, our ability to understand speech benefits greatly from seeing the speaker's face. This is attributed to the brain's ability to integrate audio and visual information, a process known as multisensory integration. In addition, selective attention plays an enormous role in what we understand, the so-called cocktail-party phenomenon. But how attention and multisensory integration interact remains incompletely understood, particularly in the case of natural, continuous speech. Here, we addressed this issue by analyzing EEG data recorded from participants who undertook a multisensory cocktail-party task using natural speech. To assess multisensory integration, we modeled the EEG responses to the speech in two ways. The first assumed that audiovisual speech processing is simply a linear combination of audio speech processing and visual speech processing (i.e., an A + V model), while the second allowed for the possibility of audiovisual interactions (i.e., an AV model). Applying these models to the data revealed that EEG responses to attended audiovisual speech were better explained by an AV model, providing evidence for multisensory integration. In contrast, responses to unattended audiovisual speech were better captured by an A + V model, suggesting that multisensory integration is suppressed for unattended speech. Follow-up analyses revealed some limited evidence for early multisensory integration of unattended AV speech, with no integration occurring at later levels of processing. We take these findings as evidence that the integration of natural audio and visual speech occurs at multiple levels of processing in the brain, each of which can be differentially affected by attention.
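To make the model comparison concrete, below is a minimal, self-contained sketch of envelope reconstruction via ridge regression on time-lagged EEG, the general class of backward model described above. All data, shapes, lag ranges, and the regularization value are synthetic stand-ins; the paper's actual pipeline (preprocessing, cross-validation, and how the AV and A + V decoders are derived) is specified in its Methods.

```python
# Minimal sketch: reconstruct a speech envelope from time-lagged EEG with
# ridge regression, and score it with Pearson's r. Synthetic data throughout;
# illustrative of the general approach, not the paper's exact pipeline.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

def lag_matrix(eeg, lags):
    """Stack lagged copies of the EEG (time x channels) into a design
    matrix (time x channels*len(lags)); row t holds eeg[t + lag]."""
    n_t, n_ch = eeg.shape
    X = np.zeros((n_t, n_ch * len(lags)))
    for i, lag in enumerate(lags):
        dst = slice(max(0, -lag), min(n_t, n_t - lag))
        src = slice(max(0, lag), min(n_t, n_t + lag))
        X[dst, i * n_ch:(i + 1) * n_ch] = eeg[src]
    return X

def fit_decoder(eeg, envelope, lags, lam=1e2):
    """Ridge regression from lagged EEG to the speech envelope."""
    X = lag_matrix(eeg, lags)
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ envelope)

def accuracy(decoder, eeg, envelope, lags):
    """Reconstruction accuracy: Pearson's r, actual vs. reconstructed."""
    return pearsonr(lag_matrix(eeg, lags) @ decoder, envelope)[0]

# Synthetic stand-ins: one EEG channel weakly tracks the envelope.
n_t, n_ch = 4000, 32
env = rng.standard_normal(n_t)
eeg = 0.5 * rng.standard_normal((n_t, n_ch))
eeg[:, 0] += env
lags = range(0, 33)  # e.g. 0-500 ms at a hypothetical 64 Hz sampling rate

decoder = fit_decoder(eeg[:3000], env[:3000], lags)
print(accuracy(decoder, eeg[3000:], env[3000:], lags))
```

Under this scheme, the comparison in the paper amounts to training one decoder that is allowed to capture audiovisual interactions (AV) and one constrained to the sum of the unisensory responses (A + V), then comparing their reconstruction accuracies on the same audiovisual EEG, as quantified in Fig. 1.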

Keywords: Cocktail party; Hierarchical processing; Multisensory integration; Speech.


Conflict of interest statement

Declaration of Competing Interest The authors declare that they have no competing interests, financial or otherwise.

Figures

Fig. 1. Experimental procedure and analysis approach.
Silhouette images are used to protect the identity of the speakers. A. Participants were presented with a male speaker in −9 dB noise in audiovisual (AV), audio-only (A), and video-only (V) formats. Envelope-reconstruction models (AV and A + V decoders) were derived from their EEG recordings. The channel weightings for each decoder, averaged across time lags from 0 to 500 ms and across all participants (N = 21), are shown on the right. B. Part of this figure is replicated from O'Sullivan et al., 2019. A separate set of participants performed a multisensory cocktail-party task in which they were presented with an audiovisual speaker (A1V1) directly in front of them and an audio-only speaker (A2) at 30° to their right (over headphones). They had to either attend to or ignore A1V1. The AV and A + V decoders obtained from the single-speaker paradigm were applied to reconstruct the envelope of the audiovisual speaker (A1) when he was attended vs. unattended. Multisensory integration in both conditions was quantified as the difference between the reconstruction accuracies (Pearson's r between the actual and reconstructed envelope of A1) obtained using the AV decoder (rAV) and the A + V decoder (rA+V).
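Restating the caption's quantification compactly (with $s$ the actual envelope of A1, $\hat{s}$ its reconstruction from EEG, and $\mathrm{corr}$ Pearson's correlation):

$$ r_{\mathrm{AV}} = \mathrm{corr}\big(s, \hat{s}_{\mathrm{AV}}\big), \qquad r_{\mathrm{A+V}} = \mathrm{corr}\big(s, \hat{s}_{\mathrm{A+V}}\big), \qquad \mathrm{MSI\ gain} = r_{\mathrm{AV}} - r_{\mathrm{A+V}} $$

A positive gain means the decoder allowed to capture audiovisual interactions outperforms the additive one, which is the signature of multisensory integration used throughout the figures.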
Fig. 2. Robust multisensory integration when AV speech is attended vs. unattended.
A. Both the attended (left) and unattended (right) audiovisual speech envelopes could be reconstructed significantly better than chance (determined by permutation test). B. Grand-average (N = 17) normalized multisensory gain was positive for attended, and negative for unattended, audiovisual speech. Error bars indicate the standard error of the mean. C. Normalized multisensory gain at the single-participant level. The sharp dissociation across attention (more positive multisensory gain for attended than for unattended speech; trial-averaged) was visible for all but participant 4. D. Topographical distribution of EEG prediction accuracies (obtained from forward modeling). The black markers indicate channels where the multisensory integration effect [AV − (A + V)] was significant across participants (p < 0.05, two-sided cluster-based permutation test). *** p < 0.001, ** p < 0.01, * p < 0.05.
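The chance level in panel A is established by recomputing the reconstruction accuracy after the temporal correspondence between EEG and envelope has been destroyed. Below is a minimal sketch of one common construction (circularly shifted envelopes); the paper's exact permutation scheme is described in its Methods, and all data here are synthetic.

```python
# Sketch of a permutation test for reconstruction accuracy: compare the
# observed Pearson's r against a null distribution built from envelopes
# whose alignment with the reconstruction has been scrambled.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
recon = rng.standard_normal(3000)            # reconstructed envelope (stand-in)
actual = recon + rng.standard_normal(3000)   # actual envelope, weakly related

r_obs = pearsonr(recon, actual)[0]

# Null: correlations after circularly shifting the actual envelope by a
# random offset, which preserves its statistics but breaks the alignment.
null = np.array([
    pearsonr(recon, np.roll(actual, rng.integers(100, 2900)))[0]
    for _ in range(1000)
])
p = (np.sum(null >= r_obs) + 1) / (null.size + 1)
print(f"observed r = {r_obs:.3f}, permutation p = {p:.4f}")
```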
Fig. 3. Spatiotemporal analysis reveals a more complex dissociation of multisensory interaction across attention conditions.
A. Single-lag reconstruction accuracies obtained using the AV, A, and V decoders at every lag between −500 and 500 ms, using the single-speaker-in-noise data. B. Single-lag reconstruction accuracies obtained using the AV and (A + V) decoders at every lag between −500 and 500 ms, using the cocktail-party data. The green and purple dots indicate time lags at which the reconstruction accuracy of the speech envelope is significantly above chance using the AV and (A + V) decoders, respectively (p < 0.01, permutation tests). The gray rectangle running along the bottom of each panel indicates the time lags at which the statistical comparison between the performances of the AV and (A + V) decoders was made. The black markers in each panel indicate time lags at which the multisensory integration effect [AV − (A + V)] was significant across participants (p < 0.05, paired t-test, FDR corrected). C. Topographic distributions of EEG prediction accuracies at specific time lags.
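The single-lag analysis in panels A and B fits a separate decoder at each individual time lag, rather than one decoder spanning all lags at once. A minimal sketch under that reading (synthetic data; the function name, sampling rate, and regularization are illustrative, and fitting and testing on the same data is for brevity only):

```python
# Sketch of single-lag decoding: one ridge decoder per time lag, with
# reconstruction accuracy reported as a function of lag.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
n_t, n_ch, fs = 4000, 32, 64
env = rng.standard_normal(n_t)                 # synthetic speech envelope
eeg = rng.standard_normal((n_t, n_ch))         # synthetic EEG (time x channels)
eeg[:, 0] += np.roll(env, 10)                  # one channel tracks env ~156 ms later

def single_lag_r(eeg, env, lag, lam=1e2):
    """Fit and evaluate a decoder restricted to a single EEG lag (in samples).
    Fit and test on the same data here for brevity; cross-validate in practice."""
    X = np.roll(eeg, -lag, axis=0)             # align EEG at t+lag with env at t
    w = np.linalg.solve(X.T @ X + lam * np.eye(eeg.shape[1]), X.T @ env)
    return pearsonr(X @ w, env)[0]

for ms in np.arange(-500, 501, 50):            # lags from -500 to 500 ms
    r = single_lag_r(eeg, env, int(ms * fs / 1000))
    print(f"lag {ms:5d} ms: r = {r:.3f}")
```

Accuracy should peak near the lag at which the EEG actually tracks the envelope (here, around 150 ms by construction), which is the kind of lag-resolved profile the figure plots.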

