J Neurosci. 2024 Mar 6;44(10):e0870232023.
doi: 10.1523/JNEUROSCI.0870-23.2023

Attention Drives Visual Processing and Audiovisual Integration During Multimodal Communication

Noor Seijdel et al. J Neurosci. 2024.

Abstract

During communication in real-life settings, our brain often needs to integrate auditory and visual information and at the same time actively focus on the relevant sources of information, while ignoring interference from irrelevant events. The interaction between integration and attention processes remains poorly understood. Here, we use rapid invisible frequency tagging and magnetoencephalography to investigate how attention affects auditory and visual information processing and integration, during multimodal communication. We presented human participants (male and female) with videos of an actress uttering action verbs (auditory; tagged at 58 Hz) accompanied by two movie clips of hand gestures on both sides of fixation (attended stimulus tagged at 65 Hz; unattended stimulus tagged at 63 Hz). Integration difficulty was manipulated by a lower-order auditory factor (clear/degraded speech) and a higher-order visual semantic factor (matching/mismatching gesture). We observed an enhanced neural response to the attended visual information during degraded speech compared to clear speech. For the unattended information, the neural response to mismatching gestures was enhanced compared to matching gestures. Furthermore, signal power at the intermodulation frequencies of the frequency tags, indexing nonlinear signal interactions, was enhanced in the left frontotemporal and frontal regions. Focusing on the left inferior frontal gyrus, this enhancement was specific for the attended information, for those trials that benefitted from integration with a matching gesture. Together, our results suggest that attention modulates audiovisual processing and interaction, depending on the congruence and quality of the sensory input.
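To make the intermodulation logic concrete: if a neural response combines the 65 Hz visual drive and the 58 Hz auditory drive nonlinearly (e.g., multiplicatively), spectral power emerges at the difference and sum frequencies, 65 − 58 = 7 Hz and 65 + 58 = 123 Hz. The toy simulation below only illustrates that arithmetic and is not the authors' analysis pipeline; the sampling rate, signal duration, and the multiplicative interaction term are assumptions.

```python
# Toy illustration (not the authors' pipeline): a multiplicative interaction
# between a 65 Hz visual drive and a 58 Hz auditory drive produces spectral
# peaks at the intermodulation frequencies 65 - 58 = 7 Hz and 65 + 58 = 123 Hz.
import numpy as np

fs = 1000.0                      # assumed sampling rate (Hz)
t = np.arange(0, 2.0, 1.0 / fs)  # 2 s of signal

visual = np.sin(2 * np.pi * 65 * t)    # attended visual tag (65 Hz)
auditory = np.sin(2 * np.pi * 58 * t)  # auditory tag (58 Hz)

# Linear mixture: power only at 58 and 65 Hz.
linear = visual + auditory
# Nonlinear (multiplicative) interaction: adds 7 Hz and 123 Hz components,
# since sin(a)sin(b) = 0.5*[cos(a-b) - cos(a+b)].
nonlinear = visual + auditory + 0.5 * visual * auditory

freqs = np.fft.rfftfreq(t.size, 1.0 / fs)
for name, sig in [("linear", linear), ("nonlinear", nonlinear)]:
    power = np.abs(np.fft.rfft(sig)) ** 2
    peak_7hz = power[np.argmin(np.abs(freqs - 7.0))]
    print(f"{name}: power near 7 Hz = {peak_7hz:.1f}")
```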

Keywords: attention; audiovisual integration; magnetoencephalography (MEG); multimodal communication; neural processing; rapid invisible frequency tagging (RIFT).


Conflict of interest statement

The authors declare no competing financial interests.

Figures

Figure 1.
Experimental paradigm. Participants were asked to attend to one of the videos, indicated by a cue. The attended video was frequency-tagged at 65 Hz, and the unattended video at 63 Hz. Speech was frequency-tagged at 58 Hz. Participants were asked to attentively watch and listen to the videos. After the video, participants were presented with four written options and had to identify which verb they heard in the video by pressing one of 4 buttons on an MEG-compatible button box. This task ensured that participants were attentively watching the videos and was used to check whether the verbs were understood. Participants were instructed not to blink during the video presentation. In addition to the normal trials, “attention trials” were included in which participants were asked to detect a change in brightness.
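As a rough sketch of how such tagging signals could be constructed (the record does not include the stimulus code; the display refresh rate, modulation depth, and the placeholder speech waveform below are assumptions), the two videos would be luminance-modulated at 65 Hz and 63 Hz and the speech amplitude-modulated at 58 Hz:

```python
# Hypothetical sketch of RIFT-style tagging signals. The actual stimulus
# generation (luminance range, modulation depth, refresh rate) is not given
# in this record; all parameters below are illustrative assumptions.
import numpy as np

fs_video = 1440.0    # assumed high-refresh display rate (Hz), needed for >60 Hz flicker
fs_audio = 44100.0   # assumed audio sampling rate (Hz)
dur = 1.0            # 1 s of stimulus

def tag_envelope(t, f_tag, depth=1.0):
    """Sinusoidal modulation envelope in [0, 1] at the tagging frequency."""
    return 0.5 + 0.5 * depth * np.sin(2 * np.pi * f_tag * t)

t_video = np.arange(0, dur, 1.0 / fs_video)
t_audio = np.arange(0, dur, 1.0 / fs_audio)

attended_flicker = tag_envelope(t_video, 65.0)    # attended video: 65 Hz luminance tag
unattended_flicker = tag_envelope(t_video, 63.0)  # unattended video: 63 Hz luminance tag

speech = np.random.randn(t_audio.size)                # placeholder for the speech waveform
speech_tagged = speech * tag_envelope(t_audio, 58.0)  # 58 Hz amplitude modulation
```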
Figure 2.
Verb categorization behavior. A, Accuracy per condition. Response accuracy is highest for clear speech and when the gesture matches the speech signal. B, Reaction time (RT) per condition. RTs are faster for clear speech and when the gesture matches the speech signal.
Figure 3.
Power at temporal and occipital sensors and corresponding source regions (% increase compared to a poststimulus baseline), averaged across conditions. A, Average ERF for a single subject at selected sensors overlying the left and right temporal lobes. Auditory input was tagged by a 58 Hz amplitude modulation; tagging was phase-locked over trials. ERFs show combined planar gradient data. B, Average ERF for a single subject at selected sensors overlying the occipital lobe. Visual input was tagged by a 65 Hz and a 63 Hz flicker. C, Power increase at temporal sensors at the tagged frequency of the auditory stimulus (58 Hz). D, Power increases at occipital sensors at the visual tagging frequencies (63 Hz, unattended; 65 Hz, attended). E, Power increase in the auditory cortex at the tagged frequency of the auditory stimulus (58 Hz). F, Power increases in the visual cortex at the visual tagging frequencies (63 Hz, unattended; 65 Hz, attended). Shaded error bars represent the standard error.
Figure 4.
Sources of power at the auditory tagged signal (58 Hz) and the visually tagged signals (65 Hz and 63 Hz). A, Power change (%) when comparing power in the stimulus window to a poststimulus baseline for the different tagging frequencies, pooled over conditions. Power change is largest over temporal regions for the auditory tagging frequency and over occipital regions for the visually tagged signals. B, Power change values (%) extracted from the ROIs. Raincloud plots show the raw data, density, and boxplots for power change in the different conditions. CM, clear speech with a matching gesture; CMM, clear speech with a mismatching gesture; DM, degraded speech with a matching gesture; DMM, degraded speech with a mismatching gesture.
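The "% power change relative to a poststimulus baseline" measure used in these figures can be approximated as in the minimal sketch below; the window lengths, sampling rate, and Welch PSD settings are assumptions rather than the authors' reported analysis parameters.

```python
# Minimal sketch of a percent power change at a tagging frequency, relative
# to a poststimulus baseline window; sampling rate and PSD settings are
# assumptions, not the authors' reported parameters.
import numpy as np
from scipy.signal import welch

def percent_power_change(stim, baseline, fs, f_tag):
    """100 * (P_stim - P_base) / P_base at the frequency bin nearest f_tag."""
    f_s, p_stim = welch(stim, fs=fs, nperseg=int(fs))      # ~1 Hz resolution
    f_b, p_base = welch(baseline, fs=fs, nperseg=int(fs))
    idx = np.argmin(np.abs(f_s - f_tag))
    return 100.0 * (p_stim[idx] - p_base[idx]) / p_base[idx]

# Synthetic example: extra 58 Hz power during stimulation relative to baseline noise.
fs = 1200.0  # assumed MEG sampling rate (Hz)
rng = np.random.default_rng(0)
t = np.arange(0, 2.0, 1.0 / fs)
baseline = rng.standard_normal(t.size)
stim = baseline + 0.5 * np.sin(2 * np.pi * 58 * t)

print(percent_power_change(stim, baseline, fs, f_tag=58.0))
```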
Figure 5.
Power at the intermodulation frequencies (f_visual − f_auditory; e.g., 65 − 58 = 7 Hz). A, Power over left frontal sensors (% increase compared to a poststimulus baseline). B, Power over the LIFG source region (% increase compared to a poststimulus baseline). C, Sources of power at 7 Hz. D, Power change values (%) extracted from the LIFG in source space. Raincloud plots show the raw data, density, and boxplots for power change per condition.
Figure 6.
A, Power over the LIFG source region (% increase compared to the mismatching gesture conditions). Shaded error bars represent the standard error. B, Power was higher in the CM condition than in the CMM condition across the temporal lobe. Comparing DM and DMM, we observed enhanced activity in the LIFG, left parietal regions, and occipital cortex.
