Front Neurosci. 2021 Jul 26;15:661477. doi: 10.3389/fnins.2021.661477. eCollection 2021.

Cross-Modal Interaction Between Auditory and Visual Input Impacts Memory Retrieval

Viorica Marian et al. Front Neurosci.

Abstract

How we perceive and learn about our environment is influenced by our prior experiences and existing representations of the world. Top-down cognitive processes, such as attention and expectations, can alter how we process sensory stimuli, both within a modality (e.g., effects of auditory experience on auditory perception), as well as across modalities (e.g., effects of visual feedback on sound localization). Here, we demonstrate that experience with different types of auditory input (spoken words vs. environmental sounds) modulates how humans remember concurrently-presented visual objects. Participants viewed a series of line drawings (e.g., picture of a cat) displayed in one of four quadrants while listening to a word or sound that was congruent (e.g., "cat" or <meow>), incongruent (e.g., "motorcycle" or <vroom-vroom>), or neutral (e.g., a meaningless pseudoword or a tonal beep) relative to the picture. Following the encoding phase, participants were presented with the original drawings plus new drawings and asked to indicate whether each one was "old" or "new." If a drawing was designated as "old," participants then reported where it had been displayed. We find that words and sounds both elicit more accurate memory for what objects were previously seen, but only congruent environmental sounds enhance memory for where objects were positioned - this, despite the fact that the auditory stimuli were not meaningful spatial cues of the objects' locations on the screen. Given that during real-world listening conditions, environmental sounds, but not words, reliably originate from the location of their referents, listening to sounds may attune the visual dorsal pathway to facilitate attention and memory for objects' locations. We propose that audio-visual associations in the environment and in our previous experience jointly contribute to visual memory, strengthening visual memory through exposure to auditory input.
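The design yields two measures per auditory condition: item recognition (was a previously seen drawing correctly called "old"?) and location memory (for drawings called "old," was the correct quadrant reported?). The sketch below illustrates one plausible way to tally these measures; it is not the authors' analysis code, and the trial records and field names are hypothetical.

```python
from collections import defaultdict

# Hypothetical trial records (not the study's data): each old item's auditory
# condition, the participant's old/new response, and, when an old item was
# called "old", whether the reported quadrant matched the encoding location.
trials = [
    {"condition": "congruent",   "is_old": True, "said_old": True,  "loc_correct": True},
    {"condition": "incongruent", "is_old": True, "said_old": True,  "loc_correct": False},
    {"condition": "neutral",     "is_old": True, "said_old": False, "loc_correct": None},
]

recognition = defaultdict(lambda: [0, 0])  # condition -> [hits, old items]
location = defaultdict(lambda: [0, 0])     # condition -> [correct quadrants, recognized old items]

for t in trials:
    if not t["is_old"]:
        continue  # new items would feed false-alarm rates, not shown here
    recognition[t["condition"]][1] += 1
    if t["said_old"]:
        recognition[t["condition"]][0] += 1
        location[t["condition"]][1] += 1
        if t["loc_correct"]:
            location[t["condition"]][0] += 1

for cond, (hits, n_old) in recognition.items():
    loc_hits, n_rec = location[cond]
    print(f"{cond}: recognition {hits}/{n_old}, location {loc_hits}/{n_rec}")
```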

Keywords: audio-visual processing; auditory experience; cross-modal interaction; environmental sounds; multisensory integration; spatial memory; spoken words; visual memory.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

FIGURE 1
Possible processing route whereby activation from auditory (e.g., a meowing sound) or visual (e.g., a picture of a cat) stimuli could spread to a corresponding conceptual representation (e.g., of a cat) and associated visual features, thereby strengthening the salience of the visual object.
FIGURE 2
Example multi-trial run of spoken words (Top) and environmental sounds (Bottom) during encoding. On each trial, participants were presented with a central fixation cross for 200 ms, which was replaced by the concurrent presentation of a task-irrelevant, spatially uninformative auditory cue and a picture in one of four locations, which remained on screen for 1,000 ms prior to the beginning of the next trial. (A code sketch of this trial timing follows the figure captions.)
FIGURE 3
Example spoken word (Top) and environmental sound (Bottom) retrieval trial. On each trial, participants were first presented with a single picture in the center of the screen and asked to indicate whether they recognized it from the encoding phase by clicking “old” or whether it was not previously seen by clicking “new.” If a picture was designated as “old,” participants were then asked to indicate the spatial location where the picture appeared during the encoding phase by clicking on one of four quadrants labeled “top left,” “top right,” “bottom left,” or “bottom right.”
FIGURE 4
Effect of trial type on recognition accuracy for spoken words. Visual objects initially paired with congruent (white) or incongruent (black) spoken words were recognized with significantly greater accuracy than those paired with neutral tones (solid gray) and with marginally greater accuracy than pictures paired with neutral words (striped gray). Accuracy did not differ between neutral tone and neutral word trials or between congruent and incongruent trials. Error bars represent standard error. p < 0.10, *p < 0.05.
FIGURE 5
Effect of trial type on location accuracy for spoken words. Location accuracy did not differ between congruent (white), neutral word (striped gray), neutral tone (solid gray), and incongruent (black) spoken word trials. Error bars represent standard error.
FIGURE 6
Effect of trial type on recognition accuracy for environmental sounds. Visual objects initially paired with congruent (white) or incongruent (black) environmental sounds were recognized with significantly greater accuracy than those paired with neutral sounds (gray). Congruent and incongruent trials did not significantly differ from each other. Error bars represent standard error. *p < 0.05, ***p < 0.001.
FIGURE 7
Effect of trial type on location accuracy for environmental sounds. Locations of visual objects initially paired with congruent (white) environmental sounds were remembered with significantly greater accuracy than those paired with neutral (gray) or incongruent (black) sounds. Neutral and incongruent trials did not significantly differ from each other. Error bars represent standard error. ***p < 0.001.
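As a concrete illustration of the encoding-trial timing summarized in the Figure 2 caption (a 200 ms fixation cross followed by a 1,000 ms concurrent auditory cue and quadrant picture), here is a minimal sketch assuming PsychoPy. The paper does not specify the presentation software, and the window settings, placeholder stimuli, and quadrant coordinates below are hypothetical.

```python
# Minimal sketch of the encoding-trial timing in Figure 2, assuming PsychoPy.
import random
from psychopy import core, sound, visual

win = visual.Window(size=(1024, 768), color="white", units="norm")
quadrant_centers = [(-0.5, 0.5), (0.5, 0.5), (-0.5, -0.5), (0.5, -0.5)]

fixation = visual.TextStim(win, text="+", color="black")
# Placeholder for a line drawing; the study presented picture files (e.g., a cat).
picture = visual.TextStim(win, text="[cat drawing]", color="black",
                          pos=random.choice(quadrant_centers))
# Placeholder auditory cue; real trials played a spoken word, an environmental
# sound, a pseudoword, or a tonal beep, all spatially uninformative.
cue = sound.Sound(440, secs=1.0)

# 200 ms central fixation cross
fixation.draw()
win.flip()
core.wait(0.2)

# Concurrent auditory cue and picture in one of four quadrants for 1,000 ms,
# after which the next trial begins.
cue.play()
picture.draw()
win.flip()
core.wait(1.0)

win.close()
```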
