Sci Adv. 2021 May 28;7(22):eabe7547. doi: 10.1126/sciadv.abe7547.

Cortical response to naturalistic stimuli is largely predictable with deep neural networks


Meenakshi Khosla et al.

Abstract

Naturalistic stimuli, such as movies, activate a substantial portion of the human brain, invoking a response shared across individuals. Encoding models that predict neural responses to arbitrary stimuli can be very useful for studying brain function. However, existing models focus on limited aspects of naturalistic stimuli, ignoring the dynamic interactions of modalities in this inherently context-rich paradigm. Using movie-watching data from the Human Connectome Project, we build group-level models of neural activity that incorporate several inductive biases about neural information processing, including hierarchical processing, temporal assimilation, and auditory-visual interactions. We demonstrate how incorporating these biases leads to remarkable prediction performance across large areas of the cortex, beyond the sensory-specific cortices into multisensory sites and frontal cortex. Furthermore, we illustrate that encoding models learn high-level concepts that generalize to task-bound paradigms. Together, our findings underscore the potential of encoding models as powerful tools for studying brain function in ecologically valid conditions.


Figures

Fig. 1. Schematic of the proposed models.
(A) The short-duration (1 s) auditory and visual models take a single image or spectrogram as input, extract multiscale hierarchical features, and feed them into a convolutional neural network (CNN)–based response model to predict the whole-brain response. (B) The long-duration (20-s) unimodal models take a sequence of images or spectrograms as input, feed their hierarchical features into a recurrent pathway, and extract the last hidden state representation for the response model. (C) The short-duration multimodal model combines unimodal features and passes them into the response model. (D) The long-duration multimodal model combines auditory and visual representations from the recurrent pathways for whole-brain prediction. Architectural details, including the feature extractor and convolutional response model, are provided in the Supplementary Materials.
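
The caption describes the architectures only at block level; the details live in the Supplementary Materials. As a rough illustration of pipeline (A), here is a minimal PyTorch sketch of a 1-s visual encoder: a frozen pretrained backbone supplies multiscale hierarchical features, and a learned response model maps them to a whole-brain response. Every concrete choice below (ResNet-18 backbone, layer sizes, a pooled linear readout standing in for the authors' convolutional response model, the voxel count) is an assumption for illustration, not the paper's implementation.

```python
# Minimal sketch of a short-duration (1-s) visual encoding model (illustrative only).
import torch
import torch.nn as nn
import torchvision.models as models

class VisualEncoder(nn.Module):
    def __init__(self, n_voxels: int):
        super().__init__()
        backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        for p in backbone.parameters():          # frozen, pretrained feature extractor
            p.requires_grad = False
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        self.stages = nn.ModuleList([backbone.layer1, backbone.layer2,
                                     backbone.layer3, backbone.layer4])
        self.pool = nn.AdaptiveAvgPool2d(1)
        # Multiscale features (ResNet-18 stage widths: 64+128+256+512 = 960) -> voxels.
        self.response = nn.Sequential(nn.Linear(960, 1024), nn.ReLU(),
                                      nn.Linear(1024, n_voxels))

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        x = self.stem(frame)
        feats = []
        for stage in self.stages:                # hierarchical, multiscale features
            x = stage(x)
            feats.append(self.pool(x).flatten(1))
        return self.response(torch.cat(feats, dim=1))

model = VisualEncoder(n_voxels=90000)            # hypothetical whole-brain voxel count
pred = model(torch.randn(8, 3, 224, 224))        # -> (8, 90000)
```

The long-duration variants in (B) and (D) would wrap this feature extractor in a recurrent pathway over a sequence of frames and feed its last hidden state to the response model.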
Fig. 2. Regional predictive accuracy for the test movie.
(A and C to F) Quantitative evaluation metrics for all the proposed models across major groups of regions as identified in the HCP MMP parcellation (B). Predictive accuracy of all models is summarized across (A) auditory, (C) visual, (D) multisensory, (E) language, and (F) frontal areas. Box plots depict quartiles, and swarmplots depict mean prediction accuracy of every ROI in the group. For language areas (group 4), left and right hemisphere ROIs are shown as separate points in the swarmplot because of marked differences in prediction accuracy. Statistical significance tests are performed to compare 1-s and 20-s models of the same modality (three comparisons; results are indicated with horizontal bars below the box plots) or unimodal against multimodal models of the same duration (four comparisons; results are indicated with horizontal bars above the box plots) using the paired t test (P < 0.05, Bonferroni corrected) on mean prediction accuracy within ROIs of each group.
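
The tests described here are paired t tests on per-ROI mean prediction accuracy with a Bonferroni correction for the number of comparisons in the family. A minimal SciPy sketch of one such comparison; the accuracy arrays are hypothetical placeholders, one value per ROI in a group:

```python
# Paired t test on per-ROI mean prediction accuracy, Bonferroni-corrected (sketch).
import numpy as np
from scipy import stats

acc_1s = np.array([0.21, 0.18, 0.25, 0.30, 0.22])    # e.g., audio 1-s model, 5 ROIs
acc_20s = np.array([0.29, 0.26, 0.31, 0.33, 0.28])   # e.g., audio 20-s model, same ROIs

n_comparisons = 3          # e.g., the three same-modality duration comparisons
t, p = stats.ttest_rel(acc_20s, acc_1s)
p_corrected = min(p * n_comparisons, 1.0)            # Bonferroni: scale p by test count
print(f"t = {t:.2f}, corrected p = {p_corrected:.4f}, "
      f"significant: {p_corrected < 0.05}")
```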
Fig. 3. Model prediction accuracy in standard brain space.
The left panel depicts the predictive accuracy of the unimodal (A and B) and multimodal (C) models over the whole brain for the test movie. Colors on the brain surface indicate the Pearson correlation coefficient between the predicted and measured time series at each voxel, normalized by the noise ceiling (D) computed on repeated validation clips. Only significantly predicted voxels [P < 0.05, false discovery rate (FDR) (59) corrected] are colored. ROI box plots depict the un-normalized correlation coefficients between the predicted and measured responses of voxels in each ROI, alongside the respective noise ceiling for the mean. (E) Percentage of voxels in the stimulus-driven cortex that are significantly predicted by each model, and mean prediction accuracy across the stimulus-driven cortex.
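
The surface maps are built voxelwise: a Pearson correlation between predicted and measured time series, divided by the per-voxel noise ceiling, with FDR correction determining which voxels are displayed. A sketch of that computation, assuming precomputed (time × voxel) arrays and a per-voxel noise-ceiling vector (all names are placeholders):

```python
# Voxelwise accuracy: Pearson r (predicted vs. measured), normalized by the noise
# ceiling, kept only where FDR-corrected p < 0.05 (sketch with placeholder inputs).
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

def voxelwise_accuracy(predicted, measured, noise_ceiling, alpha=0.05):
    n_vox = measured.shape[1]
    r = np.empty(n_vox)
    p = np.empty(n_vox)
    for v in range(n_vox):
        r[v], p[v] = stats.pearsonr(predicted[:, v], measured[:, v])
    reject, _, _, _ = multipletests(p, alpha=alpha, method="fdr_bh")  # BH FDR
    return np.where(reject, r / noise_ceiling, np.nan)  # mask non-significant voxels
```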
Fig. 4. Influence of temporal history on encoding performance.
(A) Mean predictive performance of the audio 1-s and audio 20-s models in early auditory and auditory association cortex ROIs. A major boost in encoding performance is seen across auditory association regions with the 20-s model. (B) Mean predictive performance of the visual 1-s and visual 20-s models across ROIs in the dorsal, ventral, and MT+ regions. Dorsal stream and MT+ ROIs exhibit a significant improvement with the visual 20-s model, but no effect is observed for the ventral stream. Box plots are overlaid on the beeswarm plots to depict quartiles. Horizontal bars indicate significant differences between models in mean prediction accuracy within ROIs of each stream (paired t test, P < 0.05).
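
The 1-s versus 20-s contrast comes down to how much stimulus history the model sees per predicted time point. Purely as an illustration, a sketch of assembling 20-s input windows, assuming one frame per second aligned one-to-one with fMRI samples (the paper's actual frame rate, alignment, and padding may differ):

```python
# Build 20-s stimulus windows per fMRI sample (sketch; assumes 1 frame/s and
# one fMRI sample/s, edge-padded at the start of the movie).
import numpy as np

def make_windows(frames: np.ndarray, window: int = 20) -> np.ndarray:
    """frames: (T, H, W, C) per-second movie frames.
    Returns (T, window, H, W, C); sample t sees the preceding `window` seconds."""
    T = frames.shape[0]
    padded = np.concatenate([np.repeat(frames[:1], window - 1, axis=0), frames], axis=0)
    return np.stack([padded[t : t + window] for t in range(T)], axis=0)

windows = make_windows(np.random.rand(300, 112, 112, 3))  # -> (300, 20, 112, 112, 3)
```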
Fig. 5. Sensitivity of ROIs to different sensory inputs.
(A) Predictive accuracy (R) of the audiovisual encoding model with and without input distortions. (B) Sensory sensitivity index of different brain regions, determined from performance metrics under input distortion (see the Supplementary Materials for details). Regions dominated by a single modality are shown in darker colors, whereas lighter-colored regions are better predicted by a combination of auditory and visual information. Red indicates auditory-dominant regions, whereas blue indicates visual dominance.
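
The exact definition of the sensory sensitivity index is given in the paper's Supplementary Materials; the sketch below is one plausible form, not the authors' formula. It contrasts the drop in prediction accuracy when the auditory input is distorted against the drop when the visual input is distorted, yielding a signed, normalized index per region:

```python
# One plausible sensory sensitivity index (assumption: the paper's exact
# definition may differ). Inputs are per-region prediction accuracies (r).
import numpy as np

def sensitivity_index(r_intact, r_audio_distorted, r_video_distorted, eps=1e-8):
    """Index in [-1, 1]: +1 = purely auditory-driven, -1 = purely visual-driven,
    near 0 = both modalities contribute comparably."""
    drop_audio = np.clip(r_intact - r_audio_distorted, 0, None)  # loss without audio
    drop_video = np.clip(r_intact - r_video_distorted, 0, None)  # loss without video
    return (drop_audio - drop_video) / (drop_audio + drop_video + eps)
```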
Fig. 6. Encoding models as virtual brain activity synthesizers.
(A) Synthetic contrasts are generated from trained encoding models by contrasting their “synthesized” (i.e., predicted) responses to different stimulus types. (B) Comparison of the synthesized contrast for “speech” against the speech association template on Neurosynth, both thresholded to keep the top 5, 10, or 15% most activated vertices. (C) and (D) compare the synthesized contrasts for “faces” and “places” against the corresponding contrasts derived from HCP tfMRI experiments, both thresholded to keep the top 5, 10, or 15% most activated vertices. Vertices activated only in the synthetic contrast map or only in the reference map are shown in red and blue, respectively, whereas yellow indicates their overlap. Corresponding Dice scores are displayed alongside the surface maps. Distributions of Dice overlap scores between the synthetic map and all 86 HCP tfMRI contrast maps are shown as histograms at each threshold level. The red arrow points to the Dice overlap between the synthetic contrast and the HCP tfMRI contrast for the same condition. In all cases, the synthetic contrast exhibits the highest agreement with the tfMRI contrast that it was generated to predict.
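
The Dice comparison itself is straightforward: each map is thresholded to its top-k% most activated vertices and the overlap is scored as 2|A∩B|/(|A|+|B|). A minimal sketch (the vertex count and random inputs are placeholders, e.g., the 59,412 cortical vertices of the HCP 32k surface):

```python
# Dice overlap between a synthesized contrast and a reference map, both
# thresholded to their top-k% most activated vertices (sketch).
import numpy as np

def dice_top_k(map_a: np.ndarray, map_b: np.ndarray, top_pct: float = 5.0) -> float:
    """Binarize each map at its own top `top_pct` percentile, then compute Dice."""
    a = map_a >= np.percentile(map_a, 100 - top_pct)
    b = map_b >= np.percentile(map_b, 100 - top_pct)
    return 2 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

# e.g., compare a synthetic "speech" contrast against a reference template
score = dice_top_k(np.random.rand(59412), np.random.rand(59412), top_pct=5.0)
```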


References

    1. Varoquaux G., Poldrack R. A., Predictive models avoid excessive reductionism in cognitive neuroimaging. Curr. Opin. Neurobiol. 55, 1–6 (2019).
    2. Yamins D. L. K., Hong H., Cadieu C. F., Solomon E. A., Seibert D., DiCarlo J. J., Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proc. Natl. Acad. Sci. U.S.A. 111, 8619–8624 (2014).
    3. Kay K. N., Naselaris T., Prenger R. J., Gallant J. L., Identifying natural images from human brain activity. Nature 452, 352–355 (2008).
    4. Wen H., Shi J., Zhang Y., Lu K.-H., Cao J., Liu Z., Neural encoding and decoding with deep learning for dynamic natural vision. Cereb. Cortex 28, 4136–4160 (2018).
    5. Güçlü U., van Gerven M. A. J., Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream. J. Neurosci. 35, 10005–10014 (2015).
