Sci Adv. 2021 May 28;7(22):eabe7547. doi: 10.1126/sciadv.abe7547.

Cortical response to naturalistic stimuli is largely predictable with deep neural networks


Meenakshi Khosla et al.

Abstract

Naturalistic stimuli, such as movies, activate a substantial portion of the human brain, invoking a response shared across individuals. Encoding models that predict neural responses to arbitrary stimuli can be very useful for studying brain function. However, existing models focus on limited aspects of naturalistic stimuli, ignoring the dynamic interactions of modalities in this inherently context-rich paradigm. Using movie-watching data from the Human Connectome Project, we build group-level models of neural activity that incorporate several inductive biases about neural information processing, including hierarchical processing, temporal assimilation, and auditory-visual interactions. We demonstrate how incorporating these biases leads to remarkable prediction performance across large areas of the cortex, beyond the sensory-specific cortices into multisensory sites and frontal cortex. Furthermore, we illustrate that encoding models learn high-level concepts that generalize to task-bound paradigms. Together, our findings underscore the potential of encoding models as powerful tools for studying brain function in ecologically valid conditions.


Figures

Fig. 1. Schematic of the proposed models.
(A) The short-duration (1 s) auditory and visual models take a single image or spectrogram as input, extract multiscale hierarchical features, and feed them into a convolutional neural network (CNN)–based response model to predict the whole-brain response. (B) The long-duration (20-s) unimodal models take a sequence of images or spectrograms as input, feed their hierarchical features into a recurrent pathway, and extract the last hidden state representation for the response model. (C) The short-duration multimodal model combines unimodal features and passes them into the response model. (D) The long-duration multimodal model combines auditory and visual representations from the recurrent pathways for whole-brain prediction. Architectural details, including the feature extractor and convolutional response model, are provided in the Supplementary Materials.
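
The caption describes the architectures only at block level; the details live in the Supplementary Materials. As a rough illustration of pipeline (A), here is a minimal PyTorch sketch of a 1-s visual encoder: a frozen pretrained backbone supplies multiscale hierarchical features, and a learned response model maps them to a whole-brain response. Every concrete choice below (ResNet-18 backbone, layer sizes, a pooled linear readout standing in for the authors' convolutional response model, the voxel count) is an assumption for illustration, not the paper's implementation.

```python
# Minimal sketch of a short-duration (1-s) visual encoding model (illustrative only).
import torch
import torch.nn as nn
import torchvision.models as models

class VisualEncoder(nn.Module):
    def __init__(self, n_voxels: int):
        super().__init__()
        backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        for p in backbone.parameters():          # frozen, pretrained feature extractor
            p.requires_grad = False
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        self.stages = nn.ModuleList([backbone.layer1, backbone.layer2,
                                     backbone.layer3, backbone.layer4])
        self.pool = nn.AdaptiveAvgPool2d(1)
        # Multiscale features (ResNet-18 stage widths: 64+128+256+512 = 960) -> voxels.
        self.response = nn.Sequential(nn.Linear(960, 1024), nn.ReLU(),
                                      nn.Linear(1024, n_voxels))

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        x = self.stem(frame)
        feats = []
        for stage in self.stages:                # hierarchical, multiscale features
            x = stage(x)
            feats.append(self.pool(x).flatten(1))
        return self.response(torch.cat(feats, dim=1))

model = VisualEncoder(n_voxels=90000)            # hypothetical whole-brain voxel count
pred = model(torch.randn(8, 3, 224, 224))        # -> (8, 90000)
```

The long-duration variants in (B) and (D) would wrap this feature extractor in a recurrent pathway over a sequence of frames and feed its last hidden state to the response model.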
Fig. 2. Regional predictive accuracy for the test movie.
(A and C to F) Quantitative evaluation metrics for all the proposed models across major groups of regions as identified in the HCP MMP parcellation (B). Predictive accuracy of all models is summarized across (A) auditory, (C) visual, (D) multisensory, (E) language, and (F) frontal areas. Box plots depict quartiles, and swarmplots depict mean prediction accuracy of every ROI in the group. For language areas (group 4), left and right hemisphere ROIs are shown as separate points in the swarmplot because of marked differences in prediction accuracy. Statistical significance tests are performed to compare 1-s and 20-s models of the same modality (three comparisons; results are indicated with horizontal bars below the box plots) or unimodal against multimodal models of the same duration (four comparisons; results are indicated with horizontal bars above the box plots) using the paired t test (P < 0.05, Bonferroni corrected) on mean prediction accuracy within ROIs of each group.
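
The tests described here are paired t tests on per-ROI mean prediction accuracy with a Bonferroni correction for the number of comparisons in the family. A minimal SciPy sketch of one such comparison; the accuracy arrays are hypothetical placeholders, one value per ROI in a group:

```python
# Paired t test on per-ROI mean prediction accuracy, Bonferroni-corrected (sketch).
import numpy as np
from scipy import stats

acc_1s = np.array([0.21, 0.18, 0.25, 0.30, 0.22])    # e.g., audio 1-s model, 5 ROIs
acc_20s = np.array([0.29, 0.26, 0.31, 0.33, 0.28])   # e.g., audio 20-s model, same ROIs

n_comparisons = 3          # e.g., the three same-modality duration comparisons
t, p = stats.ttest_rel(acc_20s, acc_1s)
p_corrected = min(p * n_comparisons, 1.0)            # Bonferroni: scale p by test count
print(f"t = {t:.2f}, corrected p = {p_corrected:.4f}, "
      f"significant: {p_corrected < 0.05}")
```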
Fig. 3. Model prediction accuracy in standard brain space.
The left panel depicts the predictive accuracy of the unimodal (A and B) and multimodal (C) models over the whole brain for the test movie. Colors on the brain surface indicate the Pearson correlation coefficient between the predicted and measured time series at each voxel, normalized by the noise ceiling (D) computed on repeated validation clips. Only significantly predicted voxels [P < 0.05, false discovery rate (FDR) (59) corrected] are colored. ROI box plots depict the un-normalized correlation coefficients between the predicted and measured responses of voxels in each ROI, alongside the respective noise ceiling for the mean. (E) Percentage of voxels in the stimulus-driven cortex that are significantly predicted by each model, and mean prediction accuracy across the stimulus-driven cortex.
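
The surface maps are built voxelwise: a Pearson correlation between predicted and measured time series, divided by the per-voxel noise ceiling, with FDR correction determining which voxels are displayed. A sketch of that computation, assuming precomputed (time × voxel) arrays and a per-voxel noise-ceiling vector (all names are placeholders):

```python
# Voxelwise accuracy: Pearson r (predicted vs. measured), normalized by the noise
# ceiling, kept only where FDR-corrected p < 0.05 (sketch with placeholder inputs).
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

def voxelwise_accuracy(predicted, measured, noise_ceiling, alpha=0.05):
    n_vox = measured.shape[1]
    r = np.empty(n_vox)
    p = np.empty(n_vox)
    for v in range(n_vox):
        r[v], p[v] = stats.pearsonr(predicted[:, v], measured[:, v])
    reject, _, _, _ = multipletests(p, alpha=alpha, method="fdr_bh")  # BH FDR
    return np.where(reject, r / noise_ceiling, np.nan)  # mask non-significant voxels
```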
Fig. 4. Influence of temporal history on encoding performance.
(A) Mean predictive performance of the audio 1-s and audio 20-s models in early auditory and auditory association cortex ROIs. A major boost in encoding performance is seen across auditory association regions with the 20-s model. (B) Mean predictive performance of the visual 1-s and visual 20-s models across ROIs in the dorsal, ventral, and MT+ regions. Dorsal stream and MT+ ROIs exhibit a significant improvement with the visual 20-s model, but no effect is observed for the ventral stream. Box plots are overlaid on the beeswarm plots to depict quartiles. Horizontal bars indicate significant differences between models in mean prediction accuracy within ROIs of each stream (paired t test, P < 0.05).
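
The 1-s versus 20-s contrast comes down to how much stimulus history the model sees per predicted time point. Purely as an illustration, a sketch of assembling 20-s input windows, assuming one frame per second aligned one-to-one with fMRI samples (the paper's actual frame rate, alignment, and padding may differ):

```python
# Build 20-s stimulus windows per fMRI sample (sketch; assumes 1 frame/s and
# one fMRI sample/s, edge-padded at the start of the movie).
import numpy as np

def make_windows(frames: np.ndarray, window: int = 20) -> np.ndarray:
    """frames: (T, H, W, C) per-second movie frames.
    Returns (T, window, H, W, C); sample t sees the preceding `window` seconds."""
    T = frames.shape[0]
    padded = np.concatenate([np.repeat(frames[:1], window - 1, axis=0), frames], axis=0)
    return np.stack([padded[t : t + window] for t in range(T)], axis=0)

windows = make_windows(np.random.rand(300, 112, 112, 3))  # -> (300, 20, 112, 112, 3)
```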
Fig. 5. Sensitivity of ROIs to different sensory inputs.
(A) Predictive accuracy (R) of the audiovisual encoding model with and without input distortions. (B) Sensory sensitivity index of different brain regions, determined from performance metrics under input distortion (see the Supplementary Materials for details). Regions dominated by a single modality are shown in darker colors, whereas lighter-colored regions are better predicted by a combination of auditory and visual information. Red indicates auditory-dominant regions, whereas blue indicates visual dominance.
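
The exact definition of the sensory sensitivity index is given in the paper's Supplementary Materials; the sketch below is one plausible form, not the authors' formula. It contrasts the drop in prediction accuracy when the auditory input is distorted against the drop when the visual input is distorted, yielding a signed, normalized index per region:

```python
# One plausible sensory sensitivity index (assumption: the paper's exact
# definition may differ). Inputs are per-region prediction accuracies (r).
import numpy as np

def sensitivity_index(r_intact, r_audio_distorted, r_video_distorted, eps=1e-8):
    """Index in [-1, 1]: +1 = purely auditory-driven, -1 = purely visual-driven,
    near 0 = both modalities contribute comparably."""
    drop_audio = np.clip(r_intact - r_audio_distorted, 0, None)  # loss without audio
    drop_video = np.clip(r_intact - r_video_distorted, 0, None)  # loss without video
    return (drop_audio - drop_video) / (drop_audio + drop_video + eps)
```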
Fig. 6. Encoding models as virtual brain activity synthesizers.
(A) Synthetic contrasts are generated from trained encoding models by contrasting their “synthesized” (i.e., predicted) responses to different stimulus types. (B) Comparison of the synthesized contrast for “speech” against the speech association template on Neurosynth, both thresholded to keep the top 5, 10, or 15% most activated vertices. (C) and (D) compare the synthesized contrasts for “faces” and “places” against the corresponding contrasts derived from HCP tfMRI experiments, both thresholded to keep the top 5, 10, or 15% most activated vertices. Vertices activated only in the synthetic contrast map or only in the reference map are shown in red and blue, respectively, whereas yellow indicates their overlap. Corresponding Dice scores are displayed alongside the surface maps. Distributions of Dice overlap scores between the synthetic map and all 86 HCP tfMRI contrast maps are shown as histograms at each threshold level. The red arrow points to the Dice overlap between the synthetic contrast and the HCP tfMRI contrast for the same condition. In all cases, the synthetic contrast exhibits the highest agreement with the tfMRI contrast that it was generated to predict.
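
The Dice comparison itself is straightforward: each map is thresholded to its top-k% most activated vertices and the overlap is scored as 2|A∩B|/(|A|+|B|). A minimal sketch (the vertex count and random inputs are placeholders, e.g., the 59,412 cortical vertices of the HCP 32k surface):

```python
# Dice overlap between a synthesized contrast and a reference map, both
# thresholded to their top-k% most activated vertices (sketch).
import numpy as np

def dice_top_k(map_a: np.ndarray, map_b: np.ndarray, top_pct: float = 5.0) -> float:
    """Binarize each map at its own top `top_pct` percentile, then compute Dice."""
    a = map_a >= np.percentile(map_a, 100 - top_pct)
    b = map_b >= np.percentile(map_b, 100 - top_pct)
    return 2 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

# e.g., compare a synthetic "speech" contrast against a reference template
score = dice_top_k(np.random.rand(59412), np.random.rand(59412), top_pct=5.0)
```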


References

    1. Varoquaux G., Poldrack R. A., Predictive models avoid excessive reductionism in cognitive neuroimaging. Curr. Opin. Neurobiol. 55, 1–6 (2019).
    2. Yamins D. L. K., Hong H., Cadieu C. F., Solomon E. A., Seibert D., DiCarlo J. J., Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proc. Natl. Acad. Sci. U.S.A. 111, 8619–8624 (2014).
    3. Kay K. N., Naselaris T., Prenger R. J., Gallant J. L., Identifying natural images from human brain activity. Nature 452, 352–355 (2008).
    4. Wen H., Shi J., Zhang Y., Lu K.-H., Cao J., Liu Z., Neural encoding and decoding with deep learning for dynamic natural vision. Cereb. Cortex 28, 4136–4160 (2018).
    5. Güçlü U., van Gerven M. A. J., Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream. J. Neurosci. 35, 10005–10014 (2015).
