J Neurosci. 2023 Mar 8;43(10):1731-1741. doi: 10.1523/JNEUROSCI.1424-22.2022. Epub 2023 Feb 9.

Deep Neural Networks and Visuo-Semantic Models Explain Complementary Components of Human Ventral-Stream Representational Dynamics

Kamila M Jozwik et al. J Neurosci. 2023.

Abstract

Deep neural networks (DNNs) are promising models of the cortical computations supporting human object recognition. However, despite their ability to explain a significant portion of variance in neural data, the agreement between models and brain representational dynamics is far from perfect. We address this issue by asking which representational features are currently unaccounted for in neural time series data, estimated for multiple areas of the ventral stream via source-reconstructed magnetoencephalography data acquired in human participants (nine females, six males) during object viewing. We focus on the ability of visuo-semantic models, consisting of human-generated labels of object features and categories, to explain variance beyond the explanatory power of DNNs alone. We report a gradual reversal in the relative importance of DNN versus visuo-semantic features as ventral-stream object representations unfold over space and time. Although lower-level visual areas are better explained by DNN features starting early in time (at 66 ms after stimulus onset), higher-level cortical dynamics are best accounted for by visuo-semantic features starting later in time (at 146 ms after stimulus onset). Among the visuo-semantic features, object parts and basic categories drive the advantage over DNNs. These results show that a significant component of the variance unexplained by DNNs in higher-level cortical dynamics is structured and can be explained by readily nameable aspects of the objects. We conclude that current DNNs fail to fully capture dynamic representations in higher-level human visual cortex and suggest a path toward more accurate models of ventral-stream computations.

Significance Statement

When we view objects such as faces and cars in our visual environment, their neural representations dynamically unfold over time at a millisecond scale. These dynamics reflect the cortical computations that support fast and robust object recognition. DNNs have emerged as a promising framework for modeling these computations but cannot yet fully account for the neural dynamics. Using magnetoencephalography data acquired in human observers during object viewing, we show that readily nameable aspects of objects, such as 'eye', 'wheel', and 'face', can account for variance in the neural dynamics over and above DNNs. These findings suggest that DNNs and humans may in part rely on different object features for visual recognition and provide guidelines for model improvement.

Keywords: categories; features; object recognition; recurrent deep neural networks; source-reconstructed MEG data; vision.

Figures

Figure 1.
Schematic overview of approach: stimulus set, models, data, and model fitting. a, Stimulus set. Stimuli are 92 colored images of real-world objects spanning a range of categories, including humans, nonhuman animals, natural objects, and manmade objects. b, Visuo-semantic models and DNNs. Visuo-semantic models consist of human-generated labels of object features and categories for the 92 images. Example labels are shown for the dog face encircled in a. DNNs are feedforward and locally recurrent CORnet architectures trained with category supervision on the ILSVRC database. These architectures are inspired by the processing stages of the primate ventral visual stream from V1 to IT. c, Object representations for each model. We characterized object representations by computing RDMs. We computed one RDM per model dimension, that is, one for each visuo-semantic label or DNN layer. For each visuo-semantic model dimension, RDMs were computed by extracting the value for each image on that dimension and computing pairwise dissimilarities (squared difference) between the values. For each CORnet-Z and CORnet-R layer, RDMs were computed by extracting an activity pattern across model units for each image and computing pairwise dissimilarities (1 minus Spearman's r) between the activity patterns. d, Human source-reconstructed MEG data for an example participant. MEG data were acquired in 15 healthy adult human participants while they were viewing the 92 images (stimulus duration, 500 ms). We analyzed source-reconstructed data from three ROIs, V1–V3, V4t/LO, and IT/PHC. We computed an RDM for each participant, region, and time point. RDMs were computed by extracting an activity pattern for each image and computing pairwise dissimilarities (1 minus Pearson's r) between the activity patterns. e, Schematic overview of model fitting. We tested two model classes, a visuo-semantic model consisting of all category and feature RDMs and a DNN model consisting of all CORnet-Z and CORnet-R layer RDMs. The RDMs serve as model predictors. We first fit each model to the MEG RDMs for each participant, region, and time point, using cross-validated regularized linear regression. The cross-validated model predictions were then used in a second-level GLM approach to estimate the variance explained by each model separately and by both models combined.
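The RDM construction described in panel c can be sketched in a few lines of code. The snippet below is an illustrative reconstruction, not the authors' analysis code: the use of NumPy/SciPy, the variable names, and the toy array shapes are assumptions made for the example. It shows the three dissimilarity measures named in the caption (squared difference for a visuo-semantic dimension, 1 minus Spearman's r for a DNN layer, 1 minus Pearson's r for MEG source patterns).

```python
# Illustrative sketch of the three RDM types described in panel c.
# Array shapes and the NumPy/SciPy implementation are assumptions for this example.
import numpy as np
from scipy.stats import spearmanr

n_images = 92

def rdm_from_label(values):
    """RDM for one visuo-semantic dimension: squared difference between
    the per-image values on that dimension."""
    values = np.asarray(values, dtype=float)      # shape: (n_images,)
    diff = values[:, None] - values[None, :]
    return diff ** 2                              # shape: (n_images, n_images)

def rdm_from_dnn_layer(activations):
    """RDM for one CORnet layer: 1 minus Spearman's r between the
    unit-activation patterns evoked by each pair of images."""
    rho, _ = spearmanr(activations, axis=1)       # activations: (n_images, n_units)
    return 1.0 - rho

def rdm_from_meg(patterns):
    """RDM for one participant, ROI, and time point: 1 minus Pearson's r
    between source-level activity patterns."""
    return 1.0 - np.corrcoef(patterns)            # patterns: (n_images, n_sources)

# Toy usage with random data standing in for labels, layer activations, and MEG patterns.
rng = np.random.default_rng(0)
rdm_parts = rdm_from_label(rng.random(n_images))
rdm_layer = rdm_from_dnn_layer(rng.random((n_images, 1024)))
rdm_meg = rdm_from_meg(rng.random((n_images, 300)))
```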
Figure 2.
DNNs better explain lower-level visual representations, and visuo-semantic models better explain higher-level visual representations. a, Variance explained by the DNNs (green) and visuo-semantic models (blue) in the source-reconstructed MEG data. Top, Significant variance explained is indicated by green and blue points (one-sided Wilcoxon signed-rank test, p < 0.05 corrected). Significant differences between models in variance explained are indicated by gray points (two-sided Wilcoxon signed-rank test, p < 0.05 corrected). Lighter colors indicate individually significant time points, and darker colors indicate time points that additionally satisfy a continuity criterion (minimally 20 ms of consecutive significant time points). The shaded area around the lines shows the SEM across participants. The x-axis shows time relative to stimulus onset. The gray horizontal bar on the x-axis indicates the stimulus duration. b, Unique variance explained by the DNNs and visuo-semantic models in the source-reconstructed MEG data. To estimate the unique variance explained by each model class, we used a second-level GLM approach. Unique variance explained was computed by subtracting the variance explained by a reduced GLM (excluding the model class of interest) from the variance explained by a full GLM (including both model classes). Conventions are the same as in a.
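The full-versus-reduced GLM logic behind panel b can be illustrated with a short sketch. This is a simplified stand-in for the authors' second-level GLM: ordinary least squares and R² as the variance measure are assumptions made for the example, and the toy inputs stand in for the vectorized MEG RDMs and the cross-validated model predictions computed per participant, region, and time point.

```python
# Illustrative sketch of unique variance via a full vs. reduced GLM.
# OLS and R^2 are simplifying assumptions for this example only.
import numpy as np

def r_squared(y, X):
    """Variance explained by a GLM with predictors X (intercept added here)."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid.var() / y.var()

def unique_variance(y, preds_of_interest, preds_other):
    """Unique variance of one model class: variance explained by the full GLM
    (both classes) minus that of the reduced GLM (class of interest left out)."""
    full = r_squared(y, np.column_stack([preds_of_interest, preds_other]))
    reduced = r_squared(y, preds_other)
    return full - reduced

# Toy usage: y stands in for a vectorized MEG RDM (lower triangle of the 92 x 92 matrix);
# the predictor columns stand in for cross-validated DNN and visuo-semantic predictions.
rng = np.random.default_rng(0)
n_pairs = 92 * 91 // 2
y = rng.random(n_pairs)
pred_dnn = rng.random((n_pairs, 1))
pred_vsem = rng.random((n_pairs, 1))
uv_dnn = unique_variance(y, pred_dnn, pred_vsem)
uv_vsem = unique_variance(y, pred_vsem, pred_dnn)
```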
Figure 3.
Object parts and basic categories contribute to the unique variance explained by visuo-semantic models in higher-level visual representations. a, Variance explained by the object features (color, texture, shape, object parts), categories (subordinate, basic, superordinate), and deep neural networks in the source-reconstructed MEG data. Conventions are the same as in Figure 2a. b, Unique variance explained by the object features, categories, and deep neural networks in the source-reconstructed MEG data. Conventions are the same as in Figure 2b.
Figure 4.
DNNs and visuo-semantic models explain complementary components of human ventral-stream representational dynamics. To summarize our findings, we computed a model difference score based on the results shown in Figure 2b. We subtracted the unique variance explained by the visuo-semantic models from that explained by the DNNs in the dynamic ventral-stream representations. Difference scores are shown for each ROI during the first 600 ms of stimulus processing. Results show a gradual reversal in the relative importance of DNN versus visuo-semantic features in explaining the visual representations as they unfold over space and time. Between 66 and 128 ms after stimulus onset, DNNs outperform visuo-semantic models in lower-level areas V1–V3 (gray line, positive deflection). This early time window is thought to be dominated by feedforward and local recurrent processing. In contrast, starting 146 ms after stimulus onset, visuo-semantic models outperform DNNs in higher-level visual areas IT/PHC (red line, negative deflection). The same pattern of complementary contributions of DNNs and visuo-semantic models seems to reappear during the late phase of the response, starting ∼400 ms after stimulus onset, when responses may reflect interactions between visual areas. These results show that DNNs fail to account for a significant component of variance in higher-level cortical dynamics, which is instead accounted for by visuo-semantic features, in particular object parts and basic categories. The peak of visuo-semantic model performance in higher-level areas (red vertical line) precedes the peak in intermediate areas (blue vertical line). This sequence of events aligns with the timing of possible feedback information flow from higher-level to intermediate areas (light gray rectangle and arrow) as reported in Kietzmann et al. (2019b). The shaded area around the lines shows the SEM across participants.

