PLoS Comput Biol. 2014 Nov 6;10(11):e1003915.
doi: 10.1371/journal.pcbi.1003915. eCollection 2014 Nov.

Deep supervised, but not unsupervised, models may explain IT cortical representation

Seyed-Mahdi Khaligh-Razavi et al. PLoS Comput Biol. 2014.
Abstract

Inferior temporal (IT) cortex in human and nonhuman primates serves visual object recognition. Computational object-vision models, although continually improving, do not yet reach human performance. It is unclear to what extent the internal representations of computational models can explain the IT representation. Here we investigate a wide range of computational model representations (37 in total), testing their categorization performance and their ability to account for the IT representational geometry. The models include well-known neuroscientific object-recognition models (e.g. HMAX, VisNet) along with several models from computer vision (e.g. SIFT, GIST, self-similarity features, and a deep convolutional neural network). We compared the representational dissimilarity matrices (RDMs) of the model representations with the RDMs obtained from human IT (measured with fMRI) and monkey IT (measured with cell recording) for the same set of stimuli (not used in training the models). Better performing models were more similar to IT in that they showed greater clustering of representational patterns by category. In addition, better performing models also more strongly resembled IT in terms of their within-category representational dissimilarities. Representational geometries were significantly correlated between IT and many of the models. However, the categorical clustering observed in IT was largely unexplained by the unsupervised models. The deep convolutional network, which was trained by supervision with over a million category-labeled images, reached the highest categorization performance and also best explained IT, although it did not fully explain the IT data. Combining the features of this model with appropriate weights and adding linear combinations that maximize the margin between animate and inanimate objects and between faces and other objects yielded a representation that fully explained our IT data. Overall, our results suggest that explaining IT requires computational features trained through supervised learning to emphasize the behaviorally important categorical divisions prominently reflected in IT.
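
Concretely, the RDMs compared throughout this study can be computed in a few lines. The sketch below (Python; the array shapes and random data are illustrative placeholders, not the study's measurements) builds a correlation-distance RDM from a stimulus-by-channel response matrix:

    import numpy as np

    def rdm(responses):
        # Correlation-distance RDM: 1 - Pearson r between the response
        # patterns (rows) of every pair of stimuli.
        return 1.0 - np.corrcoef(responses)

    # Illustrative placeholder data: 92 stimuli x 500 channels
    # (voxels or recorded units), not the study's measurements.
    rng = np.random.default_rng(0)
    patterns = rng.standard_normal((92, 500))
    d = rdm(patterns)  # 92 x 92, symmetric, zero diagonal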


Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1. Representational dissimilarity matrices for IT and for the seven best-fitting not-strongly-supervised models.
The IT RDMs (black frames) for human (A) and monkey (B) and the seven most highly correlated model RDMs (excluding the representations in the strongly supervised deep convolutional network). The model RDMs are ordered from left to right and top to bottom by their correlation with the respective IT RDM. These are the seven most highly correlated RDMs among the 27 models that were not strongly supervised and their combination model (combi27). Biologically motivated models are in black, computer-vision models are in gray. The number below each RDM is the Kendall τA correlation coefficient between the model RDM and the respective IT RDM. All correlations are statistically significant. For statistical inference, see Figure 2. For model abbreviations and RDM-correlation p values, see Table 1. For other brain ROIs (i.e. LOC, PPA, FFA, EVC), see Figure S1 and Table 1. The RDMs here are 96×96, including the four stimuli we did not have monkey data for. The corresponding rows and columns are shown in blue in the mIT RDM and were ignored in the RDM comparisons.
Figure 2. The not-strongly-supervised models fail to fully explain the IT data.
The bars show the Kendall-τA RDM correlations between the not-strongly-supervised models and IT for human (A) and monkey (B). The error bars are standard errors of the mean estimated by bootstrap resampling of the stimuli. Asterisks indicate significant RDM correlations (random permutation test based on 10,000 randomizations of the stimulus labels; ns: not significant, p<0.05: *, p<0.01: **, p<0.001: ***, p<0.0001: ****). Most models explain a small but significant portion of the variance of the IT representational geometry. The noise ceiling (gray bar) indicates the expected correlation of the true model (given the noise in the data); its upper and lower edges are upper- and lower-bound estimates of the maximum correlation any model can achieve given the noise. None of the not-strongly-supervised models reaches the noise ceiling. The noise ceiling could not be estimated for mIT, because the available data were from only two animals. Models with the subscript ‘UT’ are unsupervised-trained, models with the subscript ‘ST’ are supervised-trained, and models without a subscript are untrained. Note that the supervised models included here were “weakly supervised”, i.e. trained with a small number (884) of category-labeled images. Biologically motivated models are set in black font, and computer-vision models are set in gray font.
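
As an illustration of the statistics in this caption, the sketch below computes a Kendall RDM correlation and a stimulus-label randomization test. It is a reconstruction under two stated assumptions, not the authors' toolbox code: SciPy's kendalltau computes τ-b, which equals τA only when there are no tied dissimilarities, and the permutation count is a parameter (10,000 in the paper).

    import numpy as np
    from scipy.stats import kendalltau

    def upper_tri(rdm):
        # Off-diagonal upper triangle of an RDM as a vector
        i, j = np.triu_indices(rdm.shape[0], k=1)
        return rdm[i, j]

    def rdm_corr(model_rdm, brain_rdm):
        # SciPy returns tau-b, which equals tau-A when there are no ties
        tau, _ = kendalltau(upper_tri(model_rdm), upper_tri(brain_rdm))
        return tau

    def label_permutation_test(model_rdm, brain_rdm, n_perm=10000, seed=0):
        # Randomize stimulus labels of one RDM to build a null distribution
        rng = np.random.default_rng(seed)
        observed = rdm_corr(model_rdm, brain_rdm)
        n = brain_rdm.shape[0]
        null = np.empty(n_perm)
        for k in range(n_perm):
            p = rng.permutation(n)
            null[k] = rdm_corr(model_rdm, brain_rdm[np.ix_(p, p)])
        return observed, (null >= observed).mean()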
Figure 3. IT-like categorical structure is not apparent in any of the not-strongly-supervised models.
Brain and model RDMs are shown in the left columns of each panel. We used a linear combination of category-cluster RDMs (Figure S5) to model the categorical structure (least-squares fit). The categories modeled were animate, inanimate, face, human face, non-human face, body, human body, non-human body, natural inanimates, and artificial inanimates. The fitted linear combination of category-cluster RDMs is shown in the middle columns. This descriptive visualization shows to what extent different categorical divisions are prominent in each RDM. The residual RDMs of the fits are shown in the right columns. For statistical inference, see Figure 4.
Figure 4. The not-strongly-supervised models are less categorical than IT.
Categoricality was measured using a categoricality index (vertical axis) for each model and brain RDM. The categoricality index is defined as the proportion of RDM variance explained by the category-cluster model (Figure S5), i.e. the squared correlation between the fitted category-cluster model and the RDM it is fitted to. Bars show the categoricality index for each of the not-strongly-supervised models. The blue (gray) line shows the categoricality index for hIT (mIT). Error bars show 95%-confidence intervals of the categoricality-index estimates for the models. The 95%-confidence intervals for hIT and mIT are shown by the blue and gray shaded regions, respectively. Significant categoricality indices are marked by stars underneath the bars (* p<0.05, ** p<0.01, *** p<0.001, **** p<0.0001). Error bars are based on bootstrap resampling of the stimulus set, and the p-values are obtained by a category-label randomization test. Significant differences between the categoricality indices of each model and hIT (inference by bootstrap resampling of the stimuli) are indicated by blue vertical arrows (p<0.05, Bonferroni-adjusted for 28 tests). The corresponding inferential comparisons for mIT are indicated by gray vertical arrows. Categoricality is significantly greater in hIT and mIT than in any of the 28 models. This analysis is based on equating the noise level in the models with that of hIT (Materials and Methods). Similar results obtain for a conservative inferential analysis comparing the categoricality of the noise-less models with that of the noisy estimates for hIT and mIT (Figure S9).
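
The categoricality index defined above lends itself to a compact sketch: fit a linear combination of category-cluster RDMs to the target RDM by least squares and report the squared correlation between fit and target. Below is a hedged reconstruction in which each predictor is a simple binary between-category RDM; the exact predictor construction used in the paper (Figure S5) may differ.

    import numpy as np

    def cluster_rdm(in_category):
        # Binary predictor RDM: 1 where exactly one of the two stimuli
        # belongs to the category, 0 otherwise
        m = np.asarray(in_category, dtype=bool)
        return (m[:, None] != m[None, :]).astype(float)

    def categoricality_index(rdm, cluster_rdms):
        # Proportion of RDM variance explained by the least-squares fit
        # of a linear combination of category-cluster RDMs
        i, j = np.triu_indices(rdm.shape[0], k=1)
        y = rdm[i, j]
        X = np.column_stack([c[i, j] for c in cluster_rdms] + [np.ones_like(y)])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return np.corrcoef(X @ beta, y)[0, 1] ** 2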
Figure 5. Remixing and reweighting features of the not-strongly-supervised models does not explain IT.
In order to build an IT-like representation, we attempted to remix the features to strengthen relevant categorical divisions. We trained three linear SVM classifiers (for animate/inanimate, face/nonface, and body/nonbody) on the combi27 features using 884 training images (separate from the set we had brain data for). RDMs for the resulting SVM decision values for the 92 images presented to humans and monkeys are shown at the top. The Kendall-τA RDM correlations with hIT and mIT are stated underneath the RDMs. The RDM correlations are low, but all three are statistically significant (p<0.05). We further attempted to create an IT-like representation as a reweighted combination of the models. We fitted one weight for each of the 27 not-strongly-supervised models, the combi27 model, and the three SVM decision values. The weights were fitted by non-negative least squares, so as to minimize the sum of squared deviations between the RDM of the weighted combination of the features and the hIT RDM. The resulting weights are shown in the second row. Error bars indicate 95%-confidence intervals obtained by bootstrap resampling of the stimulus set. The resulting IT-geometry-supervised RDM is shown at the bottom (center) in juxtaposition to hIT (left) and mIT (right). Importantly, the RDM was obtained by cross-validation to avoid overfitting to the image set (Materials and Methods). The RDMs here are 92×92, excluding the four stimuli that we did not have monkey data for.
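
The non-negative least-squares reweighting described here can be sketched as follows, with one simplifying assumption made explicit: for squared-Euclidean dissimilarities, reweighting feature sets is equivalent to forming a non-negative combination of their RDMs, so the sketch fits weights directly on the component RDMs (the paper fits weights on the features and cross-validates the result).

    import numpy as np
    from scipy.optimize import nnls

    def fit_rdm_weights(target_rdm, component_rdms):
        # Non-negative weights minimizing the sum of squared deviations
        # between the weighted-combination RDM and the target RDM
        i, j = np.triu_indices(target_rdm.shape[0], k=1)
        X = np.column_stack([m[i, j] for m in component_rdms])
        w, _ = nnls(X, target_rdm[i, j])
        combined = sum(wi * m for wi, m in zip(w, component_rdms))
        return w, combined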
Figure 6. RDMs of all layers of the strongly supervised deep convolutional network.
RDMs for all layers of the deep convolutional network (Krizhevsky et al. 2012) are shown for the set of 96 images (L1: layer 1 to L7: layer 7). Kendall-τA RDM correlations of the models with hIT and mIT are stated underneath each RDM. All correlations are statistically significant. For inferential comparisons to IT and other regions, see Figure 7 and Table 2, respectively.
Figure 7. The strongly supervised deep network, with features remixed and reweighted, fully explains the IT data.
The bars show the Kendall-τA RDM correlations between the layers of the strongly supervised deep convolutional network and human IT. The error bars are standard errors of the mean estimated by bootstrap resampling of the stimuli. Asterisks indicate significant RDM correlations (random permutation test based on 10,000 randomizations of the stimulus labels; p<0.05: *, p<0.01: **, p<0.001: ***, p<0.0001: ****). As we ascend the layers of the deep network, model RDMs explain increasing proportions of the variance of the hIT RDM. The noise ceiling (gray bar) indicates the expected correlation of the true model (given the noise in the data); its upper and lower edges are upper- and lower-bound estimates of the maximum correlation any model can achieve given the noise. None of the layers of the deep network reaches the noise ceiling. However, the final fully connected layers 6 and 7 come close to the ceiling. Remixing the features of layer 7 (Figure 10) using linear SVMs to strengthen the categorical divisions provides a representation composed of three discriminants (animate/inanimate, face/nonface, and body/nonbody) that reaches the noise ceiling. Reweighting the model layers and the three discriminants (see Figure 10 for details) yields a representation that explains the hIT geometry even better. A horizontal line over two bars indicates that the two models perform significantly differently (inference by bootstrap resampling of the stimulus set). Multiple testing across the many pairwise comparisons is accounted for by controlling the expected FDR at 0.05. The pairwise statistical comparisons show that the IT-geometry-supervised deep model explains IT significantly better than all other candidate representations.
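
The noise ceiling referred to in this caption and in Figure 2 can be estimated from single-subject RDMs. The sketch below follows the standard RSA procedure in simplified form (no iterative refinement), assuming a list of per-subject RDMs as input; it is an illustration, not the authors' exact estimator.

    import numpy as np
    from scipy.stats import kendalltau

    def noise_ceiling(subject_rdms):
        # Upper bound: mean correlation of each subject's RDM with the
        # group-average RDM (slightly optimistic). Lower bound: mean
        # correlation with the leave-one-subject-out average (pessimistic).
        idx = np.triu_indices(subject_rdms[0].shape[0], k=1)
        vecs = np.array([r[idx] for r in subject_rdms])
        grand = vecs.mean(axis=0)
        upper, lower = [], []
        for s in range(len(vecs)):
            loo = np.delete(vecs, s, axis=0).mean(axis=0)
            upper.append(kendalltau(vecs[s], grand)[0])
            lower.append(kendalltau(vecs[s], loo)[0])
        return float(np.mean(lower)), float(np.mean(upper))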
Figure 8. IT-like categorical structure emerges across the layers of the deep supervised model, culminating in the IT-geometry-supervised layer.
Descriptive category-clustering analysis as in Figure 3, but for the deep supervised network. We used a linear combination of category-cluster RDMs (Figure S5) to model the categorical structure. The fitted linear combination of category-cluster RDMs is shown in the middle columns. This descriptive visualization shows to what extent different categorical divisions are prominent in each layer of the deep supervised model. The layers show some of the categorical divisions emerging. However, remixing of the features (linear SVM readout) is required to emphasize the categorical divisions to a degree similar to IT. The final IT-geometry-supervised layer (weighted combination of layers and SVM discriminants) has a categorical structure that is very similar to IT. Overfitting to the image set was avoided by cross-validation. For statistical inference, see Figure 9.
Figure 9. The layers of the deep supervised model are less categorical than IT, but remixing and reweighting achieves IT-level categoricality.
Bars show the categoricality index for each layer of the deep convolutional network and for the IT-geometry-supervised layer. For conventions and for the definition of the categoricality index, see Figure 4. Error bars and shaded regions indicate 95%-confidence intervals. Significant categoricality indices are indicated by stars underneath the bars (* p<0.05, ** p<0.01, *** p<0.001, **** p<0.0001). Significant differences between the categoricality index of each model and the hIT categoricality index are indicated by blue vertical arrows (p<0.05, Bonferroni-adjusted for 9 tests). The corresponding inferential comparisons for mIT are indicated by gray vertical arrows. Categoricality is significantly greater in hIT and mIT than in any of the internal layers of the deep convolutional network. However, the IT-geometry-supervised layer (remixed and reweighted) achieves a categoricality similar to (and not significantly different from) that of IT. This analysis is based on equating the noise level in the models with that of hIT (Materials and Methods). Similar results obtain for a conservative inferential analysis comparing the categoricality of the noise-less models with that of the noisy estimates for hIT and mIT (Figure S10).
Figure 10. Remixing and reweighting features of the deep supervised network achieves an IT-like representational geometry.
All analyses and conventions here are analogous to Figure 5, but applied to the strongly supervised deep convolutional network rather than to the not-strongly-supervised models. Remixing the features of layer 7 by fitting linear SVMs (on a separate set of training images) for the major categorical divisions (animate/inanimate, face/nonface, and body/nonbody) helped account for the categorical clusters in IT. The Kendall-τA RDM correlations between the SVM decision values and IT (stated underneath the RDMs in the top row) are statistically significant (p<0.05). For the deep convolutional network used here, feature remixing accounted for the animate/inanimate division of IT. We attempted to create an IT-like representation as a reweighted combination of the layers of the deep network and the SVM decision values. We fitted one weight for each of the layers and one weight for each of the three decision values. The bar graph in the middle row shows the weights, with 95%-confidence intervals obtained by bootstrap resampling of the stimulus set. As before, the weights were fitted using non-negative least squares to minimize the sum of squared deviations between the RDM of the weighted combination and the hIT RDM. The resulting IT-geometry-supervised RDM (bottom row, center) is very similar to the RDMs of hIT (left) and mIT (right). The τA RDM correlation between the fitted model and IT is about equal for monkey IT (0.40) and human IT (0.38). Both of these RDM correlations are higher than the RDM correlation between hIT and mIT, reflecting the effect of noise on the empirical RDM estimates. As in Figure 5, the fitted model RDM was obtained by cross-validation to avoid overfitting to the image set.
Figure 11. Animate/inanimate categorization accuracy for all models.
Each dark blue bar shows the categorization accuracy of a linear SVM applied to one of the computational model representations. Categorization accuracy for each model was estimated by 12-fold cross-validation on the 96 stimuli. To assess whether categorization accuracy was above chance level, we performed a permutation test, in which we retrained the SVMs on 10,000 (category-orthogonalized) random dichotomies among the stimuli. Light blue bars show the average model categorization accuracy for random label permutations. Categorization performance was significantly greater than chance for most models (* p<0.05, ** p<0.01, *** p<0.001, **** p<0.0001). The deep convolutional network model (final fully connected layer 7) has the highest animate/inanimate categorization performance (96%). The combi27 model has the second-highest performance (76%).
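
A sketch of this accuracy estimate, with placeholders for the feature matrix and labels. One deliberate simplification: the null distribution below uses plain label permutations rather than the paper's category-orthogonalized random dichotomies, and 100 rounds instead of 10,000, to keep the example fast.

    import numpy as np
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.svm import LinearSVC

    # Illustrative placeholders: 96 stimuli with 1000 model features each
    # and binary animate/inanimate labels (not the study's data).
    rng = np.random.default_rng(0)
    features = rng.standard_normal((96, 1000))
    animate = rng.integers(0, 2, size=96)

    cv = StratifiedKFold(n_splits=12, shuffle=True, random_state=0)
    accuracy = cross_val_score(LinearSVC(), features, animate, cv=cv).mean()

    # Chance-level baseline by retraining on permuted labels
    # (100 rounds here for speed; 10,000 in the paper)
    null = np.array([
        cross_val_score(LinearSVC(), features,
                        rng.permutation(animate), cv=cv).mean()
        for _ in range(100)
    ])
    p_value = (null >= accuracy).mean()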
Figure 12. Model representations resembling IT afford better categorization accuracy.
A model's IT-resemblance (measured by the RDM correlation between IT and model) predicts its categorization accuracy (animate/inanimate). This holds for both human-IT resemblance (top) and monkey-IT resemblance (bottom). The substantial positive correlation between IT-resemblance and categorization accuracy could reflect the categorical clustering of IT (left panels). However, the within-category RDM correlation between a model and IT also predicts model categorization accuracy (right panels). Each panel shows the least-squares fit (gray line) and the Spearman rank correlation r (* p<0.05, ** p<0.01, *** p<0.001, **** p<0.0001). Each circle shows one of the models. Numbers indicate the model (see Table 1 for model numbering). Different layers of the deep supervised convolutional network are indicated by colored labels “L1” (layer 1) to “L7” (layer 7). The deep model's layers are color-coded from light blue to light red (from lower to higher layers). Computer vision models are shown by gray circles; biologically motivated models are shown by black circles. The transparent horizontal and vertical rectangles cover non-significant ranges along each axis.
