Deep neural networks rival the representation of primate IT cortex for core visual object recognition
- PMID: 25521294
- PMCID: PMC4270441
- DOI: 10.1371/journal.pcbi.1003963
Abstract
The primate visual system achieves remarkable visual object recognition performance even in brief presentations, and under changes to object exemplar, geometric transformations, and background variation (a.k.a. core visual object recognition). This remarkable performance is mediated by the representation formed in inferior temporal (IT) cortex. In parallel, recent advances in machine learning have led to ever higher performing models of object recognition using artificial deep neural networks (DNNs). It remains unclear, however, whether the representational performance of DNNs rivals that of the brain. To accurately produce such a comparison, a major difficulty has been the lack of a unifying metric that accounts for experimental limitations, such as the amount of noise, the number of neural recording sites, and the number of trials, as well as computational limitations, such as the complexity of the decoding classifier and the number of classifier training examples. In this work, we perform a direct comparison that corrects for these experimental limitations and computational considerations. As part of our methodology, we propose an extension of "kernel analysis" that measures the generalization accuracy as a function of representational complexity. Our evaluations show that, unlike previous bio-inspired models, the latest DNNs rival the representational performance of IT cortex on this visual object recognition task. Furthermore, we show that models that perform well on measures of representational performance also perform well on measures of representational similarity to IT, and on measures of predicting individual IT multi-unit responses. Whether these DNNs rely on computational mechanisms similar to the primate visual system is yet to be determined, but, unlike all previous bio-inspired models, that possibility cannot be ruled out merely on representational performance grounds.
Conflict of interest statement
The authors have declared that no competing interests exist.
Figures
Kernel analysis curves for the model representations. Precision, the generalization accuracy, is plotted against complexity, the inverse of the regularization parameter. Shaded regions indicate the standard deviation of the measurement over image set randomizations; these regions are often smaller than the line thickness. The Zeiler & Fergus 2013, Krizhevsky et al. 2012, and HMO models are all hierarchical deep neural networks. HMAX is a model of the ventral visual stream, and the V1-like and V2-like models attempt to replicate response properties of visual areas V1 and V2, respectively. These analyses indicate that the task we are measuring proves difficult for the V1-like and V2-like models, which barely move from 0.0 precision at all levels of complexity. The HMAX model, which has previously been shown to perform relatively well on object recognition tasks, performs only marginally better. Each of the remaining deep neural network models performs drastically better, with the Zeiler & Fergus 2013 model performing best at all levels of complexity. These results indicate that the visual object recognition task we evaluate is computationally challenging for all but the latest deep neural networks.
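To make the precision-versus-complexity curve concrete, here is a minimal sketch of a kernel-analysis-style evaluation, assuming a Gaussian kernel over model features and kernel ridge regression onto binary object labels. The function name, the train/test split, and the regularizer grid are illustrative choices, not the authors' exact procedure.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import train_test_split

def kernel_analysis_curve(features, labels, lambdas):
    """Return (complexity, precision) pairs: precision is held-out accuracy of
    kernel ridge regression, and complexity is the inverse regularizer 1/lambda."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.5, random_state=0)
    curve = []
    for lam in lambdas:
        model = KernelRidge(alpha=lam, kernel="rbf")
        model.fit(X_tr, y_tr)
        # Threshold the real-valued regression output to score binary accuracy.
        accuracy = np.mean((model.predict(X_te) > 0.5) == y_te)
        curve.append((1.0 / lam, accuracy))
    return curve

# Toy example: 80 illustrative features, binary labels, weak-to-strong regularizers.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 80))
y = (X[:, 0] > 0).astype(float)
for complexity, precision in kernel_analysis_curve(X, y, [10.0, 1.0, 0.1, 0.01]):
    print(f"complexity={complexity:5.1f}  precision={precision:.3f}")
```

Sweeping the regularizer from strong to weak traces out accuracy as a function of representational complexity, which is the quantity plotted in the curves above.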
Kernel analysis results after correcting for experimental limitations. Noise-matched model representations are indicated by one symbol for noise matched to the multi-unit IT cortex sample and by another for noise matched to the single-unit IT cortex sample. To correct for sampling bias, the multi-unit analysis uses 80 samples, either 80 neural multi-units from V4 or IT cortex or 80 features from the model representations, and the single-unit analysis uses 40 samples. To correct for experimental and intrinsic neural noise, we added noise to the subsampled model representations (no additional noise is added to the neural representations) commensurate with the noise observed in the IT measurements. Note that we observed similar noise levels in the V4 and IT cortex samples, so we do not attempt to correct the V4 cortex sample for the noise observed in the IT cortex sample. We observed substantially higher noise levels in IT single-unit recordings than in multi-unit recordings, due both to higher trial-to-trial variability and to fewer trials than were collected for the multi-unit recordings. All model representations suffer decreases in accuracy after correcting for sampling and adding noise (compare absolute precision values to Fig. 2). All three deep neural networks perform significantly better than the V4 cortex sample. For the multi-unit analysis (A), the IT cortex sample achieves high precision and is only matched in performance by the Zeiler & Fergus 2013 representation. For the single-unit analysis (B), both the Krizhevsky et al. 2012 and the Zeiler & Fergus 2013 representations surpass the IT representational performance.
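A minimal sketch of the two corrections described above, assuming the sampling correction is a random draw of features and the noise correction is additive Gaussian noise scaled per feature; the noise_scale parameter stands in for the noise level estimated from the IT recordings and is a placeholder, not the authors' fitted value.

```python
import numpy as np

def subsample_and_noise_match(features, n_samples=80, noise_scale=1.0, seed=0):
    """features: (n_images, n_features) model responses.
    Returns an (n_images, n_samples) noisy subsample."""
    rng = np.random.default_rng(seed)
    # Sampling correction: keep a random subset of features, mirroring the
    # limited number of recorded sites (80 multi-units or 40 single-units).
    idx = rng.choice(features.shape[1], size=n_samples, replace=False)
    sub = features[:, idx]
    # Noise correction (assumed form): add Gaussian noise scaled to each
    # feature's variability, standing in for trial-to-trial neural noise.
    sigma = noise_scale * sub.std(axis=0, keepdims=True)
    return sub + rng.normal(loc=0.0, scale=sigma, size=sub.shape)
```

Applying this transformation to a model representation before running kernel analysis puts the model and the neural sample on a comparable footing, which is why the absolute precision values drop relative to Fig. 2.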
Representational performance as a function of the number of samples, for models noise-matched to the multi-unit measurements (A, indicated by one symbol) or to the single-unit measurements (B, indicated by another). For the multi-unit analysis, the Zeiler & Fergus 2013 representation rivals the IT cortex representation over our measured sample. For the single-unit analysis, the Krizhevsky et al. 2012 representation rivals the IT cortex representation for low numbers of features and slightly surpasses it for higher numbers of features; the Zeiler & Fergus 2013 representation surpasses the IT cortex representation over our measured sample.
Generalization accuracy on the object recognition task for noise-matched model representations, indicated by the corresponding symbol. Both model and neural representations are subsampled to 80 multi-unit samples or 80 features. Mirroring the results using kernel analysis, the IT cortex multi-unit sample achieves high generalization accuracy and is only matched in performance by the Zeiler & Fergus 2013 representation.
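To illustrate how such a generalization-accuracy number can be computed, here is a minimal sketch using a cross-validated linear classifier on an 80-feature subsample; the choice of LinearSVC, the 5-fold cross-validation, and the parameter values are assumptions for illustration, not the authors' exact classifier protocol.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def decoding_accuracy(features, object_labels, n_samples=80, seed=0):
    """Cross-validated object classification accuracy on a feature subsample."""
    rng = np.random.default_rng(seed)
    # Match the neural sample size: keep only n_samples features.
    idx = rng.choice(features.shape[1], size=n_samples, replace=False)
    clf = LinearSVC(C=1.0, max_iter=10000)
    # Mean held-out accuracy across folds approximates generalization accuracy.
    return cross_val_score(clf, features[:, idx], object_labels, cv=5).mean()
```

Running this on the IT multi-unit sample and on each subsampled, noise-matched model representation yields the side-by-side accuracies summarized in the caption above.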
