Proc Natl Acad Sci U S A. 2021 Feb 23;118(8):e2011417118.

doi: 10.1073/pnas.2011417118.

An ecologically motivated image dataset for deep learning yields better models of human vision

Johannes Mehrer¹, Courtney J Spoerer¹, Emer C Jones¹, Nikolaus Kriegeskorte², Tim C Kietzmann³ ⁴

Affiliations

¹ MRC Cognition and Brain Sciences Unit, University of Cambridge, CB2 7EF Cambridge, United Kingdom.
² Department of Psychology, Zuckerman Institute, Columbia University, New York, NY 10027.
³ MRC Cognition and Brain Sciences Unit, University of Cambridge, CB2 7EF Cambridge, United Kingdom; t.kietzmann@donders.ru.nl.
⁴ Donders Institute for Brain, Cognition and Behaviour, Radboud University, 6525 XZ Nijmegen, Netherlands.

PMID: 33593900
PMCID: PMC7923360
DOI: 10.1073/pnas.2011417118

Johannes Mehrer et al. Proc Natl Acad Sci U S A. 2021.

Abstract

Deep neural networks provide the current best models of visual information processing in the primate brain. Drawing on work from computer vision, the most commonly used networks are pretrained on data from the ImageNet Large Scale Visual Recognition Challenge. This dataset comprises images from 1,000 categories, selected to provide a challenging testbed for automated visual object recognition systems. Moving beyond this common practice, we here introduce ecoset, a collection of >1.5 million images from 565 basic-level categories selected to better capture the distribution of objects relevant to humans. Ecoset categories were chosen to be both frequent in linguistic usage and concrete, thereby mirroring important physical objects in the world. We test the effects of training on this ecologically more valid dataset using multiple instances of two neural network architectures: AlexNet and vNet, a novel architecture designed to mimic the progressive increase in receptive field sizes along the human ventral stream. We show that training on ecoset leads to significant improvements in predicting representations in human higher-level visual cortex and perceptual judgments, surpassing the previous state of the art. Significant and highly consistent benefits are demonstrated for both architectures on two separate functional magnetic resonance imaging (fMRI) datasets and behavioral data, jointly covering responses to 1,292 visual stimuli from a wide variety of object categories. These results suggest that computational visual neuroscience may take better advantage of the deep learning framework by using image sets that reflect the human perceptual and cognitive experience. Ecoset and trained network models are openly available to the research community.

Keywords: computational neuroscience; computer vision; deep neural networks; ecological relevance; human visual system.

Conflict of interest statement

The authors declare no competing interest.

Figures

**Fig. 1.**
Ecoset overview. (A) Flow diagram depicting the steps taken during dataset creation. This includes category selection and curation as well as image processing (search/download, duplicate removal, and label-cleaning procedures). (B) Example images from the six categories with FCI (shown in decreasing order from left to right). (C) Superordinate category overview. (D) Distribution of the number of images per category. (E) Distribution of image sizes (log-transformed width and height).

**Fig. 2.**
Training on ecoset rather than ILSVRC 2012 improves the alignment between DNN representations and human HVC as well as with human perceptual similarity judgments. (A) Data for fMRI dataset 1. (A, middle row) Benefits of training on ecoset were true for both architectures tested (AlexNet, shown in red, as well as vNet, shown in blue). Lower bound of the noise ceiling shown as the lower edge of the gray bar, stars indicate significant differences at P < 0.01, Bonferroni corrected for the number of network layers. To estimate statistical significance, each network instance of a given architecture was correlated with data from each human participant. To summarize the performance of a network instance, the average match across all human individuals was computed. Based on these data, permutation tests were performed comparing network instances trained on either ecoset or ILSVRC. Error bars indicate 95% CI across network instances (see *Materials and Methods* for further details). (A, bottom row) Benefits of training on ecoset persist when controlling for the number of images and the number of categories in the two training datasets. (B) Effects obtained for fMRI dataset 1 replicate in a separate fMRI dataset (dataset 2). (C) DNNs trained on ecoset also exhibit better alignment with human perceptual similarity judgments (behavioral dataset, ecoset-trained network shown in black, ILSVRC 2012 in gray). (D) The model fit between HVC and human behavior exhibits a strong positive relationship (data for various vNet network layers shown as data points).
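The comparison between training sets described in this caption can be illustrated with a minimal sketch. It assumes each network instance has already been reduced to one score (its average match across participants), and it uses an exact two-sided permutation test on the difference in group means. The choice of test statistic, the two-sided form, the function names, and the example scores are all assumptions made for illustration, not details taken from the paper.

```python
from itertools import combinations


def mean(values):
    return sum(values) / len(values)


def exact_permutation_p(scores_a, scores_b):
    """Two-sided exact permutation p-value for a difference in group means.

    Every way of reassigning the pooled scores to two groups of the
    original sizes is enumerated (feasible only for small groups).
    """
    pooled = list(scores_a) + list(scores_b)
    n_a = len(scores_a)
    observed = abs(mean(scores_a) - mean(scores_b))
    hits = 0
    total = 0
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point rounding.
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
            hits += 1
    return hits / total


def bonferroni_significant(p_value, alpha, n_tests):
    """Compare a p-value against alpha divided by the number of tests."""
    return p_value < alpha / n_tests


# Hypothetical per-instance scores for one layer (placeholders, not real data).
ecoset_scores = [0.30, 0.31, 0.32, 0.33]
ilsvrc_scores = [0.20, 0.21, 0.22, 0.23]
p = exact_permutation_p(ecoset_scores, ilsvrc_scores)
# n_tests stands in for the number of network layers; 7 is a placeholder.
print(p, bonferroni_significant(p, alpha=0.01, n_tests=7))
```

With only four instances per group, the smallest attainable two-sided p-value is 2/70, so a corrected threshold of 0.01 cannot be reached; larger groups of network instances are needed for the test to have any resolution at that level.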

**Fig. 3.**
Comparing ecoset-trained DNNs to the state of the art. (A) Target RDMs from human HVC shown together with RDMs extracted from various deep neural network models (best layer selected for each with dataset 1 on the left and dataset 2 on the right). (B) Agreement with human HVC plotted against model parametric complexity. vNet and AlexNet v2, both trained on ecoset, significantly outperform state of the art DNN models pretrained on ILSVRC 2012 (DenseNet-169, VGG-19, and the original AlexNet). Error bars shown in blue and red indicate 95% CI.

**Fig. 4.**
vNet design and statistical procedures. (A) The vNet architecture was designed such that the effective kernel sizes across its layers approximate the progressive increase in average RF sizes in the central 3° of visual angle along human ventral stream areas. (B) To compare the representations learned by DNNs and the ones found in human HVC, all network instances were shown the same stimuli as the human observers to extract their activation patterns. Based on these patterns, RDMs were computed, one per layer and network instance. These dissimilarity matrices were then compared to the HVC RDMs of each individual participant using Spearman’s correlation. We used the average of the individual participant correlations to estimate the predictive performance of a given network instance and layer (see section *Statistical Comparisons between Human IT and DNN Representations* for details). The data noise ceiling was computed by comparing individual participant RDMs to the average RDM of all remaining participants, again using a Spearman’s correlation.
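The procedure in panel B can be sketched as follows, using only the Python standard library. The sketch assumes that dissimilarity is measured as 1 minus the Pearson correlation between activation patterns (the caption does not state the distance measure) and that the noise ceiling's lower bound is the mean of the leave-one-participant-out correlations. All function names are hypothetical.

```python
from itertools import combinations


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    return pearson(ranks(x), ranks(y))


def rdm_upper(patterns):
    """Upper-triangle RDM entries for a list of patterns, one per stimulus.

    Assumption: dissimilarity is 1 - Pearson r between two patterns.
    """
    pairs = combinations(range(len(patterns)), 2)
    return [1.0 - pearson(patterns[i], patterns[j]) for i, j in pairs]


def model_fit(model_rdm, participant_rdms):
    """Average Spearman correlation between one model RDM and each participant RDM."""
    scores = [spearman(model_rdm, p) for p in participant_rdms]
    return sum(scores) / len(scores)


def noise_ceiling_lower(participant_rdms):
    """Mean correlation of each participant RDM with the average of the others."""
    scores = []
    for i, held_out in enumerate(participant_rdms):
        rest = [p for j, p in enumerate(participant_rdms) if j != i]
        rest_mean = [sum(col) / len(rest) for col in zip(*rest)]
        scores.append(spearman(held_out, rest_mean))
    return sum(scores) / len(scores)
```

Given one RDM per layer and network instance, `model_fit` yields the per-instance score that is then compared across training sets, and `noise_ceiling_lower` gives the reference level against which those scores are read.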

Grants and funding

BB/M011194/1/BB_/Biotechnology and Biological Sciences Research Council/United Kingdom
