Proc Natl Acad Sci U S A. 2021 Feb 23;118(8):e2011417118.
doi: 10.1073/pnas.2011417118.

An ecologically motivated image dataset for deep learning yields better models of human vision

Johannes Mehrer et al. Proc Natl Acad Sci U S A. 2021.

Abstract

Deep neural networks provide the current best models of visual information processing in the primate brain. Drawing on work from computer vision, the most commonly used networks are pretrained on data from the ImageNet Large Scale Visual Recognition Challenge. This dataset comprises images from 1,000 categories, selected to provide a challenging testbed for automated visual object recognition systems. Moving beyond this common practice, we here introduce ecoset, a collection of >1.5 million images from 565 basic-level categories selected to better capture the distribution of objects relevant to humans. Ecoset categories were chosen to be both frequent in linguistic usage and concrete, thereby mirroring important physical objects in the world. We test the effects of training on this ecologically more valid dataset using multiple instances of two neural network architectures: AlexNet and vNet, a novel architecture designed to mimic the progressive increase in receptive field sizes along the human ventral stream. We show that training on ecoset leads to significant improvements in predicting representations in human higher-level visual cortex and perceptual judgments, surpassing the previous state of the art. Significant and highly consistent benefits are demonstrated for both architectures on two separate functional magnetic resonance imaging (fMRI) datasets and behavioral data, jointly covering responses to 1,292 visual stimuli from a wide variety of object categories. These results suggest that computational visual neuroscience may take better advantage of the deep learning framework by using image sets that reflect the human perceptual and cognitive experience. Ecoset and trained network models are openly available to the research community.

Keywords: computational neuroscience; computer vision; deep neural networks; ecological relevance; human visual system.

Conflict of interest statement

The authors declare no competing interest.

Figures

Fig. 1.
Ecoset overview. (A) Flow diagram depicting the steps taken during dataset creation. This includes category selection and curation as well as image processing (search/download, duplicate removal, and label-cleaning procedures). (B) Example images from the six categories with FCI (shown in decreasing order from left to right). (C) Superordinate category overview. (D) Distribution of the number of images per category. (E) Distribution of image sizes (log-transformed width and height).
Fig. 2.
Training on ecoset rather than ILSVRC 2012 improves the alignment between DNN representations and human HVC as well as with human perceptual similarity judgments. (A) Data for fMRI dataset 1. (A, middle row) Benefits of training on ecoset held for both architectures tested (AlexNet, shown in red, as well as vNet, shown in blue). The lower bound of the noise ceiling is shown as the lower edge of the gray bar; stars indicate significant differences at P < 0.01, Bonferroni corrected for the number of network layers. To estimate statistical significance, each network instance of a given architecture was correlated with data from each human participant. To summarize the performance of a network instance, the average match across all human individuals was computed. Based on these data, permutation tests were performed comparing network instances trained on either ecoset or ILSVRC. Error bars indicate 95% CI across network instances (see Materials and Methods for further details). (A, bottom row) Benefits of training on ecoset persist when controlling for the number of images and the number of categories in the two training datasets. (B) Effects obtained for fMRI dataset 1 replicate in a separate fMRI dataset (dataset 2). (C) DNNs trained on ecoset also exhibit better alignment with human perceptual similarity judgments (behavioral dataset, ecoset-trained network shown in black, ILSVRC 2012 in gray). (D) The model fit between HVC and human behavior exhibits a strong positive relationship (data for various vNet network layers shown as data points).
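The statistical comparison described in this caption can be illustrated with a minimal sketch. It assumes that each network instance has already been reduced to one score (its average match across participants) and that the test shuffles the training-set labels of the instances. The exact permutation scheme is given in the paper's Materials and Methods, not here, so the function names `permutation_test` and `bonferroni`, the number of permutations, and the add-one correction are all illustrative assumptions.

```python
import random
from statistics import mean


def permutation_test(scores_a, scores_b, n_permutations=10000, seed=0):
    """Two-sided permutation test on the difference of group means.

    scores_a / scores_b: one score per network instance (hypothetical input,
    e.g. the participant-averaged correlation for one layer).
    """
    rng = random.Random(seed)
    observed = mean(scores_a) - mean(scores_b)
    pooled = list(scores_a) + list(scores_b)
    n_a = len(scores_a)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle group membership and recompute the mean difference.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


def bonferroni(p_value, n_tests):
    """Bonferroni correction, e.g. with n_tests = number of network layers."""
    return min(1.0, p_value * n_tests)
```

For a layer-wise analysis, one would call `permutation_test` once per layer and pass each p-value through `bonferroni` with the number of layers before comparing it to the 0.01 threshold.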
Fig. 3.
Comparing ecoset-trained DNNs to the state of the art. (A) Target RDMs from human HVC shown together with RDMs extracted from various deep neural network models (best layer selected for each, with dataset 1 on the left and dataset 2 on the right). (B) Agreement with human HVC plotted against model parametric complexity. vNet and AlexNet v2, both trained on ecoset, significantly outperform state-of-the-art DNN models pretrained on ILSVRC 2012 (DenseNet-169, VGG-19, and the original AlexNet). Error bars shown in blue and red indicate 95% CI.
Fig. 4.
vNet design and statistical procedures. (A) The vNet architecture was designed such that the effective kernel sizes across its layers approximate the progressive increase in average RF sizes in the central 3° of visual angle along human ventral stream areas. (B) To compare the representations learned by DNNs and the ones found in human HVC, all network instances were shown the same stimuli as the human observers to extract their activation patterns. Based on these patterns, RDMs were computed, one per layer and network instance. These dissimilarity matrices were then compared to the HVC RDMs of each individual participant using Spearman’s correlation. We used the average of the individual participant correlations to estimate the predictive performance of a given network instance and layer (see section Statistical Comparisons between Human IT and DNN Representations for details). The data noise ceiling was computed by comparing individual participant RDMs to the average RDM of all remaining participants, again using a Spearman’s correlation.
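The procedure in panel B can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' code: the caption does not say which dissimilarity measure fills the RDMs, so correlation distance (1 minus Pearson's r) is assumed here, and all function names (`rdm`, `spearman`, `model_fit`, `noise_ceiling_lower`) are hypothetical. RDMs are stored as flat lists holding the upper triangle only.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def rdm(patterns):
    """Upper triangle of an RDM; patterns holds one activation vector per stimulus.

    Assumption: dissimilarity is correlation distance (1 - Pearson's r).
    """
    pairs = combinations(range(len(patterns)), 2)
    return [1.0 - pearson(patterns[i], patterns[j]) for i, j in pairs]


def model_fit(model_rdm, participant_rdms):
    """Average Spearman correlation between one model RDM and each participant RDM."""
    scores = [spearman(model_rdm, r) for r in participant_rdms]
    return sum(scores) / len(scores)


def noise_ceiling_lower(participant_rdms):
    """Each participant RDM vs. the mean RDM of all remaining participants."""
    scores = []
    for i, held_out in enumerate(participant_rdms):
        others = [r for j, r in enumerate(participant_rdms) if j != i]
        mean_rdm = [sum(cell) / len(cell) for cell in zip(*others)]
        scores.append(spearman(held_out, mean_rdm))
    return sum(scores) / len(scores)
```

In use, `rdm` would be applied to the activations of one layer of one network instance, and `model_fit` would then be evaluated per layer and instance against the participants' HVC RDMs.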
