Review. 2021 Dec;1505(1):55-78. doi: 10.1111/nyas.14593. Epub 2021 Mar 22.

From convolutional neural networks to models of higher-level cognition (and back again)

Ruairidh M Battleday et al. Ann N Y Acad Sci. 2021 Dec.

Abstract

The remarkable successes of convolutional neural networks (CNNs) in modern computer vision are by now well known, and they are increasingly being explored as computational models of the human visual system. In this paper, we ask whether CNNs might also provide a basis for modeling higher-level cognition, focusing on the core phenomena of similarity and categorization. The most important advance comes from the ability of CNNs to learn high-dimensional representations of complex naturalistic images, substantially extending the scope of traditional cognitive models that were previously only evaluated with simple artificial stimuli. In all cases, the most successful combinations arise when CNN representations are used with cognitive models that have the capacity to transform them to better fit human behavior. One consequence of these insights is a toolkit for the integration of cognitively motivated constraints back into CNN training paradigms in computer vision and machine learning, and we review cases where this leads to improved performance. A second consequence is a roadmap for how CNNs and cognitive models can be more fully integrated in the future, allowing for flexible end-to-end algorithms that can learn representations from data while still retaining the structured behavior characteristic of human cognition.

Keywords: categorization; cognitive modeling; convolutional neural networks; similarity; vision.

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
Overview. Traditional studies have used simple artificial stimuli that can be mathematically represented unambiguously as the substrate for models of higher‐level cognition (left pathway). CNNs can be used to supply representations for more complex naturalistic images, which can be further modified to better reflect human judgments before being input into the same kinds of cognitive model (middle pathways). End‐to‐end models offer the opportunity to solve both of these problems simultaneously and learn a representation for naturalistic stimuli that satisfies the constraints inherent in higher‐level cognitive models (right pathway).
Figure 2
A basic CNN architecture used for image classification. (A and B) Convolutional filters are moved across the activations of layers below, outputting feature maps. (C) A typical CNN architecture, based on AlexNet. From a 32 × 32 RGB image, the first convolutional block learns weights for 32 feature maps, followed by two max pooling layers. The output is a vector of category probabilities; the category with the maximum of these values is taken as the output label. For computer vision tasks, the filters that are learned in the first few layers typically correspond to more general, low‐level image features, such as oriented Gabors and colored blobs. The deeper layers tend to correspond to more task‐specific, high‐level features, such as faces or human or animal figures.
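The convolution-pooling-softmax pipeline described in this caption can be sketched in plain NumPy. Everything below (the image, the filter values, and the 10-way readout) is a random placeholder, not the trained AlexNet weights; the sketch only shows the shape bookkeeping.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a 2D filter across the image (stride 1, no padding), producing a feature map."""
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Downsample a feature map by taking the max over non-overlapping windows."""
    h, w = (fmap.shape[0] // size) * size, (fmap.shape[1] // size) * size
    return fmap[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

def softmax(z):
    """Convert raw scores into category probabilities."""
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32))                              # placeholder grayscale image
fmap = np.maximum(conv2d(image, rng.standard_normal((3, 3))), 0)   # ReLU'd 30x30 feature map
pooled = max_pool(fmap)                                            # 15x15 after 2x2 max pooling
probs = softmax(rng.standard_normal(10))                           # placeholder 10-way readout
label = int(probs.argmax())                                        # category with maximum probability
```

A real CNN stacks many such filters per layer and learns their weights by backpropagation; this sketch just makes the data flow concrete.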
Figure 3
Representative stimuli from seminal psychological studies of categorization. Top row: typical artificial stimuli representative of those used in traditional studies of cognition. Bottom row: the mathematical representations of these stimuli that are input into cognitive models. Reproduced from Ref.
Figure 4
Transforming CNN representations using similarity judgments. (A) Representations of images derived from human similarity judgments using MDS exhibit meaningful variation and segregation (left panel). Using MDS to examine the similarity structure of raw CNN representations shows they fail to capture these relationships (center panel). Improving the fit of CNN representations to human similarity judgments recovers this structure (right panel). (B) Dendrograms of image representations also display meaningful hierarchical categorical structure (top panel) that is not present in raw CNN representations (middle panel) but that is recovered by modifying them using the learned similarity transformation (bottom panel). Reproduced from Ref.
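MDS, as used in this figure to visualize similarity structure, can be illustrated with the classical (Torgerson) variant, which recovers coordinates from a distance matrix by eigendecomposition. The four collinear points below are an invented toy example, not data from the paper.

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical (Torgerson) MDS: embed items so Euclidean distances approximate D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered squared distances
    vals, vecs = np.linalg.eigh(B)             # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]         # keep the k largest
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))

X = np.array([[0.0], [1.0], [2.0], [3.0]])     # invented points on a line
D = np.abs(X - X.T)                            # their exact pairwise distance matrix
coords = classical_mds(D, k=2)                 # recovered embedding (second axis near zero)
```

Because the toy distances are exactly Euclidean, the embedding reproduces them; with human similarity judgments the fit is only approximate, which is where the learned transformations in this figure come in.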
Figure 5
Increasing the flexibility of the linear transformation of CNN representations improves fit to human similarity judgments. As the constraints on the transformation matrix are relaxed (see legend, bottom to top), the accuracy of model predictions increases. Increasing the number of principal components used to represent the compressed CNN representations also improves performance and widens the gaps between model subtypes. Error bars represent ± 1 SEM over five cross‐validation folds. Reproduced from Ref.
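One simple member of this family of linear transformations is a diagonal reweighting of feature dimensions, which is linear in the weights and can therefore be fit by least squares. Everything below (the features, the "true" weights, and the simulated similarity matrix) is synthetic; the paper's actual estimation setup is in the referenced work.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 5
F = rng.standard_normal((n, d))          # stand-ins for CNN feature vectors, one per image

# Hypothetical "human" similarity matrix generated by an unknown diagonal reweighting.
w_true = np.array([3.0, 2.0, 1.0, 0.0, 0.0])
S = (F * w_true) @ F.T                   # s_ij = sum_k w_k * f_ik * f_jk

# One regression row per image pair: predictors are elementwise feature products.
pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
A = np.array([F[i] * F[j] for i, j in pairs])
y = np.array([S[i, j] for i, j in pairs])

w_hat, *_ = np.linalg.lstsq(A, y, rcond=None)   # recovered per-dimension weights
```

Relaxing the constraint from a diagonal to a full matrix adds interactions between dimensions; that is the flexibility axis the figure sweeps over.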
Figure 6
Exploring the effect of dimensionality reduction on modeling similarity judgments. Top row: CNN representations for a number of image datasets were reduced using similarity judgments (left) or PCA (right). Performance using all representations and a simple dimensional reweighting is shown as a dashed line for each dataset. Middle row: fixing a small bottleneck size and projecting bottleneck representations using PCA allows interpretation of the information being encoded by the network and extracted for the similarity comparison. Bottom row: dendrograms for the animal dataset based on representations from bottleneck sizes of two and six show that the CNN representation and reduction capture similarity information in a hierarchical manner. H, herps; B, birds; P, primates; R, rodents; WC, wild cats; G, grazers; E, dogs, bears, and large animals. Reproduced from Ref.
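The PCA-based reduction in the right column can be sketched as an SVD of the centered representation matrix. The data below are random, with variance concentrated in the first dimensions by construction, purely to make the component ordering visible.

```python
import numpy as np

def pca_reduce(X, k):
    """Project rows of X onto its top-k principal components via SVD of the centered data."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:k].T                     # low-dimensional "bottleneck" representation
    variances = s[:k] ** 2 / (len(X) - 1)      # variance captured by each component
    return scores, variances

rng = np.random.default_rng(2)
# Random stand-in representations with variance concentrated in the first two dimensions.
X = rng.standard_normal((100, 8)) * np.array([5.0, 3.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])
Z, var = pca_reduce(X, k=2)                    # 2D bottleneck, components in variance order
```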
Figure 7
The feature basis used for modeling categorizations of natural stimuli affects overall model performance more than categorization strategy. Top row: two‐dimensional linear discriminant analysis projections of the representations from each computer vision method. The feature bases across the x‐axis roughly track the development of computer vision: raw pixels, hand‐engineered features (HOG), the latent space of a generative network that uses convolutions (BiGAN), and a basic (AlexNet) and more advanced (DenseNet) CNN. Bottom row: categorization models using different prototype and exemplar strategies were trained on each of these feature bases, with model flexibility being more obviously related to overall model performance than categorization strategy (i.e., prototype or exemplar). Baselines were provided by taking the softmax probabilities from the final CNN layer as the similarity measurement. Reproduced from Ref.
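The two categorization strategies compared in the bottom row can be sketched on toy features. `c_param` is a hypothetical sensitivity parameter in the exemplar model's exponential similarity function (in the style of GCM-like models); the 2D features stand in for CNN representations.

```python
import numpy as np

def prototype_predict(x, X, y):
    """Assign x to the class whose mean (prototype) is nearest."""
    classes = np.unique(y)
    dists = [np.linalg.norm(x - X[y == c].mean(axis=0)) for c in classes]
    return int(classes[int(np.argmin(dists))])

def exemplar_predict(x, X, y, c_param=1.0):
    """Assign x to the class with the greatest summed similarity to all its stored exemplars."""
    classes = np.unique(y)
    sims = [np.exp(-c_param * np.linalg.norm(x - X[y == c], axis=1)).sum() for c in classes]
    return int(classes[int(np.argmax(sims))])

# Invented 2D features standing in for CNN representations of two categories.
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
y = np.array([0, 0, 1, 1])
```

The prototype model summarizes each category by one point, while the exemplar model keeps every training item; the figure's point is that the feature basis they operate on matters more than this choice.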
Figure 8
Improving the generalization abilities of CNNs using human uncertainty. As test images come from increasingly out‐of‐training‐sample distributions, CNNs trained on soft labels derived from human uncertainty increasingly outperform their traditional hard‐label counterparts in terms of accuracy and loss. This distributional advantage over hard labels is also reflected in consistently better second‐best accuracy scores. Reproduced from Ref.
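Training on soft labels simply means using the full human label distribution as the cross-entropy target instead of a one-hot vector. The probabilities below are invented for illustration, not drawn from the paper's data.

```python
import numpy as np

def cross_entropy(target, pred, eps=1e-12):
    """H(target, pred): the loss minimized during classifier training."""
    return float(-np.sum(target * np.log(pred + eps)))

pred = np.array([0.6, 0.3, 0.1])          # model's predicted category probabilities

hard_label = np.array([1.0, 0.0, 0.0])    # one-hot: only the modal human response survives
soft_label = np.array([0.7, 0.25, 0.05])  # invented full human label distribution

loss_hard = cross_entropy(hard_label, pred)
loss_soft = cross_entropy(soft_label, pred)
```

Because the soft target spreads mass over plausible alternatives, it also rewards ranking near-miss categories highly, which is what the second-best accuracy metric in the figure tracks.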
Figure 9
Deep categorization models learn category‐specific stimulus embeddings. Top and middle rows: t‐distributed stochastic neighbor embeddings of representations from a deep prototype model (top) and a deep GMM with 25 centers (middle), with the locations of prototypes and subprototypes marked, respectively. Bottom row: performance of the GMM increases with the number of centers until around 10–25, then asymptotes. Reproduced from Ref.
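A GMM with multiple centers per category interpolates between a prototype model (one center) and an exemplar model (one center per stimulus). A minimal EM fit for a spherical-covariance mixture, on synthetic two-cluster data rather than the paper's embeddings, might look like:

```python
import numpy as np

def gmm_em(X, k, iters=50, seed=0):
    """Fit a spherical-covariance Gaussian mixture by EM; returns centers and responsibilities."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]   # initialize centers on data points
    var = np.full(k, X.var())
    pi = np.full(k, 1.0 / k)
    d = X.shape[1]
    for _ in range(iters):
        # E-step: responsibility of each center for each point (log-space for stability).
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)        # (n, k) squared distances
        log_p = np.log(pi) - 0.5 * d2 / var - 0.5 * d * np.log(var)
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update mixture weights, centers, and per-center variances.
        nk = r.sum(axis=0) + 1e-12
        pi = nk / len(X)
        mu = (r.T @ X) / nk[:, None]
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
        var = (r * d2).sum(axis=0) / (nk * d) + 1e-8
    return mu, r

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)),    # synthetic category with two subclusters
               rng.normal(4.0, 0.3, (50, 2))])
mu, r = gmm_em(X, k=2)                            # learned centers and soft assignments
```

The fitted centers play the role of the subprototypes marked in the middle row; increasing `k` is the sweep shown in the bottom row.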

References

    1. Krizhevsky, A., Sutskever, I. & Hinton, G.E. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 1097–1105.
    2. Baxter, J. 2000. A model of inductive bias learning. J. Artif. Intell. Res. 12: 149–198.
    3. Russakovsky, O. et al. 2015. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115: 211–252.
    4. Duta, I.C., Liu, L., Zhu, F. & Shao, L. 2020. Pyramidal convolution: rethinking convolutional neural networks for visual recognition. arXiv preprint arXiv:2006.11538.
    5. Lin, T.-Y. et al. 2014. Microsoft COCO: common objects in context. In European Conference on Computer Vision 740–755.
