Hum Brain Mapp. 2023 Apr 1;44(5):2018-2038. doi: 10.1002/hbm.26189. Epub 2023 Jan 13.

Neural representations of the perception of handwritten digits and visual objects from a convolutional neural network compared to humans


Juhyeon Lee et al. Hum Brain Mapp. 2023.

Abstract

We investigated neural representations for the visual perception of 10 handwritten digits and six visual objects in a convolutional neural network (CNN) and in humans using functional magnetic resonance imaging (fMRI). Once our CNN model was fine-tuned from a pre-trained VGG16 model to recognize the visual stimuli from the digit and object categories, representational similarity analysis (RSA) was conducted between the neural activations from fMRI and the feature representations from the CNN model across all 16 classes. The representations encoded by the CNN model exhibited the hierarchical topography of the human visual system. The feature representations in the lower convolutional (Conv) layers showed greater similarity with the neural representations in the early visual areas and parietal cortices, including the posterior cingulate cortex. The feature representations in the higher Conv layers were encoded in the higher-order visual areas, including the ventral/medial/dorsal streams and the middle temporal complex. The representations in the classification layers were observed mainly in the ventral-stream visual cortex (including the inferior temporal cortex), the superior parietal cortex, and the prefrontal cortex. The neural representations from the CNN model were surprisingly similar to those underlying human visual perception when digits were contrasted with objects, particularly in the primary visual and associated areas. The study also illustrates what is unique about human visual perception: unlike in the CNN model, the neural representation of digits and objects in humans is distributed more widely across the whole brain, including the frontal and temporal areas.
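
As a rough illustration of the fine-tuning step described above, here is a minimal PyTorch/torchvision sketch; the paper does not report its framework, hyperparameters, or data pipeline, so everything below (the optimizer, learning rate, and dummy batch) is an assumption rather than the authors' code.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained VGG16 and swap its 1000-way output head for a
# 16-way head covering the 10 digit and 6 object classes.
model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
model.classifier[6] = nn.Linear(model.classifier[6].in_features, 16)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # assumed settings

# Dummy batch standing in for the stimulus images; real code would iterate
# over a DataLoader of the digit/object images.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 16, (8,))

model.train()
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```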

Keywords: convolutional neural network; functional magnetic resonance imaging; handwritten digits; representational similarity analysis; visual objects; visual perception.


Conflict of interest statement

The authors have no conflicts of interest regarding this study, including financial, consultant, institutional, or other relationships. The sponsor was not involved in the study design, data collection, analysis or interpretation of the data, manuscript preparation, or the decision to submit for publication.

Figures

FIGURE 1
Visual perception of handwritten digits and visual objects under two representational similarity analysis (RSA) scenarios, using human brain activations, a convolutional neural network (CNN), and human visual category/class perception. (a) Construction of a neural representational dissimilarity matrix (RDM) from the neural activations measured via fMRI for the 10 handwritten digit and six object images. (b) Construction of (i) an RDM for each layer of the CNN trained to classify the 16 classes and (ii) an RDM that encodes the category/class visual perception of humans. (c) RSA comparing the neural RDM with (i) the RDM for the CNN model and (ii) the RDM for the category/class visual perception of humans. Fc, fully connected layer; Output, output layer.
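
To make the RDM construction and comparison in this figure concrete, here is a minimal sketch; the 1 − Pearson dissimilarity and the Spearman comparison are common RSA defaults rather than the paper's confirmed metrics, and the random arrays stand in for real fMRI and CNN activations.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr

def rdm(patterns):
    """(16 classes x features) activation patterns -> 16 x 16 RDM of 1 - Pearson r."""
    return squareform(pdist(patterns, metric="correlation"))

rng = np.random.default_rng(0)
neural = rng.normal(size=(16, 500))     # stand-in for per-class fMRI voxel patterns
features = rng.normal(size=(16, 4096))  # stand-in for one CNN layer's features

# Compare the two RDMs on their vectorized upper triangles.
iu = np.triu_indices(16, k=1)
rho, _ = spearmanr(rdm(neural)[iu], rdm(features)[iu])
print(f"RSA similarity (Spearman rho): {rho:.3f}")
```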
FIGURE 2
Experimental paradigm for fMRI data acquisition. Ten handwritten digits and six visual objects are presented as visual stimuli in the image modality trials. Sound waveforms corresponding to the visual stimuli are presented in the sound modality trials. In the image + sound modality trials, images and their corresponding sounds are presented together as a multimodal condition. Participants are instructed to press a button whenever they see and/or hear the digit “0,” to maintain alertness throughout the experiment. Please refer to Sections 2.4 and 2.5 for details.
FIGURE 3
Evaluation of the trained CNN model. (a) Estimated input patterns obtained using the activation maximization (AM) approach, in which each estimated input pattern indicates the most representative pattern for the corresponding convolutional (Conv) layer filter or fully connected (Fc) layer node. For visualization, Conv filters and Fc-layer nodes are randomly selected. (b) Estimated input patterns from AM applied to each of the 16 output nodes. Output, output layer.
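
Activation maximization itself is just gradient ascent on the input image. A minimal sketch for one output node, where the regularization, step count, and learning rate are chosen for illustration rather than taken from the paper (which applies AM to its fine-tuned model; a stock pretrained VGG16 is used here):

```python
import torch
from torchvision import models

model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()
target_node = 3  # hypothetical output node to visualize

x = torch.zeros(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([x], lr=0.05)

for _ in range(200):
    optimizer.zero_grad()
    activation = model(x)[0, target_node]
    # Ascend on the node's activation; a small L2 penalty keeps x well-behaved.
    loss = -activation + 1e-4 * x.norm()
    loss.backward()
    optimizer.step()

estimated_input = x.detach()  # the node's "most representative pattern"
```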
FIGURE 4
t‐SNE plots of the feature representations for each CNN layer across the 16 classes of handwritten digit and visual object images. Conv, convolutional layer; Fc, fully connected layer; Output, output layer.
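
A t-SNE projection like the one in this figure can be produced with scikit-learn; the feature matrix below is random stand-in data, and the perplexity is an assumed default rather than the paper's setting.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
features = rng.normal(size=(16 * 50, 4096))  # 50 exemplars per class, one layer
labels = np.repeat(np.arange(16), 50)        # class index of each row

embedded = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
# `embedded` (n_samples x 2) is then scatter-plotted, colored by `labels`.
```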
FIGURE 5
CNN layer assignment map and bar graphs indicating the similarity between the neural activations and feature representations for each of the CNN layers. (a) Layer assignment map across all CNN layers. (b) Assignment map for the lower Conv layers (i.e., Conv 1 and 2) and bar graphs of the similarity scores for the ROIs obtained from each of the CNN layers (significant cases, corrected p < 0.05 using 5000 random permutations, are color‐coded; error bars indicate the standard error across participants). (c) Assignment map and bar graphs of the similarity scores for the intermediate Conv layers. (d) Results for the higher Conv layers. (e) Results for the classification layers. A, anterior; CNN, convolutional neural network; Conv, convolutional layer; DLPFC, dorsolateral prefrontal cortex; Dorsal, dorsal stream visual cortex; EAC, early auditory cortex; EVC, early visual cortex; Fc, fully connected layer; I, inferior; IFC, inferior frontal cortex; IPC, inferior parietal cortex; L, left; M1/S1, primary motor cortex and primary somatosensory cortex; MCC, middle cingulate cortex; MPFC, medial prefrontal cortex; MT+/LOC, middle temporal (MT) complex and its neighboring visual areas including lateral occipital (LO) complex; MVOcC, medioventral occipital cortex; OFC, orbitofrontal cortex; Output, output layer; P, posterior; PCC, posterior cingulate cortex; POC/FOC, posterior opercular cortex and frontal opercular cortex; R, right; S, superior; SPC, superior parietal cortex; TPOJ/AAC, temporo‐parieto‐occipital junction and auditory association cortex; Ventral, ventral stream visual cortex.
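
The caption's significance test (5000 random permutations) can be sketched as follows; the similarity statistic and the permutation scheme (shuffling the class labels of one RDM) follow common RSA practice and are assumptions about, not a transcription of, the paper's pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
iu = np.triu_indices(16, k=1)

def rdm_similarity(a, b):
    """Pearson correlation of two RDMs' vectorized upper triangles."""
    return np.corrcoef(a[iu], b[iu])[0, 1]

# Symmetric stand-ins for a neural RDM and a CNN-layer RDM.
neural_rdm = rng.random((16, 16)); neural_rdm = (neural_rdm + neural_rdm.T) / 2
layer_rdm = rng.random((16, 16)); layer_rdm = (layer_rdm + layer_rdm.T) / 2

observed = rdm_similarity(neural_rdm, layer_rdm)
null = np.empty(5000)
for i in range(5000):
    perm = rng.permutation(16)  # shuffle the class labels of one RDM
    null[i] = rdm_similarity(neural_rdm, layer_rdm[np.ix_(perm, perm)])

p_value = (np.sum(null >= observed) + 1) / (null.size + 1)
```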
FIGURE 6
Regions of interest (ROIs) that encode the visual category/class perception of humans across five different conditions (Figure 1b). In the summary representation, when voxels overlap, mapping priority is given to the animacy, object, magnitude, digit, and digit-versus-object conditions, in that order. A, anterior; DLPFC, dorsolateral prefrontal cortex; Dorsal, dorsal stream visual cortex; EAC, early auditory cortex; EVC, early visual cortex; I, inferior; IFC, inferior frontal cortex; IPC, inferior parietal cortex; L, left; M1/S1, primary motor cortex and primary somatosensory cortex; MCC, middle cingulate cortex; MPFC, medial prefrontal cortex; MT+/LOC, middle temporal (MT) complex and its neighboring visual areas including lateral occipital (LO) complex; MVOcC, medioventral occipital cortex; NFA, number form area (Grotheer, Herrmann, et al., 2016); OFC, orbitofrontal cortex; P, posterior; PCC, posterior cingulate cortex; POC/FOC, posterior opercular cortex and frontal opercular cortex; pSTS, posterior superior temporal sulcus; R, right; S, superior; SPC, superior parietal cortex; TPOJ/AAC, temporo‐parieto‐occipital junction and auditory association cortex; Ventral, ventral stream visual cortex.
FIGURE 7
Cosine similarity between the RDMs for the CNN layers and the RDM for human visual perception across the 16 digit and object classes. The mean and standard deviation across participants are shown. CNN, convolutional neural network; Fc, fully connected layer; Output, output layer.
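
A brief sketch of the cosine-similarity comparison named here, using random stand-in RDMs; vectorizing the upper triangle is a common convention and an assumption about, not a confirmation of, the paper's exact computation.

```python
import numpy as np

rng = np.random.default_rng(0)
iu = np.triu_indices(16, k=1)

layer_vec = rng.random((16, 16))[iu]       # stand-in CNN-layer RDM, vectorized
perception_vec = rng.random((16, 16))[iu]  # stand-in human-perception RDM

cos = layer_vec @ perception_vec / (np.linalg.norm(layer_vec) * np.linalg.norm(perception_vec))
print(f"cosine similarity: {cos:.3f}")
```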
FIGURE 8
RSA using (i) the neural RDM within the ROIs identified from the visual category/class perception of humans and (ii) the RDM for each layer of the trained CNN model. Searchlight RSA is conducted for each of the voxels in the ROI. The voxels with positive t‐scores from group inference are summarized with their mean t‐score and standard error of the mean for each ROI. AAC, auditory association cortex; B, bilateral; CG, cingulate gyrus; CNN, convolutional neural network; DLPFC, dorsolateral prefrontal cortex; Dorsal, dorsal stream visual cortex; EAC, early auditory cortex; EVC, early visual cortex; Fc, fully connected layer; FOC, frontal opercular cortex; IFC, inferior frontal cortex; IPC, inferior parietal cortex; ITG, inferior temporal gyrus; L, left; M1/S1, primary motor cortex and primary somatosensory cortex; MCC, middle cingulate cortex; MPFC, medial prefrontal cortex; MT+/LOC, middle temporal (MT) complex and its neighboring visual areas including lateral occipital (LO) complex; MVOcC, medioventral occipital cortex; OFC, orbitofrontal cortex; OrG, orbital gyrus; Output, output layer; PCC, posterior cingulate cortex; POC, posterior opercular cortex; R, right; SFG, superior frontal gyrus; SPC, superior parietal cortex; STG, superior temporal gyrus; TPOJ, temporo‐parieto‐occipital junction; Ventral, ventral stream visual cortex.
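
Searchlight RSA iterates the RDM comparison over small neighborhoods of voxels. A minimal sketch, in which the array shapes, 6 mm radius, and metrics are all illustrative assumptions:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_classes, n_voxels = 16, 1000
data = rng.normal(size=(n_classes, n_voxels))    # class x voxel activation patterns
coords = rng.uniform(0, 50, size=(n_voxels, 3))  # voxel coordinates in mm
model_rdm_vec = rng.random(n_classes * (n_classes - 1) // 2)  # CNN-layer RDM, vectorized

scores = np.full(n_voxels, np.nan)
for v in range(n_voxels):
    sphere = np.linalg.norm(coords - coords[v], axis=1) < 6.0  # searchlight sphere
    if sphere.sum() < 2:
        continue
    neural_rdm_vec = pdist(data[:, sphere], metric="correlation")
    scores[v], _ = spearmanr(neural_rdm_vec, model_rdm_vec)
# `scores` is mapped back to the brain; group inference across participants then
# yields the per-ROI t-scores summarized in the figure.
```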
