Brain Sci. 2020 Sep 2;10(9):602. doi: 10.3390/brainsci10090602.

Study on Representation Invariances of CNNs and Human Visual Information Processing Based on Data Augmentation


Yibo Cui et al. Brain Sci. 2020.

Abstract

Representation invariance plays a significant role in the performance of deep convolutional neural networks (CNNs) and in human visual information processing across a variety of complicated image-based tasks. However, considerable confusion remains about the representation invariance mechanisms of these two sophisticated systems. To investigate their relationship under common conditions, we proposed a representation invariance analysis approach based on data augmentation technology. First, the original image library was expanded by data augmentation. The representation invariances of CNNs and of the ventral visual stream were then studied by comparing, before and after data augmentation, the similarities of corresponding CNN layer features and the prediction performance of visual encoding models based on functional magnetic resonance imaging (fMRI). Our experimental results suggest that the architecture of CNNs, i.e., the combination of convolutional and fully connected layers, gives rise to the representation invariance of CNNs. Remarkably, we found that representation invariance is present at all successive stages of the ventral visual stream. Hence, an internal correspondence between CNNs and the human visual system with respect to representation invariance was revealed. Our study promotes the advancement of invariant representation in computer vision and a deeper understanding of the representation invariance mechanism in human visual information processing.

Keywords: CNNs; data augmentation; fMRI visual encoding model; human visual information processing; representation invariance.


Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Figures

Figure 1
Feature representation and data augmentation technology. (A) Invariant representation and equivariant representation. The original image is rotated and translated to generate new images. The upward direction denotes invariant representation, mostly of high-level semantic features such as category; for example, "6" will be classified as 6 or 9. Conversely, the downward direction denotes equivariant representation, mostly of low-level features such as position and texture. (B) Data augmentation technology. The data augmentation image library (right) is obtained from the original images (left) by cropping, scaling, and flipping sequentially.
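The paper does not specify how the augmentation pipeline was implemented; as a rough sketch of the sequential crop, scale, and flip procedure in panel (B), the snippet below uses torchvision transforms, with the crop size, target resolution, and flip direction chosen purely for illustration.

```python
# Illustrative augmentation pipeline (crop -> scale -> flip), assuming torchvision.
# The specific sizes and the horizontal flip are assumptions, not details from the paper.
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.CenterCrop(400),              # 1. crop the original image
    transforms.Resize((227, 227)),           # 2. scale to AlexNet's expected input size
    transforms.RandomHorizontalFlip(p=1.0),  # 3. flip to complete the augmented version
])

original = Image.open("stimulus_0001.png").convert("RGB")  # hypothetical stimulus file
augmented = augment(original)                # one image of the data augmentation library
```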
Figure 2
Visual encoding models. (A) The CNN-linear model. The first step is a nonlinear mapping in which a pre-trained CNN (i.e., AlexNet) represents image features to construct the feature space. The second step is the linear mapping from the feature space to the brain activity space. In the brain activity space, differently colored dots represent the responses of different voxels, and each dot represents the response of a voxel to one image. (B) The CNN-TL model. The first step is the same as in the CNN-linear model, but the second step is a nonlinear mapping achieved by transfer learning.
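As a minimal sketch of the CNN-linear model (fixed AlexNet features followed by a linear map to voxel responses), the snippet below fits a ridge regression from one convolutional layer's activations to fMRI responses. The layer index, the regularization strength, and the placeholder arrays are assumptions; the CNN-TL variant would instead learn a nonlinear head on top of the same features via transfer learning.

```python
# Sketch of a CNN-linear encoding model: pre-trained AlexNet features -> linear map to voxels.
import numpy as np
import torch
from torchvision.models import alexnet
from sklearn.linear_model import Ridge

cnn = alexnet(weights="IMAGENET1K_V1").eval()  # assumes torchvision >= 0.13

def layer_features(images, layer_index=5):
    """Flattened activations from one AlexNet feature layer (index is illustrative)."""
    with torch.no_grad():
        acts = cnn.features[:layer_index + 1](images)  # images: (N, 3, 227, 227)
    return acts.flatten(start_dim=1).numpy()           # (N, n_features)

# Placeholder data standing in for the stimulus images and measured voxel responses.
stimuli = torch.rand(20, 3, 227, 227)
voxel_responses = np.random.rand(20, 100)              # (N images, n voxels)

features = layer_features(stimuli)
encoder = Ridge(alpha=1.0).fit(features, voxel_responses)  # one linear weight set per voxel
predicted = encoder.predict(features)                      # predicted brain activity space
```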
Figure 3
An analysis method of human visual invariance based on the fMRI visual encoding model. The original encoding model and the augmented encoding model are trained on the original image library and the data augmentation image library, respectively. The prediction accuracies of each voxel under the two models are then compared. If the prediction accuracies of voxel 1 (orange), voxel 2 (purple), and voxel 3 (green) rise, remain unchanged, and decline, respectively, there is a decreasing tendency of representation invariance from voxel 1 to voxel 3.
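A small sketch of that per-voxel comparison, assuming prediction accuracy is the Pearson correlation between predicted and measured responses on held-out images (a common choice for fMRI encoding models; the metric and the placeholder arrays below are assumptions, not details taken from the paper).

```python
# Compare per-voxel prediction accuracy of the original vs. augmented encoding models.
import numpy as np

def voxel_accuracies(measured, predicted):
    """Pearson correlation per voxel between measured and predicted responses."""
    m = measured - measured.mean(axis=0)
    p = predicted - predicted.mean(axis=0)
    return (m * p).sum(axis=0) / (np.linalg.norm(m, axis=0) * np.linalg.norm(p, axis=0))

rng = np.random.default_rng(0)
measured = rng.standard_normal((50, 3))        # placeholder held-out responses (images x voxels)
pred_original = rng.standard_normal((50, 3))   # placeholder predictions, original encoding model
pred_augmented = rng.standard_normal((50, 3))  # placeholder predictions, augmented encoding model

delta = voxel_accuracies(measured, pred_augmented) - voxel_accuracies(measured, pred_original)
# delta > 0: accuracy rises; ~0: unchanged; < 0: declines -- the three cases sketched in Figure 3.
```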
Figure 4
Feature similarities. Each box shows the feature similarities of a pair of corresponding CNN layers before and after data augmentation.
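The similarity measure is not spelled out in this excerpt; one plausible reading, sketched below, is the correlation between a layer's activations for an original image and for its augmented counterpart, computed layer by layer. The AlexNet layer indexing and the random placeholder images are assumptions.

```python
# Per-layer similarity of AlexNet activations for an original image vs. its augmented version.
import numpy as np
import torch
from torchvision.models import alexnet

cnn = alexnet(weights="IMAGENET1K_V1").eval()

def layer_activation(image, layer_index):
    """Flattened activations of one module in AlexNet's feature stack."""
    with torch.no_grad():
        return cnn.features[:layer_index + 1](image.unsqueeze(0)).flatten().numpy()

original = torch.rand(3, 227, 227)   # placeholder original stimulus
augmented = torch.rand(3, 227, 227)  # placeholder augmented counterpart of the same stimulus

for layer_index in range(len(cnn.features)):
    a = layer_activation(original, layer_index)
    b = layer_activation(augmented, layer_index)
    similarity = np.corrcoef(a, b)[0, 1]  # one layer's contribution to a box in Figure 4
    print(f"layer {layer_index}: similarity = {similarity:.3f}")
```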
Figure 5
CNN layer preferences of the ventral visual stream in the visual encoding models. (A) CNN layer preferences of the ventral visual stream in the CNN-linear model and the augmented CNN-linear model. The CNN layer preferences of each ROI are shown in a single column; the colored bars within each column show the contribution to the mean prediction accuracy of all voxels in that ROI. (B) CNN layer preferences of the ventral visual stream in the CNN-TL model and the augmented CNN-TL model.
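"Layer preference" here presumably means which CNN layer's features best predict a given voxel. A minimal sketch of that bookkeeping follows; the per-layer accuracy matrix is filled with random placeholders, whereas in practice it would come from fitting one encoding model per layer as sketched after Figure 2.

```python
# Layer preference per voxel: the CNN layer whose encoding model predicts that voxel best.
import numpy as np

rng = np.random.default_rng(1)
# acc_by_layer[l, v]: prediction accuracy of layer l's encoding model for voxel v (placeholders).
acc_by_layer = rng.random((8, 100))              # e.g. 8 AlexNet layers, 100 voxels in one ROI

preferred_layer = acc_by_layer.argmax(axis=0)    # each voxel's preferred layer
best_acc = acc_by_layer.max(axis=0)

# Each layer's share of the ROI's summed best accuracy -- one stacked column as in Figure 5.
contribution = np.array([
    best_acc[preferred_layer == layer].sum() for layer in range(acc_by_layer.shape[0])
]) / best_acc.sum()
print(np.round(contribution, 3))
```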
Figure 6
Prediction accuracies of the visual encoding models for the ventral visual stream. (A) Prediction accuracies of the CNN-linear model and the augmented CNN-linear model. The red column and the blue column show the prediction accuracies of the CNN-linear model and the augmented CNN-linear model for a single ROI, respectively. (B) Prediction accuracies of the CNN-TL model and the augmented CNN-TL model. The red column and the blue column show the prediction accuracies of the CNN-TL model and the augmented CNN-TL model for a single ROI, respectively.
Figure 7
Advantage comparisons of the visual encoding models. Each sector shows the proportion of voxels in one ROI that can be predicted correctly by the encoding models under comparison. Each color within a sector indicates the proportion of voxels best predicted by one encoding model. The CNN-linear model, augmented CNN-linear model, CNN-TL model, and augmented CNN-TL model are shown in dark blue, light blue, red, and orange, respectively.

