PLoS Comput Biol. 2019 Apr 23;15(4):e1006897. doi: 10.1371/journal.pcbi.1006897. eCollection 2019 Apr.

Deep convolutional models improve predictions of macaque V1 responses to natural images

Santiago A Cadena et al. PLoS Comput Biol.

Abstract

Despite great efforts over several decades, our best models of primary visual cortex (V1) still predict spiking activity quite poorly when probed with natural stimuli, highlighting our limited understanding of the nonlinear computations in V1. Recently, two approaches based on deep learning have emerged for modeling these nonlinear computations: transfer learning from artificial neural networks trained on object recognition and data-driven convolutional neural network models trained end-to-end on large populations of neurons. Here, we test the ability of both approaches to predict spiking activity in response to natural images in V1 of awake monkeys. We found that the transfer learning approach performed similarly well to the data-driven approach and both outperformed classical linear-nonlinear and wavelet-based feature representations that build on existing theories of V1. Notably, transfer learning using a pre-trained feature space required substantially less experimental time to achieve the same performance. In conclusion, multi-layer convolutional neural networks (CNNs) set the new state of the art for predicting neural responses to natural images in primate V1, and deep features learned for object recognition are better explanations for V1 computation than all previous filter bank theories. This finding underscores the need for V1 models that are multiple nonlinearities away from the image domain, and it supports the idea of explaining early visual cortex based on high-level functional goals.


Conflict of interest statement

The authors have declared no competing interests.

Figures

Fig 1. Stimulus paradigm.
A: Classes of images shown in the experiment. We used grayscale natural images (labeled ‘original’) from the ImageNet dataset [44] along with textures synthesized from these images using the texture synthesis algorithm described by [45]. Each row shows four synthesized versions of three example original images using different convolutional layers (see Materials and Methods for details). Lower convolutional layers capture more local statistics compared to higher ones. B: Stimulus sequence. In each trial, we showed a randomized sequence of images (each displayed for 60 ms covering 2 degrees of visual angle) centered on the receptive fields of the recorded neurons while the monkey sustained fixation on a target. The images were masked with a circular mask with cosine fadeout.
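The circular mask with cosine fadeout described above can be sketched in numpy as follows; this is a minimal illustration, and the inner-radius fraction at which the fadeout begins (`inner_frac`) is an assumption, not a parameter reported in the caption.

```python
import numpy as np

def cosine_mask(size, inner_frac=0.8):
    """Circular mask with a raised-cosine fadeout from inner_frac*R to R.

    Pixels inside inner_frac*R are 1, pixels beyond the radius R are 0,
    and the transition in between follows a half-cosine ramp.
    """
    r_max = size / 2.0
    yy, xx = np.mgrid[:size, :size]
    r = np.hypot(yy - (size - 1) / 2.0, xx - (size - 1) / 2.0)
    inner = inner_frac * r_max
    ramp = np.clip((r - inner) / (r_max - inner), 0.0, 1.0)
    return 0.5 * (1.0 + np.cos(np.pi * ramp))

# Multiply an image elementwise by the mask before display.
mask = cosine_mask(40)
```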
Fig 2. V1 electrophysiological responses.
A: Isolated single-unit activity. We performed acute recordings with a 32-channel linear array (NeuroNexus V1x32-Edge-10mm-60-177, layout shown on the left) in primary visual cortex of two awake, fixating macaques. The channel mean-waveform footprints of the spiking activity of 23 well-isolated neurons in one example session are shown in the central larger panel. The upper panel shows color-matched autocorrelograms. B: Peri-stimulus time histograms (PSTH) of four example neurons from A. Spike counts were binned at 1 ms resolution, aligned to the onset of each stimulus image, and averaged over trials. The 60 ms interval during which the image was displayed is shown in red. We ignored the temporal profile of the response and extracted spike counts for each image in the 40–100 ms interval after image onset (shown in light gray). C: The Response-Triggered Average (RTA), calculated by reverse correlation of the extracted responses.
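Extracting a single spike count per image from a fixed post-onset window, as described above, can be sketched as follows (a minimal numpy illustration; the array layout is an assumption):

```python
import numpy as np

def extract_counts(spike_times, image_onsets, window=(0.040, 0.100)):
    """Count spikes of one neuron in a fixed window after each image onset.

    spike_times and image_onsets are 1-D arrays in seconds; returns one
    count per image, ignoring the temporal profile within the window.
    """
    spike_times = np.sort(spike_times)
    lo = np.searchsorted(spike_times, image_onsets + window[0], side="left")
    hi = np.searchsorted(spike_times, image_onsets + window[1], side="right")
    return hi - lo
```

The `searchsorted` pair avoids an explicit loop over images, which matters when a session contains thousands of stimulus presentations.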
Fig 3. Our proposed model based on VGG-19 features.
VGG-19 [28] (gray background) is a trained CNN that takes an input image and produces a class label. For each of the 16 convolutional layers of VGG-19, we extract the feature representations (feature maps) of the images shown to the monkey. We then train, for each recorded neuron and convolutional layer, a Generalized Linear Model (GLM) that uses the feature maps as input to predict the observed spike counts. The GLM consists of a linear projection (dot product) of the feature maps, a pointwise nonlinearity, and an assumed noise distribution (Poisson) that determines the optimization loss for training. We additionally imposed strong regularization constraints on the readout weights (see text).
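The GLM objective described in this caption (linear projection, pointwise nonlinearity, Poisson loss) can be sketched as follows; the exponential nonlinearity is an assumed choice for illustration, and the regularization terms are omitted:

```python
import numpy as np

def glm_poisson_nll(features, weights, bias, counts):
    """Poisson negative log-likelihood of a GLM readout on CNN features.

    features: (n_images, n_features) flattened feature maps of one layer;
    weights: (n_features,) readout vector; counts: observed spike counts.
    The log-factorial term of the Poisson likelihood is dropped because
    it does not depend on the model parameters.
    """
    rate = np.exp(features @ weights + bias)  # pointwise nonlinearity
    return np.mean(rate - counts * np.log(rate + 1e-12))
```

Minimizing this loss over `weights` and `bias` (with the paper's regularizers added) yields the per-neuron readout.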
Fig 4. Model performance on test set.
Average fraction of explainable variance explained (FEV) for models using different VGG layers as nonlinear feature spaces for a GLM. Error bars show 95% confidence intervals of the population means. The model based on layer conv3_1 shows on average the highest predictive performance.
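FEV corrects both the residual and the total variance for trial-to-trial noise, so a perfect model scores 1 despite neural variability. A minimal numpy sketch, assuming repeated presentations of each image and population (ddof=0) variance estimates throughout:

```python
import numpy as np

def fev(responses, predictions):
    """Fraction of explainable variance explained for one neuron.

    responses: (n_repeats, n_images) spike counts over repeated trials;
    predictions: (n_images,) model predictions. The trial-to-trial
    (noise) variance is subtracted from both the residual error and the
    total variance, leaving only the stimulus-driven part.
    """
    noise_var = np.mean(np.var(responses, axis=0))   # across repeats
    total_var = np.var(responses)                    # across everything
    mse = np.mean((responses - predictions[None, :]) ** 2)
    return 1.0 - (mse - noise_var) / (total_var - noise_var)
```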
Fig 5. VGG-19 based model performance at different input scales.
The test-set performance of cross-validated models that use layers conv2_1 to conv3_3 as feature spaces, for different input resolutions. With the original scale used in Fig 4, we assumed that VGG-19 was trained with a 6.4-degree field of view. Scaling this resolution by factors of 0.67 and 1.5 justified the original choice of resolution for further analysis. At the bottom, the receptive field sizes in pixels of the different layers are shown.
Fig 6. Learned readout weights with different regularization modes.
A. For five example neurons (rows), the five highest-energy spatial readouts out of the 256 feature maps of conv3_1 for each regularization scheme we explored. Feature map weights are 10 × 10 for a 40 × 40 input (∼ 1.1°). The full model exhibits the most localized and smooth receptive fields. The scale bar is shared by all feature maps of each model and neuron, from their minimum (white) to their maximum (dark). B. The highest normalized spatial energy of the learned readouts as a function of ordered feature maps (first 70 out of 256 of conv3_1 shown), averaged over all cells. With regularization, only a few feature maps are used for prediction, and the cumulative energy quickly asymptotes at 1.
Fig 7. Data-driven convolutional network model.
We trained a convolutional neural network to produce a feature space fed to a GLM-like model. In contrast to the VGG-based model, both the feature space and the readout weights are trained only on the neural data. A. Three-layer architecture with a factorized readout [40], used for comparison with the other models. B. Performance of the data-driven approach on held-out data as a function of the number of convolutional layers. Three convolutional layers provided the best performance on the validation set. See Methods for details.
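The factorized readout [40] constrains each neuron's readout tensor to the outer product of a spatial mask and a feature-weight vector, greatly reducing the number of parameters compared with a dense (C, H, W) weight per neuron. A minimal numpy sketch for a single image and neuron (array shapes are assumptions for illustration):

```python
import numpy as np

def factorized_readout(feature_maps, spatial, feature_w):
    """Factorized readout: spatial mask (where) x feature weights (what).

    feature_maps: (C, H, W) CNN output for one image;
    spatial: (H, W) spatial mask; feature_w: (C,) feature weights.
    Returns the scalar readout feeding the pointwise nonlinearity.
    """
    # Pool each channel with the spatial mask, then weight the channels.
    pooled = np.tensordot(feature_maps, spatial, axes=([1, 2], [0, 1]))  # (C,)
    return float(pooled @ feature_w)
```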
Fig 8. Deep models are the new state of the art.
A: Randomly selected cells. The normalized explainable variance (oracle) per cell is shown in gray. For each cell, from left to right: the variance explained by the regularized LNP [46], the GFB [22, 47, 48], the three-layer CNN trained on neural responses, and the VGG conv3_1 model (ours). B: The CNN and VGG conv3_1 models outperform the LNP and GFB for most cells. The black line denotes the identity. Performance is given in FEV. C: VGG conv3_1 features perform slightly better than the three-layer CNN. D: Performance of the four models in fraction of explainable variance explained (FEV), averaged across neurons. Error bars show 95% confidence intervals of the population means. All models differ significantly from each other (Wilcoxon signed-rank test, n = 166; family-wise error rate α = 0.05 using the Holm-Bonferroni method to account for multiple comparisons).
Fig 9. Relationship between model performance and neurons’ tuning properties.
A. A sample subset of the Gabor stimuli with a rich diversity of frequencies, orientations, and phases. B. Dots: orientation tuning curves of 80 sample neurons predicted by our CNN model. Tuning curves were computed at the optimal spatial frequency and phase for each neuron. Lines: von Mises fits. C. Difference in performance between pairs of the four models as a function of tuning width. Tuning width was defined as the full width at half maximum of the fitted tuning curve. D. Dots: phase tuning curves of the same 80 sample neurons as in B, predicted by our CNN model. Tuning curves were computed at the optimal spatial frequency and orientation for each neuron. Lines: cosine tuning curves with fitted amplitude and offset (see Methods). E. As in C, the difference in performance between pairs of models, here as a function of the neurons' linearity index. Linearity index: ratio of the amplitude of the cosine to its offset (0: complex; 1: simple). F. Performance comparison between the GFB and LNP models. Red: simple cells (top 16% linearity, linearity > 0.3); blue: complex cells (bottom 28% linearity, linearity < 0.04).
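The linearity index defined in E (amplitude of the fitted cosine over its offset) can be sketched as follows; this Fourier-based least-squares cosine fit assumes evenly spaced phases and is an illustration, not the paper's exact fitting procedure:

```python
import numpy as np

def linearity_index(phases, responses):
    """Amplitude/offset of a cosine fit to a phase tuning curve.

    For evenly spaced phases, the least-squares cosine amplitude equals
    twice the magnitude of the first Fourier component (F1), and the
    offset is the mean response (F0). 0 ~ complex cell, 1 ~ simple cell.
    """
    z = np.mean(responses * np.exp(1j * phases))
    amplitude = 2.0 * np.abs(z)   # F1 component
    offset = np.mean(responses)   # F0 (mean rate)
    return amplitude / offset
```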
Fig 10. Training and evaluation on the different stimulus types.
For both the conv3_1-feature VGG-19 model (left) and the CNN-based model, we trained on all stimulus types combined and on each individual stimulus type (rows; see Fig 1), and tested on all types combined and on each individual type. The VGG model showed good domain transfer in general. The same was true for the data-driven CNN model, although it performed worse overall when trained on only one set of images due to the smaller training sample. There were no substantial differences in performance across image statistics.
