Review

Front Comput Neurosci. 2021 Jul 21;15:686239. doi: 10.3389/fncom.2021.686239. eCollection 2021.

Learning Invariant Object and Spatial View Representations in the Brain Using Slow Unsupervised Learning

Edmund T Rolls

Abstract

First, neurophysiological evidence for the learning of invariant representations in the inferior temporal visual cortex is described. This includes object and face representations with invariance for position, size, lighting, view and morphological transforms in the temporal lobe visual cortex; global object motion in the cortex in the superior temporal sulcus; and spatial view representations in the hippocampus that are invariant with respect to eye position, head direction, and place. Second, computational mechanisms that enable the brain to learn these invariant representations are proposed. For the ventral visual system, one key adaptation is the use of information available in the statistics of the environment in slow unsupervised learning to learn transform-invariant representations of objects. This contrasts with deep supervised learning in artificial neural networks, which uses training with thousands of exemplars forced into different categories by neuronal teachers. Similar slow learning principles apply to the learning of global object motion in the dorsal visual system leading to the cortex in the superior temporal sulcus. The learning rule that has been explored in VisNet is an associative rule with a short-term memory trace. The feed-forward architecture has four stages, with convergence from stage to stage. This type of slow learning is implemented in the brain in hierarchically organized competitive neuronal networks with convergence from stage to stage, with only 4-5 stages in the hierarchy. Slow learning is also shown to help the learning of coordinate transforms using gain modulation in the dorsal visual system extending into the parietal cortex and retrosplenial cortex. Representations are learned that are in allocentric spatial view coordinates of locations in the world and that are independent of eye position, head direction, and the place where the individual is located. This enables hippocampal spatial view cells to use idiothetic (self-motion) signals for navigation when the view details are obscured for short periods.
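The trace learning rule referred to above can be written in its simplest form as a weight change proportional to the presynaptic input and a short-term memory trace of the postsynaptic firing, with the trace updated as y_trace(t) = (1 - eta) * y(t) + eta * y_trace(t - 1). The following minimal NumPy sketch illustrates the idea; the parameter values, stimulus construction, and normalization are illustrative assumptions, not the VisNet implementation.

import numpy as np

rng = np.random.default_rng(0)

def trace_rule_update(w, x, y_trace, y, alpha=0.1, eta=0.8):
    # Short-term memory trace of postsynaptic firing:
    #   y_trace(t) = (1 - eta) * y(t) + eta * y_trace(t - 1)
    y_trace = (1.0 - eta) * y + eta * y_trace
    # Associative update: current presynaptic input x is linked to
    # the trace, so transforms of the same object seen close together
    # in time strengthen onto the same output neuron.
    w = w + alpha * y_trace * x
    w /= np.linalg.norm(w)  # weight normalization keeps w bounded
    return w, y_trace

# Toy demo: four overlapping shifted "transforms" of one object are
# presented close together in time, sharing one trace.
n_inputs = 50
w = rng.random(n_inputs)
w /= np.linalg.norm(w)
bar = np.zeros(n_inputs)
bar[:10] = 1.0
y_trace = 0.0
for shift in (0, 5, 10, 15):          # temporal sequence of transforms
    x = np.roll(bar, shift)
    y = float(w @ x)                  # postsynaptic activation
    w, y_trace = trace_rule_update(w, x, y_trace, y)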

Keywords: convolutional neural network; face cells; hippocampus; inferior temporal visual cortex; navigation; object recognition; spatial view cells; unsupervised learning.


Conflict of interest statement

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

FIGURE 1
Convergence in the visual system. (Right) Convergence in the ventral stream cortical hierarchy for object recognition. LGN, lateral geniculate nucleus; V1, visual cortex area V1; TEO, posterior inferior temporal cortex; TE, anterior inferior temporal cortex (IT). (Left) Convergence as implemented in VisNet, the model of invariant visual object recognition described here. Convergence through the hierarchical feedforward network is designed to provide Layer 4 neurons with information from across the entire input retina, by providing an increase in receptive field size of 2.5 times at each stage. Layer 1 of the VisNet model corresponds to V2 in the brain, and Layer 4 to the anterior inferior temporal visual cortex (TE). In this paper, ‘Layer’ with a capital L indicates a Layer of a neuronal network, which may correspond to a brain region as here. This is distinct from the 6 architectonic layers in neocortex, designated here with a small letter l in ‘layer’.
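As a worked example of the convergence factor (the starting receptive field size of 8 retinal units here is an illustrative assumption, not the model's actual filter size): a 2.5-fold increase per stage gives 8 → 20 → 50 → 125 units by Layer 4, enough to span a retina roughly 128 units across.

# Illustrative arithmetic for the 2.5x convergence factor; the
# Layer 1 receptive field of 8 units is an assumed starting value.
rf = 8.0
for layer in range(1, 5):
    print(f"Layer {layer}: receptive field ~ {rf:.0f} retinal units")
    rf *= 2.5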
FIGURE 2
Encoding of information in intermediate Layers of VisNet. The 13 stimuli used to investigate independent coding of different feature combinations by different neurons in intermediate Layers of VisNet. Each of the 13 stimuli was a different feature, or a combination of adjacent features, that VisNet learned as a different object, demonstrating that VisNet can learn to represent objects as different even when they have overlapping features. Moreover, these feature combination neurons could be used, by further combination in higher Layers of VisNet, to represent more complex objects. (After Rolls and Mills, 2018).
FIGURE 3
Deformation-invariant object recognition. The flag stimuli used to train VisNet to demonstrate deformation-invariant object recognition. Each flag is shown with different wind forces and rotations. Starting on the left with the first pair of images for each flag, both the 0° and 180° views are shown for a wind force of 0; each successive pair is shown with the wind force increased by 50 Blender units. VisNet learned to categorize these 4 flags as 4 different flags provided that the different deformations of each flag were shown close together in the temporal sequence during training, to make use of the trace learning rule. (After Webb and Rolls, 2014).
FIGURE 4
Learning non-accidental properties of objects. The stimuli used to investigate non-accidental properties (NAP) vs. metric properties (MP) of encoding in VisNet. Each object is shown as white on a gray background. Objects 1–3 all have the non-accidental property of concave edges, and differ in their metric properties, the amount of curvature. Object 4 has the non-accidental property of parallel edges. Objects 5–7 have the non-accidental property of convex edges, and differ from each other in their metric properties, the amount of convexity. The vertical view of each object was at 0° of tilt, with the images at –6° and 6° of tilt illustrated. Different amounts of tilt of the top toward or away from the viewer are shown at the tilt angles indicated. Each object was thin, and was cut off near the top and bottom to ensure that no view of the top or bottom of the object appeared, so that the type of curvature of the edges (concave, straight, or convex) was the main cue available. (After Rolls and Mills, 2018).
FIGURE 5
Cortical architecture for hierarchical and attention-based visual perception. The system has six modules organized so that they resemble the ventral visual stream (Left) and dorsal visual stream (Right) of the primate visual system. Information from the lateral geniculate nucleus (LGN) enters V1. The ventral visual stream leads through V2–V4 to the inferior temporal visual cortex (IT), and is mainly concerned with object recognition. The dorsal visual stream leads via areas such as MT into the posterior parietal cortex (PP), and in this model is involved in maintaining a spatial map of an object’s location. The solid lines with arrows between levels show the forward connections, and the dashed lines the top-down backprojections. Short-term memory systems in the prefrontal cortex (PF46) apply top-down attentional bias to the object (from PFv) or spatial (from PFd) processing streams. (After Deco and Rolls, 2004).
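A minimal sketch of how a top-down bias can decide a competition between two processing streams (a deliberately simplified two-pool rate model, not the mean-field formulation of Deco and Rolls, 2004; all constants are illustrative assumptions):

import numpy as np

def biased_competition(inputs, bias, n_steps=200, dt=0.1, tau=1.0):
    # Two pools compete via mutual inhibition; a small top-down
    # bias to one pool decides the competition.
    y = np.zeros(2)
    w_inh = 0.8                              # mutual inhibition strength
    for _ in range(n_steps):
        drive = inputs + bias - w_inh * y[::-1]
        y += dt / tau * (-y + np.maximum(drive, 0.0))
    return y

# Equal bottom-up input; a top-down bias from a prefrontal
# short-term memory favors pool 0, which then suppresses pool 1.
print(biased_competition(inputs=np.ones(2), bias=np.array([0.2, 0.0])))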
FIGURE 6
Finding and recognizing objects in natural scenes. (A) Eight of the twelve test scenes. Each scene contains four objects, each shown in one of its four views. (B) The bottom-up saliency map generated by the GBVS code for one of the scenes. The highest levels in the saliency map are red, and the lowest blue. (C) Rectangles (384 × 384 pixels) placed around each saliency peak in the scene whose bottom-up saliency map is illustrated in (B). (After Rolls and Webb, 2014).
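A sketch of the peak-and-crop step, assuming a precomputed 2D saliency map (e.g., from GBVS); the function name and the peak-selection details are illustrative assumptions, not the published code:

import numpy as np
from scipy.ndimage import maximum_filter

def fixation_windows(saliency, n_peaks=4, win=384):
    # Find local maxima of the saliency map and return square
    # (win x win) windows centered on the strongest peaks.
    h, w = saliency.shape
    is_peak = saliency == maximum_filter(saliency, size=win // 2)
    ys, xs = np.nonzero(is_peak)
    order = np.argsort(saliency[ys, xs])[::-1][:n_peaks]
    boxes = []
    for y, x in zip(ys[order], xs[order]):
        top = np.clip(y - win // 2, 0, max(h - win, 0))
        left = np.clip(x - win // 2, 0, max(w - win, 0))
        boxes.append((top, left, top + win, left + win))
    return boxes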
FIGURE 7
View-invariant representations by VisNet but not by HMAX. The two objects, cups, are each shown with four views. The HMAX model of Riesenhuber and Poggio (1999) fails to categorize these objects correctly because, unlike VisNet, it has no slow learning mechanism to associate together different views of the same object. (After Robinson and Rolls, 2015).
FIGURE 8
Continuous spatial transformation learning of transform-invariant visual representations of objects. This illustrates how continuous spatial transformation (CT) learning would operate in a network with forward synaptic connections between an input Layer of neurons and an output Layer. Initially the forward synaptic connection weights are set to random values. (A) The presentation of a stimulus to the network in position 1. Activation from the active (shaded black) input neurons is transmitted through the initially random forward connections to activate the neurons in the output Layer. The neuron shaded black in the output Layer wins the competition in the output Layer. The synaptic weights from the active input neurons to the active output neuron are then strengthened using an associative synaptic learning rule. (B) The situation after the stimulus is shifted by a small amount to a new, partially overlapping position 2. Because some of the active input neurons are the same as those that were active when the stimulus was presented in position 1, the same output neuron is driven by these previously strengthened synaptic afferents to win the competition. The rightmost input neuron shown in black, which is activated by the stimulus in position 2 but was inactive when the stimulus was in position 1, now has its synaptic connection to the active output neuron strengthened (denoted by the dashed line). Thus the same neuron in the output Layer has learned to respond to the two input patterns that have overlapping vector elements. The process can be continued for subsequent shifts, provided that a sufficient proportion of input neurons is activated by each new shift to activate the same output neuron. (After Stringer et al., 2006).
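A minimal sketch of CT learning with a purely associative (Hebbian) rule and winner-take-all competition; the network sizes, learning rate, and bar stimulus are illustrative assumptions. Note that no temporal trace is needed here: the overlap between successive positions is what keeps the same output neuron winning.

import numpy as np

rng = np.random.default_rng(1)
n_in, n_out = 100, 20
W = rng.random((n_out, n_in))
W /= np.linalg.norm(W, axis=1, keepdims=True)

def ct_learn(W, x, alpha=0.5):
    # The winning output neuron associatively strengthens its
    # synapses from the currently active inputs.
    winner = np.argmax(W @ x)
    W[winner] += alpha * x
    W[winner] /= np.linalg.norm(W[winner])
    return W, winner

# A bar of 10 active inputs is shifted by 2 units at a time, so
# successive positions overlap by 80% and recruit the same winner.
bar = np.zeros(n_in)
bar[:10] = 1.0
winners = []
for shift in range(0, 40, 2):
    W, winner = ct_learn(W, np.roll(bar, shift))
    winners.append(winner)
print(winners)   # largely the same output neuron across all shifts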
FIGURE 9
Invariant object-based global motion in the dorsal visual system. This shows two wheels at different locations in the visual field rotating in the same direction. One rotating wheel is presented at a time, and in the case illustrated a representation is needed that the flow field produced by the wheel is clockwise, whichever location it is in. The local flow field in V1 and V2 is ambiguous about the direction of rotation of the two wheels, because of the small receptive field size: rotation that is clockwise or counterclockwise can only be identified by a global flow computation, with larger receptive fields. The diagram shows how a network with stages like those found in the brain can solve the problem to produce position-invariant global motion-sensitive neurons by Layer 3. The computation involved is convergence from stage to stage as illustrated, combined with a short-term memory trace synaptic learning rule to help the network learn that it is the same wheel rotating in the same direction as it moves across the visual field during training (during development). This is the computational architecture of VisNet. It was demonstrated that VisNet can learn translation-invariant representations of these types of object-based motion by replacing the normal Gabor-filter input neurons in the input Layer corresponding to V1 with local optic flow motion neurons, which are also present in V1. (After Rolls and Stringer, 2006b).
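The need for a global computation can be made concrete with a little flow-field arithmetic (an illustration of the ambiguity, not the VisNet mechanism itself, which learns the invariance): the rotation sense is recoverable from the mean cross product r × v pooled over a large receptive field, whereas any single local flow vector is consistent with both directions of rotation.

import numpy as np

def rotation_sense(points, flows, center):
    # Pool local flow vectors over a large receptive field: the mean
    # cross product (r x v) gives the global rotation sense
    # (positive = counterclockwise, negative = clockwise). A single
    # small receptive field sees one local flow vector and cannot
    # determine the sign.
    r = points - center
    curl = r[:, 0] * flows[:, 1] - r[:, 1] * flows[:, 0]
    return np.sign(curl.mean())

# Points on a wheel rotating clockwise: v = (y, -x) about the center.
theta = np.linspace(0, 2 * np.pi, 16, endpoint=False)
pts = np.stack([np.cos(theta), np.sin(theta)], axis=1)
v = np.stack([pts[:, 1], -pts[:, 0]], axis=1)        # clockwise flow
print(rotation_sense(pts, v, center=np.zeros(2)))    # -1.0 (clockwise)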
FIGURE 10
Coordinate transforms in the primate dorsal visual system. Three stages of coordinate transforms that take place at different levels of the primate dorsal visual system are shown. At each stage the coordinate transform is performed by gain modulation of the receptive field by an appropriate modulator, which is usefully combined with slow learning of the type implemented in VisNet; this helps the same neurons at a particular stage to develop representations that become effectively independent of the modulating signal. In Layer 1, gain modulation by eye position combined with slow learning enables neurons to develop representations in head-centered coordinates that are invariant with respect to retinal and eye position. In Layer 2, gain modulation by head direction combined with slow learning enables neurons to develop representations in allocentric bearing coordinates (the bearing to a stimulus such as a landmark) that are invariant with respect to head direction. In Layer 3, gain modulation by the place where the individual is located combined with slow learning enables neurons to develop representations of a stimulus such as a landmark that are in allocentric spatial view coordinates, with invariance with respect to where the individual is located. The diagram shows the architecture of the VisNetCT model, in which gain modulation combined with short-term memory trace associative learning was shown to implement these transforms (Rolls, 2020). Each neuron in a Layer (or cortical area in the hierarchy) receives from neurons in a small region of the preceding Layer. It is proposed that this dorsal visual cortical stream provides idiothetic update of hippocampal spatial view cells, which is useful for navigation when the environment may not be visible for short periods (Rolls, 2020, 2021b). PCC, posterior cingulate cortex; RSC, retrosplenial cortex.
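A minimal sketch of the Layer 1 style gain modulation (the tuning widths, sizes, and training protocol are illustrative assumptions): a Gaussian retinal response is multiplied by an eye-position gain field, and during training the head-centered location is held fixed while eye position varies, so a trace rule of the kind sketched after the Abstract can associate the different gain-modulated inputs onto the same output neuron.

import numpy as np

def gain_modulated_input(retinal_pos, eye_pos, n=20, sigma=2.0):
    # Gaussian retinal response, multiplicatively gain-modulated by
    # a Gaussian eye-position gain field (illustrative tuning).
    units = np.arange(n)
    retinal = np.exp(-(units - retinal_pos) ** 2 / (2 * sigma ** 2))
    gain = np.exp(-(units - eye_pos) ** 2 / (2 * sigma ** 2))
    return np.outer(gain, retinal).ravel()

# Head-centered position = eye position + retinal position. Holding
# it fixed while the eyes move generates the training sequence over
# which a trace rule can build an eye-position-invariant response.
head_centered = 12
inputs = [gain_modulated_input(head_centered - eye, eye)
          for eye in range(4, 10)]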
