Reinforcement learning on slow features of high-dimensional input streams

Robert Legenstein et al. PLoS Comput Biol. 2010 Aug 19;6(8):e1000894.
doi: 10.1371/journal.pcbi.1000894
Abstract

Humans and animals are able to learn complex behaviors based on a massive stream of sensory information from different modalities. Early animal studies have identified learning mechanisms that are based on reward and punishment such that animals tend to avoid actions that lead to punishment whereas rewarded actions are reinforced. However, most algorithms for reward-based learning are only applicable if the dimensionality of the state-space is sufficiently small or its structure is sufficiently simple. Therefore, the question arises how the problem of learning on high-dimensional data is solved in the brain. In this article, we propose a biologically plausible generic two-stage learning system that can directly be applied to raw high-dimensional input streams. The system is composed of a hierarchical slow feature analysis (SFA) network for preprocessing and a simple neural network on top that is trained based on rewards. We demonstrate by computer simulations that this generic architecture is able to learn quite demanding reinforcement learning tasks on high-dimensional visual input streams in a time that is comparable to the time needed when an explicit highly informative low-dimensional state-space representation is given instead of the high-dimensional visual input. The learning speed of the proposed architecture in a task similar to the Morris water maze task is comparable to that found in experimental studies with rats. This study thus supports the hypothesis that slowness learning is one important unsupervised learning principle utilized in the brain to form efficient state representations for behavioral learning.
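
The core of the first learning stage is slow feature analysis (SFA). As a rough illustration of the slowness principle only (not the authors' hierarchical implementation), the following sketch computes a linear SFA projection with NumPy; the function name and the variance tolerance are assumptions made for this example.

    import numpy as np

    def linear_sfa(X, n_features):
        """Linear SFA sketch: find projections of the input time series X
        (time along axis 0) whose outputs vary as slowly as possible,
        subject to unit variance and decorrelation."""
        X = X - X.mean(axis=0)
        # Whiten the input, discarding directions with negligible variance.
        eigval, eigvec = np.linalg.eigh(np.cov(X, rowvar=False))
        keep = eigval > 1e-8
        W_white = eigvec[:, keep] / np.sqrt(eigval[keep])
        Z = X @ W_white
        # In the whitened space, the slowest directions are the eigenvectors of
        # the covariance of the temporal differences with smallest eigenvalues.
        dZ = np.diff(Z, axis=0)
        d_eigval, d_eigvec = np.linalg.eigh(np.cov(dZ, rowvar=False))  # ascending
        return W_white @ d_eigvec[:, :n_features]  # maps raw input to slow features

In the article this principle is applied within a hierarchical network of nodes with small receptive fields (Figure 3) rather than to the full 155 × 155 image at once.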


Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1. The learning system and the simulation setup.
The learning system (gray box) consists of a hierarchical slow feature analysis (SFA) network, which reduces the dimensionality of the high-dimensional visual input. This reduction is trained in an unsupervised manner. The extracted features from the SFA network serve as inputs for a small neural network that produces the control commands. This network is trained by simple reward-modulated learning. We tested the learning system in a closed-loop setup. The system controlled an agent in an environment (universe). The state of the environment was accessible to the learning system via a visual sensory stream of dimension 155 × 155. A reward signal was made accessible to the control network for learning.
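
To make the closed loop concrete, the following sketch runs one episode under assumed interfaces: a hypothetical Universe object with reset() and step() methods, the linear_sfa projection from the sketch above standing in for the hierarchical SFA network, and a controller with select_action() and update() methods (one possible controller is sketched after Figure 5). None of these names are taken from the paper.

    import numpy as np

    def run_episode(universe, W_sfa, controller, max_steps=500):
        """One closed-loop episode: observe a 155 x 155 frame, map it to slow
        features, act, and learn from the scalar reward signal."""
        image = universe.reset()
        features = image.reshape(-1) @ W_sfa
        for t in range(max_steps):
            action = controller.select_action(features)
            image, reward, done = universe.step(action)
            next_features = image.reshape(-1) @ W_sfa
            controller.update(features, action, reward, next_features, done)
            features = next_features
            if done:
                return t + 1  # escape latency in simulation time steps
        return max_steps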
Figure 2. Examples of the visual input to the learning system for the variable-targets task.
The scene consists of three objects: the agent (fish), an object that indicates the location of the target, and a second object that acts as a distractor. As indicated in the figure, the target object depends on the fish identity. For the fish identity shown in the upper panels the target is always the disk, whereas for the other fish identity the target is the cross. In the visual input for the water-maze task the target and the distractor are not present, and the agent representation is the non-rotated image of the fish type shown in the upper panels.
Figure 3. Model architecture and stimuli.
An input image is fed into the hierarchical network. The circles in each layer symbolize the overlapping receptive fields, which converge towards the top layer. The same set of processing steps is applied at each layer, as visualized on the right-hand side.
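
A minimal sketch of how one such layer can be organized: overlapping receptive fields are cut out of each frame and the same projection is applied to every field. The patch size, stride, and the use of a single shared linear projection per layer are illustrative assumptions, not the node type or values used in the paper.

    import numpy as np

    def extract_patches(image, patch, stride):
        """Cut an image into overlapping square receptive fields."""
        h, w = image.shape
        fields = []
        for r in range(0, h - patch + 1, stride):
            for c in range(0, w - patch + 1, stride):
                fields.append(image[r:r + patch, c:c + patch].reshape(-1))
        return np.array(fields)

    def layer_output(frames, W_node, patch=15, stride=5):
        """Apply the same projection W_node to every receptive field of every
        frame; return one stacked feature vector per frame for the next layer."""
        outputs = []
        for image in frames:
            fields = extract_patches(image, patch, stride)   # (n_fields, patch*patch)
            outputs.append((fields @ W_node).reshape(-1))    # shared weights per field
        return np.array(outputs)

Stacking several such layers, each trained on the outputs of the previous one, yields the converging hierarchy shown in the figure.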
Figure 4. Receptive fields of nodes in layer 3.
Each dot represents the 32-dimensional SFA output from one node. The receptive fields overlap by 2 nodes, and their borders are represented by the black lines between the dots.
Figure 5. Performance of the learning system in the Morris water maze task with Q-learning.
A) Mean escape latency (in simulation time steps) as a function of learning episodes for 10 independent sets of episodes (thick solid line). The thin dashed line indicates the standard deviation. B) The navigation map of the system after training. The vectors indicate the movement directions the system would most likely choose at the given positions in the water maze. An episode ended successfully when the center of the fish reached the area indicated by the gray disk.
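
One possible controller for the run_episode sketch above is Q-learning with linear value approximation on the slow features. The class below is a hedged sketch: the learning rate, discount factor, exploration rate, and number of actions are assumptions for illustration, not parameters taken from the paper.

    import numpy as np

    class LinearQController:
        """Q-learning with linear function approximation on slow features."""
        def __init__(self, n_features, n_actions, alpha=0.01, gamma=0.95, eps=0.1):
            self.W = np.zeros((n_actions, n_features))
            self.alpha, self.gamma, self.eps = alpha, gamma, eps

        def select_action(self, features):
            if np.random.rand() < self.eps:             # explore
                return np.random.randint(self.W.shape[0])
            return int(np.argmax(self.W @ features))    # exploit greedily

        def update(self, features, action, reward, next_features, done):
            # Standard Q-learning TD update on the linear action-value estimate.
            target = reward
            if not done:
                target += self.gamma * np.max(self.W @ next_features)
            td_error = target - self.W[action] @ features
            self.W[action] += self.alpha * td_error * features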
Figure 6. Rewards and escape latencies during training of the control task with target and distractor.
A) Evolution of reward during training. One simulation step for all 100 parallel traces corresponds to 100 time steps on the x-axis. The plotted values are averages over 20,000 consecutive time steps. B) Evolution of escape latencies (measured in time steps) during training. The number of episodes on the x-axis is the number of completed traces. The plotted values are averages over 1,200 consecutive episodes. C, D) Same as panels A and B, but learning was performed on a highly condensed and precise state encoding instead of the SFA network output. Shown is the performance for learning on 100 parallel traces (black solid line) and without parallel traces (gray dashed line). Convergence is comparable to learning on SFA outputs, and the results without parallel traces are very similar to those with parallel traces.
Figure 7. Three representative trajectories after training of the control task with target and distractor.
Each row summarizes one representative trial. Shown are the visual input at the start position (left column), the visual input when the goal was reached (middle column), and the whole trajectory (right column). In the trajectory, fish positions (small black discs), the target region (large circle), and the distractor location (gray rectangle) are shown.
Figure 8. Performance of a PCA-based hierarchical network.
Rewards (A) and escape latencies (B) in the variable-targets control experiment with a PCA-based hierarchical network. The control network is not able to learn the task based on this state representation. Note the larger scaling factor for the time axis in panel A.
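
For this control, the slowness objective is replaced by a variance objective while the rest of the pipeline stays the same. The sketch below only illustrates that substitution with a plain linear PCA projection; it is not the PCA network used in the paper.

    import numpy as np

    def linear_pca(X, n_features):
        """Project onto the n_features directions of largest variance,
        as a drop-in replacement for the linear_sfa sketch above."""
        X = X - X.mean(axis=0)
        eigval, eigvec = np.linalg.eigh(np.cov(X, rowvar=False))  # ascending
        return eigvec[:, -n_features:][:, ::-1]  # largest-variance directions first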
