Reinforcement learning approaches to hippocampus-dependent flexible spatial navigation

Charline Tessereau et al. Brain Neurosci Adv. 2021 Apr 9;5:2398212820975634. doi: 10.1177/2398212820975634. eCollection 2021 Jan-Dec.

Abstract

Humans and non-human animals show great flexibility in spatial navigation, including the ability to return to specific locations based on as little as a single experience. To study spatial navigation in the laboratory, watermaze tasks, in which rats have to find a hidden platform in a pool of cloudy water surrounded by spatial cues, have long been used. Analogous tasks have been developed for human participants using virtual environments. Spatial learning in the watermaze is facilitated by the hippocampus. In particular, rapid, one-trial, allocentric place learning, as measured in the delayed-matching-to-place variant of the watermaze task, which requires rodents to repeatedly learn new locations in a familiar environment, is hippocampus-dependent. In this article, we review some computational principles, embedded within a reinforcement learning framework, that utilise hippocampal spatial representations for navigation in watermaze tasks. We consider which key elements underlie their efficacy, and discuss their limitations in accounting for hippocampus-dependent navigation, both in terms of behavioural performance (i.e. how well they reproduce behavioural measures of rapid place learning) and neurobiological realism (i.e. how well they map onto neurobiological substrates involved in rapid place learning). We discuss how an actor-critic architecture, enabling simultaneous assessment of the value of the current location and of the optimal direction to follow, can reproduce one-trial place learning performance as shown on watermaze and virtual delayed-matching-to-place tasks by rats and humans, respectively, if complemented with map-like place representations. The contribution of actor-critic mechanisms to delayed-matching-to-place performance is consistent with neurobiological findings implicating the striatum and hippocampo-striatal interaction in delayed-matching-to-place performance, given that the striatum has been associated with actor-critic mechanisms. Moreover, we illustrate that hierarchical computations embedded within an actor-critic architecture may help to account for aspects of flexible spatial navigation. The hierarchical reinforcement learning approach separates trajectory control via a temporal-difference error from goal selection via a goal prediction error, and may account for flexible, trial-specific navigation to familiar goal locations, as required in some arm-maze place memory tasks, although it does not capture one-trial learning of new goal locations, as observed in open-field delayed-matching-to-place tasks, including the watermaze and virtual variants. Future models of one-shot learning of new goal locations, as observed on delayed-matching-to-place tasks, should incorporate hippocampal plasticity mechanisms that integrate new goal information with allocentric place representations, as such mechanisms are supported by substantial empirical evidence.

Keywords: Morris watermaze; Reinforcement learning; computational modelling; hierarchical agent; one-shot learning; place learning and memory; spatial navigation.


Conflict of interest statement

Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Figures

Figure 1.
One-shot place learning by rats in the delayed-matching-to-place (DMP) watermaze task. (a) Rats have to learn a new goal location (the location of the escape platform) every day, completing four navigation trials to the new location each day. (b) The time taken to find the new location reduces markedly from trial 1 to trial 2, with little further improvement on trials 3 and 4 and minimal interference between days. (c) When trial 2 is run as a probe trial, during which the platform is unavailable, rats show a marked search preference for the vicinity of the goal location. To measure search preference, the watermaze surface is divided into eight equivalent, symmetrically arranged zones (stippled lines in sketch), including the ‘correct zone’ centred on the goal location (black dot). Search preference corresponds to the time spent searching in the ‘correct zone’, expressed as a percentage of the time spent in all eight zones together. The chance level is 12.5%, which corresponds to the rat spending equal time in each of the eight zones depicted in the sketch. These behavioural measures highlight successful one-shot place learning. Figure adapted from Figure 2 in Bast et al. (2009).
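To make the search preference measure concrete, the following is a minimal Python sketch of the calculation described above. The zone-occupancy times, the function name and the choice of ‘correct’ zone index are hypothetical illustrations, not data from Bast et al. (2009).

import numpy as np

def search_preference(zone_times, correct_zone=0):
    # Time in the 'correct zone' as a percentage of the time spent in all
    # eight symmetrically arranged zones; chance level = 100/8 = 12.5%.
    zone_times = np.asarray(zone_times, dtype=float)
    return 100.0 * zone_times[correct_zone] / zone_times.sum()

# Hypothetical probe-trial data: seconds spent in each of the eight zones,
# with the 'correct zone' listed first.
zone_times = [24.0, 6.0, 5.0, 4.0, 5.0, 6.0, 5.0, 5.0]
print(search_preference(zone_times))  # 40.0, well above the 12.5% chance level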
Figure 2.
Basic principles of reinforcement learning (RL). (a) Key components of RL models. An agent in the state s_t (which, in a spatial context, often corresponds to a specific location in the environment), associated with the reward r_t, takes the action a_t to move from one state to another within its environment. Depending on the available routes and on the rewards in the environment, this action leads to the receipt of a potential reward r_{t+1} in the subsequent state (or location) s_{t+1}. (b) Model-free versus model-based approaches in RL. In model-free RL (right), an agent learns the values of states on the fly, that is, by trial and error, and adjusts its behaviour accordingly (in order to maximise its expected rewards). In model-based RL (left), the agent learns, or is given, the transition probabilities between states within an environment and the rewards associated with those states (a ‘model’ of the environment), and uses this information to plan ahead and select the most successful trajectory.
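The model-free case can be made concrete with a minimal temporal-difference (TD(0)) sketch in Python: state values are learned from sampled transitions alone, without any model of transition probabilities. The toy grid, random-walk behaviour, learning rate and discount factor below are illustrative assumptions, not parameters from the models reviewed here.

import numpy as np

rng = np.random.default_rng(0)

n_side = 5                       # toy 5 x 5 grid of locations (states)
n_states = n_side * n_side
goal = n_states - 1              # reward is given only at the goal state
alpha, gamma = 0.1, 0.95         # learning rate and discount factor

V = np.zeros(n_states)           # state-value estimates, learned by trial and error

def step(s):
    # Random-walk transition on the grid; reward 1 only on reaching the goal.
    x, y = divmod(s, n_side)
    dx, dy = [(-1, 0), (1, 0), (0, -1), (0, 1)][rng.integers(4)]
    nx = min(max(x + dx, 0), n_side - 1)
    ny = min(max(y + dy, 0), n_side - 1)
    s_next = nx * n_side + ny
    return s_next, float(s_next == goal)

for episode in range(200):
    s = 0
    while s != goal:
        s_next, r = step(s)
        # TD error: discrepancy between received and predicted reward.
        delta = r + gamma * V[s_next] - V[s]
        V[s] += alpha * delta    # model-free update: no transition model is used
        s = s_next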
Figure 3.
(a) Classical actor–critic architecture for a temporal-difference (TD) agent learning to solve a spatial navigation problem in the watermaze, as proposed by Foster et al. (2000). The state (the agent’s location (x(t), y(t))) is encoded within a neural network whose units mimic place cells in the hippocampus. State information is fed into an actor network, which computes the best direction to take next, and into a critic network, which computes the value of the states encountered. The difference in critic activity between successive states, together with the receipt (or not) of the reward (given at the goal location), is used to compute the TD error δ_t, such that moves leading to a positive TD error become more likely to be taken again in the future, and moves leading to a negative TD error become less likely. Simultaneously, the critic’s estimate of the value function is adjusted to become more accurate. These updates occur through changes to the critic and actor weights, W_t and Z_t, respectively. The goal location, marked as a circle within the maze, is the only location at which a reward is given. (b) Performance of the agent, obtained by implementing the model from Foster et al. (2000). The time the agent requires to reach the goal (‘Latencies’, vertical axis) decreases over trials (horizontal axis) and approaches a minimum after about trial 5. When the goal changes (on trial 20), the agent takes a very long time to adapt to the new goal location.
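The following Python sketch illustrates the kind of actor–critic TD update described above, with the value and direction preferences read out from Gaussian ‘place cell’ activations, in the spirit of Foster et al. (2000). The number of cells, field width, learning rates and eight-direction action set are illustrative assumptions and do not reproduce the original model’s parameters.

import numpy as np

rng = np.random.default_rng(1)

# Illustrative place-cell population tiling a square arena.
n_cells = 100
centres = rng.uniform(-1.0, 1.0, size=(n_cells, 2))
sigma = 0.3                                  # assumed place-field width

def place_activity(pos):
    # Gaussian place-cell activations for position pos = (x, y).
    d2 = np.sum((centres - np.asarray(pos)) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

n_actions = 8                                # eight allocentric directions
W = np.zeros(n_cells)                        # critic weights W_t
Z = np.zeros((n_actions, n_cells))           # actor weights Z_t
alpha_critic, alpha_actor, gamma = 0.05, 0.05, 0.97

def choose_action(pos, beta=2.0):
    # Softmax over the actor's direction preferences at the current location.
    prefs = Z @ place_activity(pos)
    p = np.exp(beta * (prefs - prefs.max()))
    return rng.choice(n_actions, p=p / p.sum())

def actor_critic_update(pos, action, reward, next_pos):
    # One TD update of the critic (value estimate) and actor (preferences).
    # (In a full simulation, the value of the terminal goal state would be
    # held at zero rather than bootstrapped.)
    f, f_next = place_activity(pos), place_activity(next_pos)
    delta = reward + gamma * (W @ f_next) - (W @ f)   # TD error delta_t
    W[:] += alpha_critic * delta * f                  # critic update
    Z[action] += alpha_actor * delta * f              # actor update
    return delta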
Figure 4.
(a) Architecture of the coordinate-based navigation system, which was added to the actor–critic system shown in Figure 3(a) to reproduce accurate spatial navigation based on one-trial place learning, as observed in the watermaze DMP task (Foster et al., 2000). Place cells are linked to coordinate estimators through plastic connections W_t^x and W_t^y. The estimated coordinates (X̂, Ŷ) are used to compare the estimated goal location (X̂_goal, Ŷ_goal) with the agent’s estimated current location (X̂_t, Ŷ_t), in order to form a vector towards the goal, which is followed when the ‘coordinate action’ a_coord is chosen. The new action a_coord is integrated into the actor network described in Figure 3(a). (b, c) Performance of the extended model using coordinate-based navigation. (b) Escape latencies of the agent when the goal location is changed every four trials, mimicking the watermaze DMP task. (c) ‘Search preference’ for the area surrounding the goal location, as reflected by the percentage of time the agent spends in an area centred on the goal location when the second trial to a new goal location is run as a probe trial, with the goal removed (the stippled line indicates the percentage of time spent in the correct zone by chance, that is, if the agent had no preference for any particular area), computed for the second and seventh goal locations. One-trial learning of the new goal location is reflected by the marked latency reduction from trial 1 to trial 2 for a new goal location (without interference between successive goal locations) and by the marked search preference for the new goal location when trial 2 is run as a probe. The data in (b) were obtained by implementing the model of Foster et al. (2000), and the data in (c) by adapting the model to reproduce search preference measures when trial 2 was run as a probe trial. The increase in search preference between the second and seventh goal locations is addressed in the section ‘Limitations of the model in reproducing DMP behaviour in rats and humans’.
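A minimal Python sketch of the coordinate-based extension is given below, reusing the illustrative place-cell population from the previous sketch (redefined here so the block is self-contained). The readout weights are trained with a simple self-motion consistency rule (successive coordinate estimates should differ by the displacement actually made), which is in the spirit of Foster et al. (2000) but simplified; the learning rate and goal-vector normalisation are assumptions.

import numpy as np

# Same illustrative place-cell population as in the previous sketch.
n_cells = 100
centres = np.random.default_rng(1).uniform(-1.0, 1.0, size=(n_cells, 2))
sigma = 0.3

def place_activity(pos):
    d2 = np.sum((centres - np.asarray(pos)) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

Wx = np.zeros(n_cells)           # plastic coordinate-readout weights (x)
Wy = np.zeros(n_cells)           # plastic coordinate-readout weights (y)
alpha_coord = 0.05               # assumed learning rate

def estimate_coords(pos):
    # Estimated coordinates (X_hat, Y_hat) read out from place-cell activity.
    f = place_activity(pos)
    return Wx @ f, Wy @ f

def update_coord_readout(pos, next_pos, dx, dy):
    # Consistency rule: the change in estimated coordinates between successive
    # locations should match the self-motion displacement (dx, dy).
    f, f_next = place_activity(pos), place_activity(next_pos)
    ex = (Wx @ f + dx) - (Wx @ f_next)
    ey = (Wy @ f + dy) - (Wy @ f_next)
    Wx[:] += alpha_coord * ex * f_next
    Wy[:] += alpha_coord * ey * f_next

def coordinate_action(pos, goal_estimate):
    # Unit vector from the estimated current location towards the stored goal
    # estimate: the direction followed when the 'coordinate action' is chosen.
    x_hat, y_hat = estimate_coords(pos)
    vec = np.asarray(goal_estimate, dtype=float) - np.array([x_hat, y_hat])
    return vec / (np.linalg.norm(vec) + 1e-12)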
Figure 5.
(a) Hierarchical RL model. The agent has learned the critic and actor connection weights (Z_j and W_j, respectively) for each goal j (red circles around the maze). Together, the actor and critic networks, as represented in Figure 3(a), form strategy j. A goal prediction error δ_G is used to compute a confidence parameter σ, which measures how well the current strategy reaches the current goal location. The confidence level shapes the degree of exploitation of the current strategy, β, through a sigmoid function of confidence. When the confidence level is very high, the chosen strategy is closely followed, as reflected by a high exploitation parameter β. Conversely, a low confidence level leads to more exploration of the environment. (b) Performance of the hierarchical agent. The model is able to adapt to changing goal locations, as seen in the reduction of latencies to reach the goal.
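The goal-prediction-error-to-confidence-to-exploitation mapping in panel (a) can be sketched in Python as follows. The confidence update rule, sigmoid parameters and softmax strategy selection are assumptions for illustration and are not taken from the hierarchical model itself.

import numpy as np

rng = np.random.default_rng(2)

def update_confidence(sigma, goal_pred_error, lr=0.1):
    # Confidence tracks how well the current strategy predicts reaching the
    # current goal: it rises when the goal prediction error delta_G is small
    # and falls when it is large (mapping and lr are illustrative assumptions).
    accuracy = np.exp(-abs(goal_pred_error))
    return sigma + lr * (accuracy - sigma)

def exploitation(sigma, slope=10.0, threshold=0.5, beta_max=10.0):
    # Exploitation parameter beta as a sigmoid of confidence: high confidence
    # -> the chosen strategy is followed closely; low confidence -> exploration.
    return beta_max / (1.0 + np.exp(-slope * (sigma - threshold)))

def select_strategy(strategy_values, beta):
    # Softmax choice among the stored strategies (one per learned goal j).
    v = np.asarray(strategy_values, dtype=float)
    p = np.exp(beta * (v - v.max()))
    return rng.choice(len(v), p=p / p.sum())

# Example: small goal prediction errors -> confidence grows -> stronger exploitation.
sigma = 0.2
for _ in range(20):
    sigma = update_confidence(sigma, goal_pred_error=0.05)
print(exploitation(sigma))       # approaches beta_max as confidence saturates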

