Exploring replay

Georgy Antonov et al.

Nat Commun. 2025 Feb 15;16(1):1657. doi: 10.1038/s41467-025-56731-y.

Abstract

Animals face uncertainty about their environments due to initial ignorance or subsequent changes. They therefore need to explore. However, the algorithmic structure of exploratory choices in the brain remains largely elusive. Artificial agents face the same problem, and a venerable idea in reinforcement learning is that they can plan appropriate exploratory choices offline, during the equivalent of quiet wakefulness or sleep. Although offline processing in humans and other animals, in the form of hippocampal replay and preplay, has recently been the subject of highly informative modelling, existing methods only apply to known environments. Thus, they cannot predict exploratory replay choices during learning and/or behaviour in the face of uncertainty. Here, we extend an influential theory of hippocampal replay and examine its potential role in approximately optimal exploration, deriving testable predictions for the patterns of exploratory replay choices in a paradigmatic spatial navigation task. Our modelling provides a normative interpretation of the available experimental data suggestive of exploratory replay. Furthermore, we highlight the importance of sequence replay, and our results license a range of new experimental paradigms that should further our understanding of offline processing.


Conflict of interest statement

Competing interests: The authors have no competing interests.

Figures

Fig. 1
Fig. 1. Example planning tree for scheduling replay updates in MAB belief space.
a A 2-arm bandit problem visualised as a planning tree with horizon 2. Rectangles correspond to belief states, and the tree is rooted at the agent's current prior belief state, bρ (leftmost belief state). The beliefs about the bandit associated with each belief state are shown as probability distributions over the potential payoff probabilities in curly brackets. The red dotted lines show the expected (mean) payoff for each belief. Note that the agent's prior belief state indicated a slightly higher expected payoff for the known arm a2 (E[μ1 | bρ] = 0.50 vs. μ2 = 0.51); however, the agent was largely uncertain about the payoff probability associated with the unknown arm a1. Actions (pulling arm a1 or a2) are depicted as black arrows originating at the prior belief state. The upper arrow corresponds to pulling arm a1 and the lower to pulling arm a2. Note that pulling arm a2 reveals no new information to the agent, since the expected payoff associated with arm a2 is known with certainty (hence the successor belief state is simply bρ). By contrast, pulling arm a1 can yield new knowledge in the form of two posterior belief states: one resulting from receiving an imaginary reward (upper rectangle, bs) and the other from receiving no reward (lower rectangle, bf). The estimated Q-values associated with the two actions, that is, the total discounted reward the agent expects to accrue in the long run for performing each of those actions, are shown in red, and the certainty-equivalent value estimates of the belief states at the final horizon are shown with purple numbers. Numbers within belief states show the exploratory Need associated with those belief states; Need is additionally indicated by the intensity of the colour of each belief state. γ is the discount factor, T is the belief-state transition model, i is the horizon at which the belief state being updated resides, and πold is the agent's current behavioural policy in the tree. The agent's behavioural policy in the tree was a softmax with inverse temperature β = 2. The exploratory Gain associated with the two potential replay updates is written next to the corresponding action arrows. The potential replay update to the value of the unknown arm a1 in this example bandit resulted in a higher Q-value estimate for that arm even though the agent's prior belief state indicated a lower expected immediate payoff for it. This is because the replay update propagated the exploratory value of what the agent could learn about the unknown arm in the future (the resulting posterior belief states), and of how that information could subsequently be exploited (the certainty-equivalent future return associated with those posterior belief states). The exploratory Gain for this replay update was therefore estimated to be higher. b Planning tree with the same horizon for the case where the agent has no uncertainty about the payoff probabilities of the two arms and assumes they are μ1 = 0.50 and μ2 = 0.51 (thus the same expected payoffs as in (a)). This corresponds to the setup studied by Mattar & Daw, in which the agent does not account for how its belief state might change in the light of potential future learning. Note that in this case the agent estimated larger Gain for arm a2, which had the higher estimated immediate payoff, and therefore the resulting choice of replay could only lead to exploitation of the agent's current knowledge.
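To make the quantities in Fig. 1a concrete, the following minimal Python sketch (not the authors' code; the variable names and the uniform Beta(1, 1) prior are illustrative assumptions) shows how the belief about the unknown arm a1 can be represented, how an imagined pull splits it into the posterior belief states bs and bf, and how a softmax behavioural policy with inverse temperature β = 2 would act on the resulting Q-value estimates.

import numpy as np

def expected_payoff(alpha, beta):
    # Mean of a Beta(alpha, beta) belief over the unknown payoff probability mu1.
    return alpha / (alpha + beta)

def imagined_pull(alpha, beta):
    # Posterior belief states after imagining a pull of the unknown arm a1:
    # success increments alpha (belief state b_s), failure increments beta (b_f).
    return (alpha + 1, beta), (alpha, beta + 1)

def softmax_policy(q_values, inv_temp=2.0):
    # Softmax behavioural policy with inverse temperature beta (beta = 2 in Fig. 1a).
    z = inv_temp * (np.asarray(q_values) - np.max(q_values))
    p = np.exp(z)
    return p / p.sum()

# Prior belief state b_rho: uniform Beta(1, 1) belief about arm a1, known payoff 0.51 for arm a2.
alpha0, beta0 = 1.0, 1.0
q = [expected_payoff(alpha0, beta0), 0.51]      # expected immediate payoffs of [a1, a2]
print(softmax_policy(q))                        # behavioural policy at b_rho
print(imagined_pull(alpha0, beta0))             # posterior belief states b_s and b_f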
Fig. 2
Fig. 2. Replay in MAB belief space.
a Same 2-arm bandit problem as in Fig. 1a, visualised as a planning tree with horizon 3. The notation is the same as in Fig. 1. b The first replay update executed by the agent is highlighted in green. Note also the new updated Q-value associated with the updated action (red number), as well as the changes in the exploratory Need associated with all belief states (due to the newly updated behavioural policy in the tree). Here, the agent's behavioural policy in the tree was a softmax with β = 1, and therefore all belief states had positive exploratory Need (rounded to 2 decimal places). c All replay updates executed by the agent are shown in green (the replay order is written on top of the updated actions). The agent executed only 5 replay updates because no further replay update had an estimated benefit exceeding the threshold ξ. d–f Effects of replay updates on (top) the value of the agent's prior belief state; (middle) the value of the newly updated behavioural policy evaluated in the tree; and (bottom) the number of executed replay updates as a function of the threshold ξ, for planning trees of varying depth (horizon) and behavioural policies with different inverse temperature parameters. In all cases, the value of the prior belief state as well as that of the evaluated policy approached the Bayes-optimal value for the corresponding horizon (BO value; black dotted line), which is the full dynamic programming solution to the belief tree. We additionally show the optimal certainty-equivalent value (CE value; red dotted line), which corresponds to the original formulation of Mattar & Daw with uncertainty collapsed.
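A minimal sketch of the replay-scheduling loop that the caption describes, assuming (as in Mattar & Daw's prioritisation) that each candidate update is scored by the product of its Gain and Need and that replay stops once the best remaining score falls below the threshold ξ. The callables compute_gain, compute_need and apply_update are placeholders, not functions from the paper's code.

def schedule_replay(candidates, compute_gain, compute_need, apply_update, xi):
    # Greedily execute replay updates while their estimated benefit exceeds xi.
    executed = []
    while True:
        # Score every candidate update under the agent's current policy in the tree.
        scored = [(compute_gain(c) * compute_need(c), c) for c in candidates]
        best_benefit, best = max(scored, key=lambda pair: pair[0])
        if best_benefit <= xi:       # no remaining update is worth executing
            return executed
        apply_update(best)           # updating a Q-value changes the policy, so
        executed.append(best)        # Gain and Need are re-estimated on the next pass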
Fig. 3
Fig. 3. Exploitative replay can result in suboptimal behaviour.
a Normalised state occupancy of the agent during the first 2000 moves of exploration and learning in the environment. The start state is located at the bottom (shown with the white letter 'S') and the goal state is shown with the yellow clover. The barriers are shown as opaque blue lines. Importantly, none of the barriers was bidirectional, and hence each could only be learnt about when the agent attempted to cross it from the adjacent state below. All states were visited by the agent, including those beside the barriers (darker blue corresponds to higher occupancy). b Normalised maximal Gain that the agent estimated for the replay of each action (depicted with triangles), averaged across all 2000 moves. Only those actions for which the Gain was estimated to be positive are shown (darker red corresponds to higher Gain). The actions which the agent would replay (a subset of the actions with positive estimated Gain) yielded a more exploitative policy, which helped the agent acquire reward at a higher rate. c Normalised maximal Need for each state that the agent estimated, also averaged over those same 2000 moves. All values were additionally averaged over 10 simulations. Darker orange corresponds to higher Need. d–f Same as (a–c) but for an additional 2000 moves during which the top barrier was removed. Note that the estimated Gain did not change. Moreover, the state occupancy profile in (d), as well as the estimated Need in (f), highlight how the agent's behaviour reduced to pure exploitation. Because of the environmental change, however, this behaviour was suboptimal: a shorter path now existed which the agent did not discover.
Fig. 4
Fig. 4. Example planning tree for scheduling replay updates in joint belief-physical state space.
The logic of the belief tree is the same as that of Fig. 1, with one critical difference: each belief state now comprises the agent's physical location in the maze (shown with the green dot), as well as its belief about the presence of an impassable barrier in the maze (the barrier is shown with the thick blue line). The arrows in the maze show the available actions and are coloured according to their estimated Q-values. The actions with the highest estimated Q-values are shown with white outlines. The state values are additionally shown in purple. Thus, if the agent decides not to attempt crossing the barrier (action 'down', bottom arrow), then it transitions into a new belief state which corresponds to a new physical state and the same belief about the barrier, since the barrier was not attempted and thus nothing was learnt about it. By contrast, the action 'up' (top arrow) can result in two new belief states: the agent transitioning behind the barrier and thus discovering that it is absent (bs), or the agent remaining in the same physical state and discovering that the barrier is present (bf). Note that the agent's prior belief state indicated a high expected probability that the barrier was absent. Moreover, the exploratory choice of action 'up' could result in a transition to a physical state which happened to have a high estimated value (the physical state associated with belief state bs), and therefore the exploratory Gain associated with updating that action was estimated to be high.
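The joint belief-physical state of Fig. 4 can be sketched as follows (an illustrative assumption about the representation, not the paper's implementation): a belief state pairs the agent's maze location with its belief that the barrier is absent, and only the action that attempts the barrier splits the belief state into the two posteriors bs and bf.

from dataclasses import dataclass

@dataclass(frozen=True)
class BeliefState:
    location: tuple        # physical state in the maze, e.g. (row, column)
    p_open: float          # believed probability that the barrier is absent

def transition(b, action, step):
    # Return a list of (probability, successor belief state) pairs for one action.
    # `step` is a hypothetical helper mapping (location, action) to the next location.
    if action == "up":     # attempting to cross the barrier
        b_s = BeliefState(step(b.location, "up"), 1.0)   # crossed: barrier absent
        b_f = BeliefState(b.location, 0.0)               # blocked: barrier present
        return [(b.p_open, b_s), (1.0 - b.p_open, b_f)]
    # Any other move changes the location but leaves the barrier belief untouched,
    # since nothing new is learnt about the barrier.
    return [(1.0, BeliefState(step(b.location, action), b.p_open))]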
Fig. 5
Fig. 5. Exploratory replay leads to online discoveries, but potentially inadequate promulgation.
a Prior state of knowledge of the agent. The intensity of the (red-scale) colour of each action arrow shows the respective model-free Q-value. Collectively, the action values represent the agent's model-free behavioural policy (i.e., the agent is more likely to choose actions with higher estimated Q-values, which at each state are highlighted with white outlines). Similarly, the states are coloured according to the maximal model-free Q-value at each state (which corresponds to the state values, shown in purple). The inset next to the top barrier indicates the agent's prior belief about its presence (for the other barrier, the agent was certain that the path was blocked). The red dotted line in the inset shows the expected probability that the barrier is absent. The agent itself (green dot) is located at the start state. The goal state with reward is denoted with the yellow clover. b Changes in the agent's model-free policy occasioned by exploratory replay updates. Here, the colour intensity shows the amount of change engendered by each replay update. The numbers next to each action arrow indicate the order in which the replay updates were executed. c New model-free policy which resulted from the exploratory replay updates in (b). Note how the action values now indicate that the agent should go towards the upper barrier (highlighted with white outlines). d After pursuing the exploratory policy, the agent attempted to cross the top barrier; unfortunately, the barrier was found to be present, as indicated both by the agent's model-free Q-value for that action, which was learnt online, and by its new belief. e–f Same as in (b–c) but after the online discovery of the present barrier in (d). Rather than propagating the negative information about the barrier back towards the start state, and hence correcting the exploratory policy in the light of the new information, the agent's replay choice made it more likely to visit an adjacent state which still carried the previously propagated exploration bonus, and hence had a high value that was erroneous given the agent's new knowledge.
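For concreteness, a single replay update of the kind counted in panels (b) and (e) can be sketched as a one-step, Dyna-style backup of one model-free Q-value from the agent's (possibly imagined) model; this form is an assumption made for illustration rather than the paper's exact update rule. Because only the replayed state-action pair changes, an isolated update near the barrier need not carry the new information back to the start state.

import numpy as np

def replay_backup(Q, model, s, a, gamma=0.9):
    # One-step replay update of the model-free value Q[s, a].
    # `model(s, a)` is a hypothetical helper returning (probability, reward, next state) triples.
    backup = 0.0
    for prob, reward, s_next in model(s, a):
        backup += prob * (reward + gamma * np.max(Q[s_next]))
    Q[s, a] = backup
    return Q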
Fig. 6
Fig. 6. Sequence replay helps deep value propagation.
The layout of the figure is the same as in Fig. 5. a–c show the agent's initial and uncertain state of knowledge, the changes to the online behavioural policy occasioned by exploratory replay, and the new updated exploratory policy due to such replay, respectively. The crucial difference is that the replay in (b) was a sequence event, i.e., the whole chain of actions was updated simultaneously (the actions which were updated in the replayed sequence are linked by a green line; the green triangles along that line additionally indicate the reverse direction of the replayed sequence). d–f Again, the agent discovered the top barrier, learnt about its presence online, and engaged in replay to recompile its model-free behavioural policy in the light of the negative information. Note how, in this case, the sequence replay in (e) resulted in deep propagation of the value of that information all the way back to the start state. The sequence replay thus enabled the agent to correct its exploratory policy appropriately, as shown in (f).
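The sequence event of Fig. 6 can be sketched as the same one-step backup applied to a whole trajectory in reverse order (again an assumed, Dyna-like formalisation): because the deepest state is updated first, value discovered at the far end of the chain propagates back to the start state within a single replay event, which a set of isolated updates would not achieve.

import numpy as np

def sequence_replay(Q, model, trajectory, gamma=0.9):
    # Backup each (state, action) pair of a replayed trajectory in reverse order,
    # so that earlier backups can use the values just refreshed by later ones.
    # `model(s, a)` is the same hypothetical outcome model as in the sketch above.
    for s, a in reversed(trajectory):
        backup = 0.0
        for prob, reward, s_next in model(s, a):
            backup += prob * (reward + gamma * np.max(Q[s_next]))
        Q[s, a] = backup
    return Q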
Fig. 7
Fig. 7. Replay in a blocked corridor.
a Initial state of knowledge of the agent. The agent's belief state comprised its uncertainty about the presence of the top and bottom barriers that create the corridor. Note that the model-free Q-values in the blocked corridor are all initialised to 0, thus mimicking the agent's inexperience with that segment. Additionally, here the barrier which blocked the entrance to the corridor was bidirectional, so that the state above it had at least two actions (for the replay to be able to improve the policy at that state; see Methods for details). b Replay choices of the agent due to its initial and uncertain state of knowledge. Note that the sequence replay event (numbered '1') was a single pass through the shown actions but was performed across two different belief states: the action updates inside the corridor (top) corresponded to a different belief state, since they followed the potential transition through the bottom barrier, which the agent first had to learn about (bottom). The single sequence replay in this example is split into two panels to show the two different belief states explicitly. The order of the replay updates is shown in the bottom panel. c New exploratory policy occasioned by the replay updates in (b). d Same setup as above, but simulating offline rest replay in the T-maze experiment of Olafsdottir et al. The top row shows the initial state of knowledge of the agent. In the actual experiment, 'Rest 1' replay events were measured before the animals' experience of the environment, and during 'Run 1' they explored the central stem, which was blocked by a see-through barrier. In 'Run 1', none of the arms contained a visible reward (depicted with unfilled yellow clovers). No detectable replay was observed in the two arms during the 'Rest 1' condition. 'Rest 2' replay events were measured during a rest period after a visible reward was placed in the 'cued' arm (filled yellow clover) but before the animals could experience it (i.e., before the barrier was removed). Note that we rendered the see-through barrier as potentially permeable (as reflected in the agent's uncertain belief), so that during rest the agent could contemplate the possibility of crossing it and obtaining the reward. The bottom row shows the resulting exploratory policy after the agent was allowed to replay with the knowledge of the reward in the cued arm. This new policy resulted from replay only in the cued arm. Note that, as in (b), such replay was performed in a different belief state (corresponding to learning that the barrier was open) than the agent's prior belief state, and thus could potentially only be detected after the actual experience. Data re-plotted from Fig. 2d in Olafsdottir et al. Yellow dotted lines show the chance detection level. ns, not significant; ***p < 0.001 (derived from a binomial test, see ref. for details).
