Experience replay is associated with efficient nonlocal learning

Yunzhe Liu et al. Science. 2021 May 21;372(6544):eabf1357. doi: 10.1126/science.abf1357

Abstract

To make effective decisions, people need to consider the relationship between actions and outcomes, which are often separated in time and space. The neural mechanisms by which disjoint actions and outcomes are linked remain unknown. One promising hypothesis involves neural replay of nonlocal experience. Using a task that segregates direct from indirect value learning, combined with magnetoencephalography, we examined the role of neural replay in human nonlocal learning. After receipt of a reward, we found significant backward replay of nonlocal experience, with a 160-millisecond state-to-state time lag, which was linked to efficient learning of action values. Backward replay and behavioral evidence of nonlocal learning were more pronounced for experiences of greater benefit to future behavior. These findings support nonlocal replay as a neural mechanism for solving complex credit assignment problems during learning.


Conflict of interest statement

Competing interests: None.

Figures

Fig. 1. Experimental design for the model-based reinforcement learning task.
(A) On each trial of the main RL task, subjects were presented with one of three starting arms according to a fixed probability and asked to select one of two alternative paths within that arm. This was followed by a transition through the associated path states and ended with an outcome (£1 or 0). The reward probability of the end states (i.e., X and Y) varied slowly and independently over time. A crucial feature of this task is that the end states are shared across all three arms, which enables non-local learning. Need is manipulated by the starting probability of each arm, shown as colour codes on the left; gain is manipulated by the fluctuating reward probabilities of the end states X and Y. (B) An example of such a drifting reward schedule. The reward probabilities of X and Y change gradually and independently over trials, following Gaussian random walks bounded between 25% and 75%. (C) Each phase of the experiment, shown in order. Subjects learnt the task model before commencing the main RL task. (D) An example trial of the main RL task. The text at the top indicates what subjects needed to do at each time point in the trial; the corresponding task stimuli are shown at the bottom. All photos shown are from pixabay.com and are in the public domain.
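To make the schedule in (B) concrete, here is a minimal sketch of independent bounded Gaussian random walks; the step size, trial count, and clipping rule are assumptions, as the caption does not specify them.

    import numpy as np

    def drifting_reward_schedule(n_trials=200, sigma=0.025, lo=0.25, hi=0.75, seed=0):
        """Independent Gaussian random walks for end states X and Y,
        kept within [lo, hi] by clipping (the paper's boundary rule is assumed)."""
        rng = np.random.default_rng(seed)
        p = np.empty((n_trials, 2))
        p[0] = rng.uniform(lo, hi, size=2)           # independent starting points
        for t in range(1, n_trials):
            step = rng.normal(0.0, sigma, size=2)    # Gaussian increments
            p[t] = np.clip(p[t - 1] + step, lo, hi)  # bounded between 25% and 75%
        return p  # p[t, 0] = P(reward | X), p[t, 1] = P(reward | Y)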
Fig. 2. Behavioural evidence of non-local learning.
(A) An illustration of the sequences of states for local (left, single path) and non-local experience (right, two non-local paths). The black arrows indicate the direction of actual experience; the red arrows indicate the hypothesized direction of credit (i.e., outcome, £1 or 0) assignment after reward receipt, solid red for the local experience and dotted red for the two non-local experiences. (B) Behavioural results. The difference in performance between reward and no reward on non-local paths is a defining feature of non-local learning. Rew/Non indicates whether subjects were rewarded on the last trial. P (same choice) is the probability that subjects, on the current trial, select the path leading to the same end state as on the last trial. Error bars show the 95% standard error of the mean, with each dot indicating results from one subject. * indicates p < 0.05; ** indicates p < 0.01.
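A minimal sketch of the measure in (B), assuming a hypothetical per-trial record with fields 'arm', 'end_state', and 'reward'; non-local learning predicts a higher P(same choice) after reward than after no reward on trials whose starting arm differs from the previous trial's.

    import numpy as np

    def p_same_choice_nonlocal(trials):
        """trials: list of dicts with keys 'arm', 'end_state', 'reward' (0 or 1).
        Returns P(same end state as last trial) after reward vs. after no reward,
        restricted to non-local transitions (current arm != previous arm)."""
        same = {0: [], 1: []}
        for prev, cur in zip(trials[:-1], trials[1:]):
            if cur['arm'] == prev['arm']:
                continue  # local transition; excluded from the non-local contrast
            same[prev['reward']].append(cur['end_state'] == prev['end_state'])
        return np.mean(same[1]), np.mean(same[0])  # (after reward, after no reward)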
Fig. 3. Multivariate stimulus decoding.
(A) Examples of multivariate whole-brain neural activity used for classifier training (e.g., girl, house, and zebra). (B) Example performance of the "house" classifier (red) plotted against the other 17 stimulus classifiers when the "house" picture was presented. The mapping between visual stimuli and the states they indexed was randomised across subjects. (C) Mean decoding results across all subjects. The temporal generalisation plot is on the left panel, with the Y axis indicating the time bins (10 ms each) on which the classifiers were trained and the X axis indicating the test time. The right panel plots the diagonal of the temporal generalisation matrix, namely the decoding accuracy obtained when testing at the same time point on which the classifiers were trained. The dotted line is the permutation threshold. Mean performance for each individual state is shown in Supplementary Figure 1; data for each subject are shown in Supplementary Figure 3A.
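A minimal sketch of per-time-bin decoding with temporal generalisation as in (C), assuming epoched sensor data of shape (trials, sensors, time bins) and stimulus labels; the classifier family, regularisation, and simple split-half validation are illustrative stand-ins for the paper's pipeline.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def temporal_generalisation(X, y):
        """X: (n_trials, n_sensors, n_times) epoched data; y: stimulus labels.
        Returns an (n_times, n_times) accuracy matrix: rows index training
        time bins, columns index test time bins."""
        n_trials, n_sensors, n_times = X.shape
        half = n_trials // 2            # split-half; use cross-validation in practice
        acc = np.zeros((n_times, n_times))
        for t_train in range(n_times):
            clf = LogisticRegression(max_iter=1000)
            clf.fit(X[:half, :, t_train], y[:half])
            for t_test in range(n_times):
                acc[t_train, t_test] = clf.score(X[half:, :, t_test], y[half:])
        return acc  # the diagonal acc[t, t] corresponds to the right panel of (C)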
Fig. 4. Sequential replay of experiences during reward receipt.
(A) An illustrative example trial from the main RL task (subject 14, trial 107). On the left panel, the subject selected the A1->A2->A3 path, rendering A1->A2->A3 the local experience and C1->C2->C3 and E1->E2->E3 the two non-local experiences on this trial. On the right panel, the state decoding matrix at outcome receipt (e.g., getting £1 in X) is shown, along with the gain estimates for the two non-local paths. Backward 160 ms lag sequences for both the C1->C2->C3 and E1->E2->E3 paths, and a forward 30 ms lag sequence for A1->A2->A3, are depicted. For visualization purposes, the reactivation strength of each state is max-normalised. Each time bin is 10 ms. (B) Sequence analysis at outcome receipt shows two distinct signatures: a forward sequence (blue) with a 20-30 ms state-to-state time lag (left panel) and a backward sequence (red) with a 130-170 ms time lag (right panel). The X axis shows the time lag; the Y axis shows the evidence of sequence strength. (C) Contrast between backward and forward sequences across the computed time lags (i.e., speeds). In this contrast, the forward sequence peaked at a 30 ms time lag and the backward sequence at a 160 ms time lag; these time points were therefore selected for all later analyses. The dotted line is the permutation threshold after controlling for multiple comparisons.
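The lag-resolved sequence measure in (B) and (C) can be sketched as a lag-specific regression of decoded state reactivations onto their own past, contrasting task transitions against non-transitions, loosely in the spirit of temporally delayed linear modelling; this simplification omits the autocorrelation and confound controls a full analysis would need, and all parameter choices are assumptions.

    import numpy as np

    def sequenceness(R, T, max_lag=20):
        """R: (n_timepoints, n_states) decoded reactivations in 10 ms bins.
        T: (n_states, n_states) binary matrix of task transitions.
        Returns forward and backward sequence evidence for lags 1..max_lag."""
        off = ~np.eye(T.shape[0], dtype=bool)  # exclude self-transitions
        fwd, bwd = np.zeros(max_lag), np.zeros(max_lag)
        for lag in range(1, max_lag + 1):
            # empirical coupling from state(t) to state(t + lag)
            B = np.linalg.pinv(R[:-lag]) @ R[lag:]
            fwd[lag - 1] = B[T == 1].mean() - B[(T == 0) & off].mean()
            bwd[lag - 1] = B[T.T == 1].mean() - B[(T.T == 0) & off].mean()
        return fwd, bwd  # peaks near lag 3 (30 ms) and lag 16 (160 ms) would match (C)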
Fig. 5. Representational and physiological differences between the two types of replay.
(A) The 30 ms forward sequence is likely to encode local, but not non-local, experience. (B) The 160 ms backward replay encodes non-local as opposed to local experience. (C) The initialization of the 30 ms forward sequence is associated with a power increase in a ripple frequency band (80-180 Hz), but this is not the case for the 160 ms backward sequence; these frequency power signatures differ significantly. The grey lines connect results from the same subject. Error bars show the 95% standard error of the mean, with each dot indicating results from one subject. * indicates p < 0.05; ** indicates p < 0.01.
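A minimal sketch of the ripple-band measure in (C), assuming a single sensor time series sampled fast enough to resolve 80-180 Hz and a list of sequence-onset sample indices; the filter order and window length are illustrative.

    import numpy as np
    from scipy.signal import butter, filtfilt, hilbert

    def ripple_band_power(sig, fs, onsets, win=50):
        """Mean 80-180 Hz power in a window of `win` samples after each onset.
        sig: 1-D sensor time series; fs: sampling rate in Hz (must exceed 360)."""
        b, a = butter(4, [80 / (fs / 2), 180 / (fs / 2)], btype='band')
        envelope = np.abs(hilbert(filtfilt(b, a, sig)))  # analytic amplitude
        power = envelope ** 2
        return np.mean([power[t:t + win].mean() for t in onsets])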
Fig. 6. Prioritisation of non-local replay.
(A) The 160 ms backward sequence is replayed to a greater degree for the higher-priority non-local path than for the lower-priority one; the 30 ms forward replay does not differentiate between the two non-local paths. Error bars show the 95% standard error of the mean, with dots indicating results from each subject; grey lines connect results from the same subject. * indicates p < 0.05; ** indicates p < 0.01. (B) Sequence strength of the 30 ms lag replay does not correlate with task performance (left panel). By contrast, there is a significant positive correlation between the 160 ms lag replay and task performance across subjects (right panel). Each dot indicates the result from one subject. The solid line reflects the best robust linear fit. The dotted line indicates the chance-level reward rate per trial under random choices.
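A minimal sketch of the across-subject brain-behaviour relationship in (B), assuming one sequenceness value and one mean reward rate per subject; the choice of a Pearson correlation with a Huber robust fit is an illustrative stand-in for the paper's robust linear fit.

    import numpy as np
    from scipy.stats import pearsonr
    from sklearn.linear_model import HuberRegressor

    def replay_behaviour_fit(seq_strength, reward_rate):
        """seq_strength, reward_rate: 1-D arrays, one value per subject.
        Returns the correlation, its p value, and the robust fit's slope."""
        r, p = pearsonr(seq_strength, reward_rate)
        robust = HuberRegressor().fit(np.asarray(seq_strength).reshape(-1, 1),
                                      reward_rate)
        return r, p, robust.coef_[0]  # slope of the solid line in (B), right panel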
