Nat Commun. 2023 Jul 8;14(1):3988. doi: 10.1038/s41467-023-39536-9.

Reinforcement learning establishes a minimal metacognitive process to monitor and control motor learning performance

Taisei Sugiyama et al.

Abstract

Humans and animals develop learning-to-learn strategies throughout their lives to accelerate learning. One theory suggests that this is achieved by a metacognitive process of controlling and monitoring learning. Although such learning-to-learn is also observed in motor learning, the metacognitive aspect of learning regulation has not been considered in classical theories of motor learning. Here, we formulated a minimal mechanism of this process as reinforcement learning of motor learning properties, which regulates a policy for memory update in response to sensory prediction error while monitoring its performance. This theory was confirmed in human motor learning experiments, in which the subjective sense of learning-outcome association determined the direction of up- and down-regulation of both learning speed and memory retention. Thus, it provides a simple, unifying account for variations in learning speeds, where the reinforcement learning mechanism monitors and controls the motor learning process.


Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Meta-learning theory and paradigm.
A Motor learning as a sequential decision-making process. The action u(k) updates the memory and sensory prediction error states {x(k), e(k)} to the next states {x(k+1), e(k+1)} and generates a reward r(x(k+1)) in the given environment (p: perturbation). The action u(k) responds to {x(k), e(k)}, is characterized by the meta-parameter θ and influenced by memory noise n_x(k), i.e., it is drawn from a policy distribution u(k) ~ π_θ(u(k) | x(k), e(k)). This aligns with previous models of error-based motor learning, x(k+1) = α x(k) + β e(k) + n_x(k) (α: retention rate, β: learning rate), when the learner has a linear policy function. B The primary hypothesis of this study is that the meta-parameter θ = [α, β]^T is updated by a reinforcement learning rule (policy gradient) to maximize rewards and minimize punishments: θ ← θ + ∇_θ log π_θ(u(k) | x(k), e(k)) · r(x(k+1)). C Simulated change of meta-parameters under the two opposite reward functions. Reward is given for learning in “Promote” (magenta) and for not learning in “Suppress” (cyan). Reinforcement learning upregulates θ = [α, β]^T to learn faster in Promote, whereas it downregulates them to learn slower in Suppress. a.u. = arbitrary unit. D Meta-learning training. Learners experience a sensory-error (E) trial in which the sensory prediction error e is induced by cursor rotation p while the task error is clamped (TE clamp). Subsequently, they experience a reward (R) trial in which the updated memory u, manifested as an aftereffect h = Tx, is evaluated with the reward function r. Promote and Suppress were implemented by linking the aftereffect and reward in opposite directions. Reward is delivered as a numerical score associated with monetary reward. E The task schedule. Learners repeat meta-learning training comprising pairs of E and R trials and Null trials (in which veridical cursor feedback is given). After every six repetitions of training, they perform a probe task, developed from previously established motor learning paradigms, to estimate learning parameters. The simulated reach behavior and changes in θ are plotted for Promote and Suppress. F The task is separated into four blocks, and behavior is analyzed block by block. The first block, marked in pink, is the baseline condition in which the score is absent in R trials.
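To make the rule in panels A–B concrete, here is a minimal simulation sketch in Python (not the authors' code): a Gaussian policy over the memory update whose meta-parameters θ = [α, β] are adjusted by the policy-gradient (REINFORCE) rule above, under the Promote and Suppress reward functions. The parameter values, the Gaussian policy with fixed noise SD, the reward shapes, the simplified washout on Null trials, and the running reward baseline are my assumptions, for illustration only.

```python
# Minimal sketch of the Fig. 1 meta-learning rule (illustrative assumptions throughout).
import numpy as np

def train(promote, n_pairs=400, eta=5e-4, seed=0):
    rng = np.random.default_rng(seed)
    alpha, beta = 0.9, 0.3       # meta-parameters theta = [alpha, beta] (assumed start values)
    sigma = 0.5                  # SD of memory noise n_x (assumed)
    p = 10.0                     # cursor rotation (deg) inducing the sensory prediction error
    x, baseline = 0.0, 0.0       # motor memory state; running reward baseline (my addition)
    for _ in range(n_pairs):
        e = -p                                   # E trial: error set by the rotation (simplified)
        mu = alpha * x + beta * e                # mean of the policy pi_theta(u | x, e)
        u = mu + sigma * rng.standard_normal()   # sampled memory update (action)
        # R trial: the aftereffect (proportional to the updated memory) is scored.
        h = abs(u)
        r = h if promote else -h                 # Promote rewards learning, Suppress rewards not learning
        adv = r - baseline                       # baseline subtraction = standard variance reduction
        baseline += 0.1 * (r - baseline)
        score = (u - mu) / sigma**2              # d/d(mu) of log N(u; mu, sigma^2)
        alpha = float(np.clip(alpha + eta * adv * score * x, 0.0, 1.0))  # grad wrt alpha: score * x
        beta = float(np.clip(beta + eta * adv * score * e, 0.0, 1.0))    # grad wrt beta: score * e
        x = 0.1 * u              # Null trials wash out most of the memory (simplified)
    return alpha, beta

for label, promote in (("Promote", True), ("Suppress", False)):
    a, b = train(promote)
    print(f"{label}: final retention alpha ≈ {a:.2f}, learning rate beta ≈ {b:.2f}")
```

Run as-is, the sketch should drive β (and, weakly, α) up under Promote and β down toward zero under Suppress, qualitatively mirroring panel C; the exact values depend entirely on the assumed settings, and the crude washout leaves α nearly unchanged in Suppress.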
Fig. 2. Changes in behavior and estimated learning parameters with negative scores (Experiment 1).
A Memory profiles of the two groups in Blocks 1 and 4 of meta-learning training with opposite reward functions (Promote, magenta, and Suppress, cyan). B Changes during blocks and estimated linear slopes in the initial/accumulated memory updates in the first/last R trials. C Changes during blocks and estimated linear slopes of the average score per trial. D Memory profiles of the two groups in Blocks 1 and 4 of Probe. E Changes over blocks and estimated linear slopes in the initial memory update in the second E trial with rotation and the accumulation and retention of memory, measured as the average of all error-clamp trials. F Changes during blocks and estimated linear slopes in estimated learning parameters. For all panels, thick lines/dots/circles and error bars/shading represent group means and SEMs, except the error bars in F, where box and whisker represent the 25–75th and 2.5–97.5th percentiles of the posterior density estimated by the Markov Chain Monte Carlo (MCMC) method. Faded lines represent individual participants’ data in (B), (C), and (E). Each mean and SEM is calculated from data of 20 human participants per group (N = 40 in total). NS indicates “Not Significant” between Promote and Suppress at the pre-perturbation baseline trial (two-sided Wilcoxon rank sum test for Training Block 1: p = 0.95 and Block 4: p = 0.19, Probe-Block 1: p = 0.60 and Block 4: p = 0.90).
Fig. 3. Changes in behavior and estimated learning parameters with positive scores (Experiment 2).
A Memory profiles of the two groups in Blocks 1 and 4 of meta-learning training with opposite reward functions (Promote, magenta, and Suppress, cyan). B Changes during blocks and estimated linear slopes in the initial/accumulated memory updates in the first/last R trials. C Changes in memory size during blocks and estimated linear slopes in the average score per trial. D Memory (aftereffect) profiles of the two groups in Blocks 1 and 4 of Probe. E Changes during blocks and estimated linear slopes in the initial memory update in the second E trial with rotation and the accumulation and retention of memory, measured as the average of all error-clamp trials. F Changes over blocks and estimated linear slopes in estimated learning parameters. G Meta-learning rates (η) for each experiment and the difference between punishments and rewards (difference between Experiments 1 and 2). For all panels, lines/dots/circles and error bars/shading represent group means and SEMs, except the error bars in (F) and (G), where box and whisker represent the 25–75th and 2.5–97.5th percentiles of the posterior density estimated by the MCMC method. Faded lines represent individual participants’ data in (B), (C), and (E). Each mean and SEM is calculated from data of 20 human participants per group (N = 40 in total for A–F, N = 80 for G). NS indicates “Not Significant” (two-sided Wilcoxon rank sum test for Training Block 1: p = 0.70 and Block 4: p = 0.53, Probe-Block 1: p = 0.49 and Block 4: p = 0.23).
Fig. 4. Replication of previous reports on changes in the speed of motor learning by simulation with the present Meta-learning model.
Assuming that task error acts as punishment feedback, the present meta-learning model replicated previous reports. A, B Changes in error-sensitivity (speed of learning, i.e., β) in environments with different probabilities of flip in the perturbation direction, which shape the history of errors, reported in the original study (A) and simulated by the model (B). Sensitivity increased in a stable environment (z = 0.9, red) and decreased in a rapidly changing environment (z = 0.1, blue). Sensitivity did not show apparent changes at medium stability (z = 0.5, green). The simulated memory profile at the end of the task is also plotted for each condition (B, left). C, D Effects of manipulating task error on motor learning, reported in the original study (C) and simulated by the model (D). Acceleration of learning (savings, blue) disappeared when a task error was randomly given (green) or removed (red). E, F Effects of reward/punishment on trajectory error, reported in the original study (E) and simulated by the model (F). The inset shows the meta-learning rates used for the simulation. Learning accelerated only in adaptation to punishment (black), compared to adaptation to reward (red) or random positive (blue). For all panels, lines/dots and error bars/shaded areas indicate the mean and SEM. a.u. = arbitrary unit. Panel (A) is from David J. Herzfeld et al., A memory of errors in sensorimotor learning. Science 345, 1349–1353 (2014). DOI: 10.1126/science.1253138. Reprinted with permission from AAAS. Panel (C) is adapted with permission from Leow, L. A., Marinovic, W., de Rugy, A. & Carroll, T. J. Task errors drive memories that improve sensorimotor adaptation. J. Neurosci. 40(15), 3075–3088 (2020). doi: 10.1523/JNEUROSCI.1506-19.2020. https://creativecommons.org/licenses/by/4.0/. Panel (E) is reproduced with permission from Springer Nature: Galea, J., Mallia, E., Rothwell, J. et al. The dissociable effects of punishment and reward on motor learning. Nat. Neurosci. 18, 597–602 (2015). doi: 10.1038/nn.3956.
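As a companion to panels A and B, the sketch below shows how the same policy-gradient rule, with task error treated as punishment (r = −|error|), tends to raise error-sensitivity β when the perturbation direction is stable and lower it when it flips often. I read z as the probability that the perturbation keeps its sign on the next trial (so z = 0.9 is the stable environment, matching the caption); that reading, the parameter values, the reward shape, the baseline, and the seed-averaging are all assumptions, and the intermediate-stability case is not simulated here.

```python
# Sketch: environment stability shaping error-sensitivity under the same policy-gradient rule.
# Only beta is adapted, matching the error-sensitivity measure in panels A-B.
import numpy as np

def run(stay_prob, n_trials=3000, eta=2e-4, seed=0):
    rng = np.random.default_rng(seed)
    alpha, beta, sigma = 0.9, 0.3, 0.5        # assumed retention, initial sensitivity, memory noise SD
    p, x, baseline = 10.0, 0.0, 0.0           # perturbation (sign may flip), memory state, reward baseline
    for _ in range(n_trials):
        e = -p - x                            # sensory prediction error on this trial
        mu = alpha * x + beta * e             # mean of pi_theta(u | x, e)
        u = mu + sigma * rng.standard_normal()
        x = u                                 # updated motor memory
        if rng.random() > stay_prob:          # environment may flip the perturbation direction
            p = -p
        r = -abs(-p - x)                      # task error after the update, treated as punishment
        adv = r - baseline                    # running baseline = variance reduction (my addition)
        baseline += 0.1 * (r - baseline)
        beta = float(np.clip(beta + eta * adv * (u - mu) / sigma**2 * e, 0.0, 1.0))
    return beta

for z in (0.9, 0.1):                          # stable vs. rapidly switching environment
    finals = [run(z, seed=s) for s in range(10)]
    print(f"stay probability z = {z}: mean final beta ≈ {np.mean(finals):.2f} (start 0.3)")
```

Averaging over a handful of seeds only smooths the single-run noise of the REINFORCE estimator; the qualitative direction of change (up for the stable environment, down for the rapidly switching one) is the point of the sketch.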

