Nat Commun. 2023 Jul 8;14(1):3988. doi: 10.1038/s41467-023-39536-9.

Reinforcement learning establishes a minimal metacognitive process to monitor and control motor learning performance

Taisei Sugiyama et al.

Abstract

Humans and animals develop learning-to-learn strategies throughout their lives to accelerate learning. One theory suggests that this is achieved by a metacognitive process of controlling and monitoring learning. Although such learning-to-learn is also observed in motor learning, the metacognitive aspect of learning regulation has not been considered in classical theories of motor learning. Here, we formulated a minimal mechanism of this process as reinforcement learning of motor learning properties, which regulates a policy for memory update in response to sensory prediction error while monitoring its performance. This theory was confirmed in human motor learning experiments, in which the subjective sense of learning-outcome association determined the direction of up- and down-regulation of both learning speed and memory retention. Thus, it provides a simple, unifying account for variations in learning speeds, where the reinforcement learning mechanism monitors and controls the motor learning process.


Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Meta-learning theory and paradigm.
A Motor learning as a sequential decision-making process. The action u(k) updates the memory and sensory prediction error states {x(k), e(k)} to the next states {x(k+1), e(k+1)} and generates a reward r(x(k+1)) in the given environment (p: perturbation). The action u(k) responds to {x(k), e(k)}, is characterized by the meta-parameter θ and influenced by memory noise n_x(k), i.e., it is drawn from a policy distribution u(k) ~ π_θ(u(k) | x(k), e(k)). This aligns with previous models of error-based motor learning, x(k+1) = α x(k) + β e(k) + n_x(k) (α: retention rate, β: learning rate), when the learner has a linear policy function. B The primary hypothesis of this study is that the meta-parameter θ = [α, β]^T is updated by a reinforcement learning rule (policy gradient) to maximize rewards and minimize punishments: θ ← θ + ∇_θ log π_θ(u(k) | x(k), e(k)) · r(x(k+1)). C Simulated change of meta-parameters under the two opposite reward functions. Reward is given for learning in “Promote” (magenta) and for not learning in “Suppress” (cyan). Reinforcement learning upregulates θ = [α, β]^T to learn faster in Promote, whereas it downregulates them to learn slower in Suppress. a.u. = arbitrary unit. D Meta-learning training. Learners experience a sensory-error (E) trial in which the sensory prediction error e is induced by cursor rotation p while the task error is clamped (TE clamp). Subsequently, they experience a reward (R) trial in which the updated memory u, manifested as an aftereffect h = Tx, is evaluated with the reward function r. Promote and Suppress were implemented by linking the aftereffect and reward in opposite directions. Reward is delivered as a numerical score associated with monetary reward. E The task schedule. Learners repeat meta-learning training comprising pairs of E and R trials and Null trials (in which veridical cursor feedback is given). After every six repetitions of training, they perform a probe task, developed from previously established motor learning paradigms, to estimate learning parameters. The simulated reach behavior and changes in θ are plotted for Promote and Suppress. F The task is separated into four blocks, and behavior is analyzed block by block. The first block, marked in pink, is the baseline condition in which the score is absent in R trials.
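To make the rule in panels A–B concrete, here is a minimal simulation sketch in Python (not the authors' code): a Gaussian policy over the memory update whose meta-parameters θ = [α, β] are adjusted by the policy-gradient (REINFORCE) rule above, under the Promote and Suppress reward functions. The parameter values, the Gaussian policy with fixed noise SD, the reward shapes, the simplified washout on Null trials, and the running reward baseline are my assumptions, for illustration only.

```python
# Minimal sketch of the Fig. 1 meta-learning rule (illustrative assumptions throughout).
import numpy as np

def train(promote, n_pairs=400, eta=5e-4, seed=0):
    rng = np.random.default_rng(seed)
    alpha, beta = 0.9, 0.3       # meta-parameters theta = [alpha, beta] (assumed start values)
    sigma = 0.5                  # SD of memory noise n_x (assumed)
    p = 10.0                     # cursor rotation (deg) inducing the sensory prediction error
    x, baseline = 0.0, 0.0       # motor memory state; running reward baseline (my addition)
    for _ in range(n_pairs):
        e = -p                                   # E trial: error set by the rotation (simplified)
        mu = alpha * x + beta * e                # mean of the policy pi_theta(u | x, e)
        u = mu + sigma * rng.standard_normal()   # sampled memory update (action)
        # R trial: the aftereffect (proportional to the updated memory) is scored.
        h = abs(u)
        r = h if promote else -h                 # Promote rewards learning, Suppress rewards not learning
        adv = r - baseline                       # baseline subtraction = standard variance reduction
        baseline += 0.1 * (r - baseline)
        score = (u - mu) / sigma**2              # d/d(mu) of log N(u; mu, sigma^2)
        alpha = float(np.clip(alpha + eta * adv * score * x, 0.0, 1.0))  # grad wrt alpha: score * x
        beta = float(np.clip(beta + eta * adv * score * e, 0.0, 1.0))    # grad wrt beta: score * e
        x = 0.1 * u              # Null trials wash out most of the memory (simplified)
    return alpha, beta

for label, promote in (("Promote", True), ("Suppress", False)):
    a, b = train(promote)
    print(f"{label}: final retention alpha ≈ {a:.2f}, learning rate beta ≈ {b:.2f}")
```

Run as-is, the sketch should drive β (and, weakly, α) up under Promote and β down toward zero under Suppress, qualitatively mirroring panel C; the exact values depend entirely on the assumed settings, and the crude washout leaves α nearly unchanged in Suppress.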
Fig. 2. Changes in behavior and estimated learning parameters with negative scores (Experiment 1).
A Memory profiles of the two groups in Blocks 1 and 4 of meta-learning training with opposite reward functions (Promote, magenta, and Suppress, cyan). B Changes during blocks and estimated linear slopes in the initial/accumulated memory updates in the first/last R trials. C Changes during blocks and estimated linear slopes of the average score per trial. D Memory profiles of the two groups in Blocks 1 and 4 of Probe. E Changes over blocks and estimated linear slopes in the initial memory update in the second E trial with rotation and the accumulation and retention of memory, measured as the average of all error-clamp trials. F Changes during blocks and estimated linear slopes in estimated learning parameters. For all panels, thick lines/dots/circles and error bars/shading represent group means and SEMs, except the error bars in F, where box and whisker represent the 25–75th and 2.5–97.5th percentiles of the posterior density estimated by the Markov Chain Monte Carlo (MCMC) method. Faded lines represent individual participants’ data in (B), (C), and (E). Each mean and SEM is calculated from data of 20 human participants per group (N = 40 in total). NS indicates “Not Significant” between Promote and Suppress at the pre-perturbation baseline trial (two-sided Wilcoxon rank sum test for Training Block 1: p = 0.95 and Block 4: p = 0.19, Probe-Block 1: p = 0.60 and Block 4: p = 0.90).
Fig. 3. Changes in behavior and estimated learning parameters with positive scores (Experiment 2).
A Memory profiles of the two groups in Blocks 1 and 4 of meta-learning training with opposite reward functions (Promote, magenta, and Suppress, cyan). B Changes during blocks and estimated linear slopes in the initial/accumulated memory updates in the first/last R trials. C Changes in memory size during blocks and estimated linear slopes in the average score per trial. D Memory (aftereffect) profiles of the two groups in Blocks 1 and 4 of Probe. E Changes during blocks and estimated linear slopes in the initial memory update in the second E trial with rotation and the accumulation and retention of memory, measured as the average of all error-clamp trials. F Changes over blocks and estimated linear slopes in estimated learning parameters. G Meta-learning rates (η) for each experiment and the difference between punishments and rewards (difference between Experiments 1 and 2). For all panels, lines/dots/circles and error bars/shading represent group means and SEMs, except the error bars in (F) and (G), where box and whisker represent the 25–75th and 2.5–97.5th percentiles of the posterior density estimated by the MCMC method. Faded lines represent individual participants’ data in (B), (C), and (E). Each mean and SEM is calculated from data of 20 human participants per group (N = 40 in total for A–F, N = 80 for G). NS indicates “Not Significant” (two-sided Wilcoxon rank sum test for Training Block 1: p = 0.70 and Block 4: p = 0.53, Probe-Block 1: p = 0.49 and Block 4: p = 0.23).
Fig. 4. Replication of previous reports on changes in the speed of motor learning by simulation with the present Meta-learning model.
Assuming that task error acts as punishment feedback, the present meta-learning model replicated previous reports. A, B Changes in error-sensitivity (speed of learning, i.e., β) in environments with different probabilities of flip in the perturbation direction, which shape the history of errors, reported in the original study (A) and simulated by the model (B). Sensitivity increased in a stable environment (z = 0.9, red) and decreased in a rapidly changing environment (z = 0.1, blue). Sensitivity did not show apparent changes at medium stability (z = 0.5, green). The simulated memory profile at the end of the task is also plotted for each condition (B, left). C, D Effects of manipulating task error on motor learning, reported in the original study (C) and simulated by the model (D). Acceleration of learning (savings, blue) disappeared when a task error was randomly given (green) or removed (red). E, F Effects of reward/punishment on trajectory error, reported in the original study (E) and simulated by the model (F). The inset shows the meta-learning rates used for the simulation. Learning accelerated only in adaptation to punishment (black), compared to adaptation to reward (red) or random positive (blue). For all panels, lines/dots and error bars/shaded areas indicate the mean and SEM. a.u. = arbitrary unit. Panel (A) is from David J. Herzfeld et al., A memory of errors in sensorimotor learning. Science 345, 1349–1353 (2014). DOI: 10.1126/science.1253138. Reprinted with permission from AAAS. Panel (C) is adapted with permission from Leow, L. A., Marinovic, W., de Rugy, A. & Carroll, T. J. Task errors drive memories that improve sensorimotor adaptation. J. Neurosci. 40(15), 3075–3088 (2020). doi: 10.1523/JNEUROSCI.1506-19.2020. https://creativecommons.org/licenses/by/4.0/. Panel (E) is reproduced with permission from Springer Nature: Galea, J., Mallia, E., Rothwell, J. et al. The dissociable effects of punishment and reward on motor learning. Nat. Neurosci. 18, 597–602 (2015). doi: 10.1038/nn.3956.
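As a companion to panels A and B, the sketch below shows how the same policy-gradient rule, with task error treated as punishment (r = −|error|), tends to raise error-sensitivity β when the perturbation direction is stable and lower it when it flips often. I read z as the probability that the perturbation keeps its sign on the next trial (so z = 0.9 is the stable environment, matching the caption); that reading, the parameter values, the reward shape, the baseline, and the seed-averaging are all assumptions, and the intermediate-stability case is not simulated here.

```python
# Sketch: environment stability shaping error-sensitivity under the same policy-gradient rule.
# Only beta is adapted, matching the error-sensitivity measure in panels A-B.
import numpy as np

def run(stay_prob, n_trials=3000, eta=2e-4, seed=0):
    rng = np.random.default_rng(seed)
    alpha, beta, sigma = 0.9, 0.3, 0.5        # assumed retention, initial sensitivity, memory noise SD
    p, x, baseline = 10.0, 0.0, 0.0           # perturbation (sign may flip), memory state, reward baseline
    for _ in range(n_trials):
        e = -p - x                            # sensory prediction error on this trial
        mu = alpha * x + beta * e             # mean of pi_theta(u | x, e)
        u = mu + sigma * rng.standard_normal()
        x = u                                 # updated motor memory
        if rng.random() > stay_prob:          # environment may flip the perturbation direction
            p = -p
        r = -abs(-p - x)                      # task error after the update, treated as punishment
        adv = r - baseline                    # running baseline = variance reduction (my addition)
        baseline += 0.1 * (r - baseline)
        beta = float(np.clip(beta + eta * adv * (u - mu) / sigma**2 * e, 0.0, 1.0))
    return beta

for z in (0.9, 0.1):                          # stable vs. rapidly switching environment
    finals = [run(z, seed=s) for s in range(10)]
    print(f"stay probability z = {z}: mean final beta ≈ {np.mean(finals):.2f} (start 0.3)")
```

Averaging over a handful of seeds only smooths the single-run noise of the REINFORCE estimator; the qualitative direction of change (up for the stable environment, down for the rapidly switching one) is the point of the sketch.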

