Top Cogn Sci. 2022 Apr;14(2):311-326. doi: 10.1111/tops.12595. Epub 2022 Jan 10.

A Critical Period for Robust Curriculum-Based Deep Reinforcement Learning of Sequential Action in a Robot Arm


Roy de Kleijn et al. Top Cogn Sci. 2022 Apr.

Abstract

Many everyday activities are sequential in nature; that is, they can be seen as a sequence of subactions and, sometimes, subgoals. In the motor execution of sequential action, context effects are observed in which later subactions modulate the execution of earlier subactions (e.g., when reaching for an overturned mug, people optimize their grasp to achieve a comfortable end state). A trajectory (movement) adaptation of an often-used paradigm in the study of sequential action, the serial response time task, showed several context effects, of which centering behavior is of special interest. Centering behavior refers to the tendency (or strategy) of subjects to move their arm or mouse cursor to a position equidistant from all stimuli in the absence of predictive information, thereby reducing movement time to all possible targets. In the current study, we investigated sequential action in a virtual robotic agent trained using proximal policy optimization (PPO), a state-of-the-art deep reinforcement learning algorithm. The agent was trained to reach for appearing targets, similar to a serial response time task given to humans. We found that agents were more likely to develop centering behavior similar to that of human subjects after curriculum-based learning. In our curriculum, we first rewarded agents for reaching targets before introducing a penalty for energy expenditure. When the penalty was applied without a curriculum, many agents failed to learn the task due to a lack of action-space exploration, resulting in high variability of agents' performance. Our findings suggest that in virtual agents, as in infants, early energetic exploration can promote robust later learning. This may have the same effect as infants' curiosity-based learning, by which they shape their own curriculum. However, the introduction of new goals cannot wait too long, as there may be critical periods in development after which agents (like humans) cannot flexibly learn to incorporate new objectives. These lessons are making their way into machine learning and offer exciting new avenues for studying both human and machine learning of sequential action.
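The two-phase curriculum described in the abstract (reward reaching first, introduce an energy penalty later) can be sketched as a staged reward function. The thresholds, penalty weight, and switch point below are illustrative assumptions, not the authors' actual implementation:

```python
def reward(distance_to_target, energy_used, step, curriculum_switch=1_000_000):
    """Two-phase curriculum reward (illustrative sketch).

    Phase 1 (step < curriculum_switch): reward reaching only, which
    encourages energetic exploration of the action space.
    Phase 2: the same reaching reward minus an energy-expenditure penalty.
    All constants here are hypothetical placeholders.
    """
    # Bonus for touching the target, otherwise a penalty growing with distance.
    reach_reward = 1.0 if distance_to_target < 0.05 else -distance_to_target
    if step < curriculum_switch:
        return reach_reward                      # phase 1: reaching only
    return reach_reward - 0.1 * energy_used      # phase 2: add energy penalty
```

Applying the penalty from the start (i.e., `curriculum_switch=0`) corresponds to the no-curriculum condition in which many agents failed to explore enough to learn the task.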

Keywords: Curriculum learning; Movement optimization; Reinforcement learning; Robotic arm control; Sequential action.


Figures

Fig. 1
Participants spent increasingly more time during the ISI near the optimal position, equidistant from all possible stimulus locations. Distances in pixels. Shaded regions represent 95% CI. Data from de Kleijn et al. (2018b).
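The "optimal position" in this caption is the point equidistant from all possible stimulus locations. For a symmetric layout, that point coincides with the centroid of the locations, which can be computed directly; the four target coordinates below are assumed for illustration, not the study's actual layout:

```python
import math

def centering_position(targets):
    """Centroid of the possible target locations. For a symmetric layout
    (targets on a common circle) the centroid is exactly equidistant from
    every target, making it the movement-time-minimizing waiting position."""
    n = len(targets)
    cx = sum(x for x, _ in targets) / n
    cy = sum(y for _, y in targets) / n
    return cx, cy

# Hypothetical symmetric layout: four targets on a unit circle.
targets = [(1, 0), (0, 1), (-1, 0), (0, -1)]
cx, cy = centering_position(targets)
distances = [math.hypot(x - cx, y - cy) for x, y in targets]
```

With this layout the centroid is the origin and each target lies at the same distance from it, so waiting there minimizes the worst-case reach.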
Fig. 2
The virtual environment with the target in green. The arm with a blue hand is moving toward the target. Targets could appear in one of four randomly chosen locations.
Fig. 3
Mean cumulative reward per episode. Shaded areas represent the SD.
Fig. 4
(a) Cumulative reward versus mean distance to optimal position and (b) cumulative reward versus total distance moved for 20 episodes from 5M to 6M for each run.
Fig. 5
Mean response time (in time steps) to touch the target. Shaded area represents SD. Note the log scale of the y‐axis.
Fig. 6
Absolute distance moved per time step (a) and absolute distance from the optimal position (b), i.e., the point equidistant from all possible target locations. Shaded regions represent SD.
