Confidence-based progress-driven self-generated goals for skill acquisition in developmental robots

Hung Ngo et al. Front Psychol. 2013 Nov 26;4:833. doi: 10.3389/fpsyg.2013.00833. eCollection 2013.
Abstract

A reinforcement learning agent that autonomously explores its environment can use a curiosity drive to enable continual learning of skills, in the absence of any external rewards. We formulate curiosity-driven exploration, and eventual skill acquisition, as a selective sampling problem. Each environment setting provides the agent with a stream of instances. An instance is a sensory observation that, when queried, causes an outcome that the agent is trying to predict. After an instance is observed, a query condition, derived herein, tells whether its outcome is statistically known or unknown to the agent, based on the confidence interval of an online linear classifier. Upon encountering the first unknown instance, the agent "queries" the environment to observe the outcome, which is expected to improve its confidence in the corresponding predictor. If the environment is in a setting where all instances are known, the agent generates a plan of actions to reach a new setting, where an unknown instance is likely to be encountered. The desired setting is a self-generated goal, and the plan of actions, essentially a program to solve a problem, is a skill. The success of the plan depends on the quality of the agent's predictors, which are improved as mentioned above. For validation, this method is applied to both a simulated and a real Katana robot arm in its "blocks-world" environment. Results show that the proposed method generates sample-efficient curious exploration behavior, which exhibits developmental stages, continual learning, and skill acquisition, in an intrinsically motivated playful agent.
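The query condition lends itself to a compact sketch. Below is a minimal, illustrative Python version of a confidence-interval selective sampler of the kind the abstract describes; it is not the paper's exact derivation, and the class name, the regularizer lam, and the width multiplier z are our own assumptions.

    import numpy as np

    class SelectiveSampler:
        """Online linear classifier with a confidence-interval query rule.

        Illustrative sketch: a regularized least-squares classifier updated
        one instance at a time. An instance is treated as statistically
        "unknown" when the confidence band around its predicted margin
        still straddles zero.
        """

        def __init__(self, dim, lam=1.0, z=1.0):
            self.A = lam * np.eye(dim)  # regularized correlation matrix
            self.b = np.zeros(dim)      # accumulated label-weighted inputs
            self.z = z                  # confidence-width multiplier (assumed)

        def margin_and_width(self, x):
            A_inv = np.linalg.inv(self.A)
            w = A_inv @ self.b                       # current weight estimate
            margin = w @ x                           # signed prediction
            width = self.z * np.sqrt(x @ A_inv @ x)  # interval half-width
            return margin, width

        def is_unknown(self, x):
            # Query condition: is the outcome statistically unknown?
            margin, width = self.margin_and_width(x)
            return abs(margin) <= width  # interval straddles zero

        def update(self, x, y):
            # Incorporate a queried outcome y in {-1, +1}.
            self.A += np.outer(x, x)
            self.b += y * x

Querying only when the interval straddles zero is what makes the exploration sample-efficient: confidently classified instances are skipped, and each answered query shrinks the interval for similar future instances.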

Keywords: AI planning; artificial curiosity; continual learning; developmental robotics; intrinsic motivation; Markov decision processes; online active learning; systematic exploration.


Figures

Figure 1
The Katana robot arm in its blocks-world environment.
Figure 2
Left: (a–f) Examples illustrating the features that were used. Right: An example showing how the state and action are encoded (bottom) for a given blocks-world setting (top). See text for details.
Figure 3
A single robot-environment interaction, illustrating a setting change. Each pick-and-place “experiment” causes a change in setting. The outcome of the previous experiment was that the robot placed the blue block on top of the yellow block and observed the label +1, corresponding to “stable.” Now (middle), the robot examines three fovea locations (t, t′, and t″), each of which involves a query. The query is false for t′ and t″, but true for t, and the robot immediately (greedily) grasps the furthest block, which happens to be the red one, and places it at the queried location. The action causes a change in setting to i + 1 and the outcome −1 is observed (“unstable”).
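The interaction in Figure 3 can be summarized as a short loop. This is a hedged sketch only: env.fovea_locations() and env.pick_and_place() are hypothetical helpers standing in for the robot's perception and manipulation routines (which, per the figure, also pick the furthest free block to grasp), and sampler is the kind of confidence-based classifier sketched after the abstract.

    def explore_setting(env, sampler):
        # One robot-environment interaction, as illustrated in Figure 3.
        # env.fovea_locations() yields (location, feature-vector) pairs;
        # env.pick_and_place(loc) runs the experiment and returns the
        # observed stability label: +1 (stable) or -1 (unstable).
        for loc, x in env.fovea_locations():
            if sampler.is_unknown(x):        # first unknown instance: query it
                y = env.pick_and_place(loc)  # grasp a block, place, observe
                sampler.update(x, y)         # improve the predictor
                return True                  # experiment done; setting changed
        return False  # all instances known: plan a path to a new setting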
Figure 4
Exploration history (averaged over 10 runs).
Figure 5
KL-divergence between learned models and ground-truth models (averaged over 10 runs). Best viewed in color.
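As a point of reference, the metric in Figure 5 can be written out for the binary stable/unstable outcome, assuming each model assigns a probability to the “stable” label per instance (our reading of the setup, not a formula quoted from the paper):

    import numpy as np

    def kl_bernoulli(p_true, p_learned, eps=1e-12):
        # KL(p_true || p_learned) for Bernoulli outcome probabilities,
        # clipped away from 0 and 1 for numerical stability.
        p = np.clip(p_true, eps, 1 - eps)
        q = np.clip(p_learned, eps, 1 - eps)
        return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

    # One curve point per time step t: average kl_bernoulli over all
    # instances, then average the resulting curves over the 10 runs.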
Figure 6
How the focus of the self-generated exploration goals at height 1 changes over time as the learned predictive model gets closer to the true one.
Figure 7
How the focus of the self-generated exploration goals at height 2 changes over time as the learned predictive model gets closer to the true one.
Figure 8
Experience distribution after the last timestep (learning has completed) for heights 1–6.
Figure 9
A comparison of exploration methods in terms of the KL-divergence between the learned predictive models at each time step and their ground-truth models. Results are averaged over 10 runs.
Figure 10
Sample query sequence on real robot (1/3).
Figure 11
Sample query sequence on real robot (2/3).
Figure 12
Sample query sequence on real robot (3/3).
Figure 13
Learning progress of the Katana robot arm's predictive models at heights 1 and 2 after 30 settings. Action 1 (no bits set) is the most unstable; action 6 (all bits set) is the most stable. See the earlier discussion of the features and Figure 2.
Figure 14
A “tricky” situation for testing the robot's stacking skill, shown to illustrate the value of exploring to learn how the world works. Suppose the robot is tasked with building a stack of blocks as quickly as possible from this initial setting. Given its learned model of the world, the robot decides to start stacking from height 1 rather than height 2, since with high probability the two-block stack will fall once another block is placed on top.
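The decision in the Figure 14 caption amounts to comparing the success probabilities of candidate plans under the learned model. A toy illustration with invented numbers (the real probabilities come from the robot's learned predictors):

    def plan_success_prob(placement_probs):
        # Probability that an entire stacking plan succeeds, assuming each
        # placement's stability probability is independent and supplied by
        # the learned predictive model.
        prob = 1.0
        for p in placement_probs:
            prob *= p
        return prob

    # Invented numbers for illustration: extending the existing two-block
    # stack needs fewer placements, but each is predicted to be unstable;
    # restarting at height 1 needs more placements that are each reliable.
    on_existing = plan_success_prob([0.3, 0.3])        # ~0.09
    from_scratch = plan_success_prob([0.9, 0.9, 0.9])  # ~0.73
    # from_scratch > on_existing, so the robot starts from height 1.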

