Front Neurorobot. 2017 Apr 3;11:10. doi: 10.3389/fnbot.2017.00010. eCollection 2017.

Improving Robot Motor Learning with Negatively Valenced Reinforcement Signals

Nicolás Navarro-Guerrero et al. Front Neurorobot. 2017.

Abstract

Both nociception and punishment signals have been used in robotics. However, the potential of these negatively valenced reinforcement signals for robot learning has not yet been explored in detail. Nociceptive signals are primarily used as triggers of preprogrammed action sequences. Punishment signals are typically disembodied, i.e., they bear little or no relation to the agent's intrinsic limitations, and they are often used to impose behavioral constraints. Here, we provide an alternative approach in which nociceptive signals drive learning rather than simply trigger preprogrammed behavior. Specifically, we use nociception to expand the state space, while we use punishment as a negative reinforcement learning signal. We compare the performance (in terms of task error, amount of perceived nociception, and length of learned action sequences) of different neural networks imbued with punishment-based reinforcement signals for inverse kinematics learning. We contrast the performance of a version of the neural network that receives nociceptive inputs with one that does not. Furthermore, we provide evidence that nociception can improve learning, making the algorithm more robust to network initialization, as well as behavioral performance, by reducing the task error, perceived nociception, and length of learned action sequences. Moreover, we provide evidence that punishment, at least as typically used within reinforcement learning applications, may be detrimental on all relevant metrics.
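To make the distinction above concrete, the sketch below illustrates one way the two signals could enter a learner: nociception as additional state inputs rather than a reflex trigger, and punishment as a negative term in the scalar reinforcement. This is a minimal illustration, not the authors' implementation; all function names, thresholds, and weights are hypothetical.

```python
import numpy as np

# Minimal sketch (hypothetical names and thresholds): nociception is fed to
# the learner as extra state dimensions, while punishment contributes a
# negative term to the scalar reinforcement signal.

def nociceptive_signal(joint_angles, joint_limits, margin=0.1):
    """Per-joint 'nociception' that grows as a joint approaches its limit.

    joint_limits has shape (n_joints, 2) with columns (lower, upper)."""
    lo, hi = joint_limits[:, 0], joint_limits[:, 1]
    dist_to_limit = np.minimum(joint_angles - lo, hi - joint_angles)
    return np.clip((margin - dist_to_limit) / margin, 0.0, 1.0)

def build_state(proprioception, target_xy, nociception):
    # Nociception expands the state space instead of triggering a
    # preprogrammed withdrawal reflex.
    return np.concatenate([proprioception, target_xy, nociception])

def reinforcement(task_error, nociception, w_punish=1.0):
    reward = -task_error                         # smaller error -> larger reward
    punishment = -w_punish * nociception.sum()   # negatively valenced term
    return reward + punishment

# Example usage with made-up numbers (second joint is near its upper limit):
angles = np.array([0.5, 1.45])
limits = np.array([[-2.0, 2.0], [-1.5, 1.5]])
noci = nociceptive_signal(angles, limits)
state = build_state(angles, np.array([0.2, 0.3]), noci)
r = reinforcement(task_error=0.25, nociception=noci)
```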

Keywords: inverse kinematics; nociception; punishment; reinforcement learning; self-protective mechanisms.


Figures

Figure 1
(A) Top view of the NAO robot facing left. The left arm is depicted in different positions and a blue line is superimposed to indicate the boundaries of the end-effector workspace. (B) Depiction of the target and end-effector coordinates of the randomly generated training set. Blue dots represent targets, whereas red asterisks represent end-effector initial positions. (C) The histograms show the initial distance between the end-effector and the corresponding target.
Figure 2
The neural architecture used for inverse kinematics learning. For clarity, only one connection weight is shown (arrow between neuron layers). The hidden layers of both the Actor (left-hand side) and the Critic (right-hand side) are tuned independently. Solid units and black connection weights correspond to the baseline, i.e., the reward-only condition, and are extended by the other three conditions. The punishment feedback given to the Critic, depicted in red, is used only in the reward + punishment and reward + punishment + nociception conditions. Blue dashed units and blue dashed connection weights are considered only under the reward + nociception and reward + punishment + nociception conditions. During training, the action a* is performed; a* is obtained by exploring around the action a, as described in equation (2). The Critic is trained at every time step on the TD error δ, while the Actor is trained only if the TD error is positive.
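The update rule described in the caption (the Critic learns from the TD error δ at every time step, the Actor only when δ is positive, and the executed action a* is an exploratory perturbation of the Actor's output a) resembles a CACLA-style continuous actor-critic update. The sketch below assumes Gaussian exploration and tiny linear stand-ins for the networks of Figure 2; it is not the paper's equation (2), and all learning rates, seeds, and shapes are invented.

```python
import numpy as np

class LinearNet:
    """Tiny linear stand-in for the MLPs of Figure 2 (illustrative only)."""
    def __init__(self, n_in, n_out, lr=0.01, seed=0):
        self.W = np.random.default_rng(seed).normal(0.0, 0.1, (n_out, n_in))
        self.lr = lr

    def predict(self, x):
        return self.W @ x

    def update(self, x, target):
        # One gradient step on squared error toward `target`.
        self.W += self.lr * np.outer(target - self.predict(x), x)

def select_action(actor, s, sigma=0.1, rng=None):
    """Gaussian exploration around the Actor's output a (cf. equation (2))."""
    rng = rng or np.random.default_rng()
    a = actor.predict(s)
    return a + rng.normal(0.0, sigma, a.shape)

def learn(actor, critic, s, a_star, r, s_next, gamma=0.9):
    v_s = critic.predict(s)[0]
    v_next = critic.predict(s_next)[0]
    delta = r + gamma * v_next - v_s                  # TD error
    critic.update(s, np.array([r + gamma * v_next]))  # Critic: every time step
    if delta > 0:                                     # Actor: only on positive TD error
        actor.update(s, a_star)                      # pull Actor output toward a*
    return delta

# Hypothetical dimensions: 6-D state, 2-D action (one per learned joint).
actor, critic = LinearNet(6, 2), LinearNet(6, 1)
```

Updating the Actor only on positive δ pulls it toward explored actions that performed better than the Critic predicted, which is the hallmark of CACLA-style continuous actor-critic learning.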
Figure 3
A representative example of the fitness distribution over generations for all four conditions; the results for each individual condition can be found in the Supplementary Material (Navarro-Guerrero et al.). The fitness is computed directly from the total distance to the target over the testing set once learning has concluded; thus, lower values are better.
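Given the caption's definition, the fitness of an individual is simply the summed end-effector-to-target distance over the testing set after learning; a one-function sketch (array names and shapes are assumptions):

```python
import numpy as np

def fitness(end_effector_xy, target_xy):
    """Total Euclidean distance to target over the testing set
    (assumed shapes: (n_samples, 2)); lower is better."""
    return np.linalg.norm(end_effector_xy - target_xy, axis=1).sum()
```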
Figure 4
Mean positioning error of the best individuals of each condition. All conditions except reward + punishment converge quickly and reach a small positioning error within a small number of epochs.
Figure 5
Average measured nociception for both joints, over all samples in the validation set, for the best individuals of each condition. All four conditions show oscillatory but convergent behavior. The condition trained with reward and nociceptive units converges to a smaller value than all other conditions.
Figure 6
Average number of steps needed to position the robot’s arm for the best individuals of each condition.
