Front Neurorobot. 2017 Apr 3;11:10. doi: 10.3389/fnbot.2017.00010. eCollection 2017.

Improving Robot Motor Learning with Negatively Valenced Reinforcement Signals

Nicolás Navarro-Guerrero et al. Front Neurorobot. 2017.

Abstract

Both nociception and punishment signals have been used in robotics. However, the potential of these negatively valenced reinforcement signals for robot learning has not yet been explored in detail. Nociceptive signals are primarily used as triggers of preprogrammed action sequences. Punishment signals are typically disembodied, i.e., they bear little or no relation to the agent's intrinsic limitations, and they are often used to impose behavioral constraints. Here, we provide an alternative approach in which nociceptive signals drive learning rather than simply trigger preprogrammed behavior. Specifically, we use nociception to expand the state space, while we use punishment as a negative reinforcement learning signal. We compare the performance (in terms of task error, amount of perceived nociception, and length of learned action sequences) of different neural networks imbued with punishment-based reinforcement signals for inverse kinematics learning. We contrast the performance of a version of the neural network that receives nociceptive inputs with one that does not. Furthermore, we provide evidence that nociception can improve learning, making the algorithm more robust to network initialization, as well as behavioral performance, by reducing the task error, perceived nociception, and length of learned action sequences. Moreover, we provide evidence that punishment, at least as typically used within reinforcement learning applications, may be detrimental on all relevant metrics.
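To make the distinction above concrete, the sketch below illustrates one way the two signals could enter a learner: nociception as additional state inputs rather than a reflex trigger, and punishment as a negative term in the scalar reinforcement. This is a minimal illustration, not the authors' implementation; all function names, thresholds, and weights are hypothetical.

```python
import numpy as np

# Minimal sketch (hypothetical names and thresholds): nociception is fed to
# the learner as extra state dimensions, while punishment contributes a
# negative term to the scalar reinforcement signal.

def nociceptive_signal(joint_angles, joint_limits, margin=0.1):
    """Per-joint 'nociception' that grows as a joint approaches its limit.

    joint_limits has shape (n_joints, 2) with columns (lower, upper)."""
    lo, hi = joint_limits[:, 0], joint_limits[:, 1]
    dist_to_limit = np.minimum(joint_angles - lo, hi - joint_angles)
    return np.clip((margin - dist_to_limit) / margin, 0.0, 1.0)

def build_state(proprioception, target_xy, nociception):
    # Nociception expands the state space instead of triggering a
    # preprogrammed withdrawal reflex.
    return np.concatenate([proprioception, target_xy, nociception])

def reinforcement(task_error, nociception, w_punish=1.0):
    reward = -task_error                         # smaller error -> larger reward
    punishment = -w_punish * nociception.sum()   # negatively valenced term
    return reward + punishment

# Example usage with made-up numbers (second joint is near its upper limit):
angles = np.array([0.5, 1.45])
limits = np.array([[-2.0, 2.0], [-1.5, 1.5]])
noci = nociceptive_signal(angles, limits)
state = build_state(angles, np.array([0.2, 0.3]), noci)
r = reinforcement(task_error=0.25, nociception=noci)
```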

Keywords: inverse kinematics; nociception; punishment; reinforcement learning; self-protective mechanisms.


Figures

Figure 1
(A) Top view of the NAO robot facing left. The left arm is depicted in different positions and a blue line is superimposed to indicate the boundaries of the end-effector workspace. (B) Depiction of the target and end-effector coordinates of the randomly generated training set. Blue dots represent targets, whereas red asterisks represent end-effector initial positions. (C) The histograms show the initial distance between the end-effector and the corresponding target.
Figure 2
The neural architecture used for inverse kinematics learning. For clarity, only one connection weight is shown (arrow between neuron layers). The hidden layers of both the Actor (left-hand side) and the Critic (right-hand side) are tuned independently. Solid units and black connection weights correspond to the baseline, i.e., the reward-only condition, and are extended by the other three conditions. The punishment feedback given to the Critic, depicted in red, is used only in the reward + punishment and reward + punishment + nociception conditions. Blue dashed units and blue dashed connection weights are considered only under the reward + nociception and reward + punishment + nociception conditions. During training, the action a* is performed; a* is obtained by exploring around the action a, as described in equation (2). The Critic is trained at every time step on the TD error δ, while the Actor is trained only if the TD error is positive.
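The update rule described in the caption (the Critic learns from the TD error δ at every time step, the Actor only when δ is positive, and the executed action a* is an exploratory perturbation of the Actor's output a) resembles a CACLA-style continuous actor-critic update. The sketch below assumes Gaussian exploration and tiny linear stand-ins for the networks of Figure 2; it is not the paper's equation (2), and all learning rates, seeds, and shapes are invented.

```python
import numpy as np

class LinearNet:
    """Tiny linear stand-in for the MLPs of Figure 2 (illustrative only)."""
    def __init__(self, n_in, n_out, lr=0.01, seed=0):
        self.W = np.random.default_rng(seed).normal(0.0, 0.1, (n_out, n_in))
        self.lr = lr

    def predict(self, x):
        return self.W @ x

    def update(self, x, target):
        # One gradient step on squared error toward `target`.
        self.W += self.lr * np.outer(target - self.predict(x), x)

def select_action(actor, s, sigma=0.1, rng=None):
    """Gaussian exploration around the Actor's output a (cf. equation (2))."""
    rng = rng or np.random.default_rng()
    a = actor.predict(s)
    return a + rng.normal(0.0, sigma, a.shape)

def learn(actor, critic, s, a_star, r, s_next, gamma=0.9):
    v_s = critic.predict(s)[0]
    v_next = critic.predict(s_next)[0]
    delta = r + gamma * v_next - v_s                  # TD error
    critic.update(s, np.array([r + gamma * v_next]))  # Critic: every time step
    if delta > 0:                                     # Actor: only on positive TD error
        actor.update(s, a_star)                      # pull Actor output toward a*
    return delta

# Hypothetical dimensions: 6-D state, 2-D action (one per learned joint).
actor, critic = LinearNet(6, 2), LinearNet(6, 1)
```

Updating the Actor only on positive δ pulls it toward explored actions that performed better than the Critic predicted, which is the hallmark of CACLA-style continuous actor-critic learning.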
Figure 3
A representative example of the fitness distribution over generations for all four conditions; the results for each individual condition can be found in the Supplementary Material (Navarro-Guerrero et al.). The fitness is computed directly from the total distance to the target over the testing set once learning has concluded; thus, lower values are better.
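Given the caption's definition, the fitness of an individual is simply the summed end-effector-to-target distance over the testing set after learning; a one-function sketch (array names and shapes are assumptions):

```python
import numpy as np

def fitness(end_effector_xy, target_xy):
    """Total Euclidean distance to target over the testing set
    (assumed shapes: (n_samples, 2)); lower is better."""
    return np.linalg.norm(end_effector_xy - target_xy, axis=1).sum()
```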
Figure 4
Mean positioning error of the best individuals of each condition. All conditions except reward + punishment converge quickly and reach a small positioning error within a small number of epochs.
Figure 5
Average measured nociception for both joints, over all samples in the validation set, for the best individuals of each condition. All four conditions show oscillatory but convergent behavior. The condition trained with reward and nociceptive units converges to a smaller value than all other conditions.
Figure 6
Average number of steps needed to position the robot’s arm for the best individuals of each condition.
