Front Neurorobot. 2013 Mar 12:7:4.
doi: 10.3389/fnbot.2013.00004. eCollection 2013.

A biologically plausible embodied model of action discovery

Rufino Bolado-Gomez et al. Front Neurorobot. 2013.

Abstract

During development, animals can spontaneously discover action-outcome pairings enabling subsequent achievement of their goals. We present a biologically plausible embodied model addressing key aspects of this process. The biomimetic model core comprises the basal ganglia and its loops through cortex and thalamus. We incorporate reinforcement learning (RL) with phasic dopamine supplying a sensory prediction error, signalling "surprising" outcomes. Phasic dopamine is used in a cortico-striatal learning rule which is consistent with recent data. We also hypothesized that objects associated with surprising outcomes acquire "novelty salience" contingent on the predictability of the outcome. To test this idea we used a simple model of prediction governing the dynamics of novelty salience and phasic dopamine. The task of the virtual robotic agent mimicked an in vivo counterpart (Gancarz et al., 2011) and involved interaction with a target object which caused a light flash, or a control object which did not. Learning took place according to two schedules. In one, the phasic outcome was delivered after interaction with the target in an unpredictable way which emulated the in vivo protocol. Without novelty salience, the model was unable to account for the experimental data. In the other schedule, the phasic outcome was reliably delivered and the agent showed a rapid increase in the number of interactions with the target which then decreased over subsequent sessions. We argue this is precisely the kind of change in behavior required to repeatedly present representations of context, action and outcome, to neural networks responsible for learning action-outcome contingency. The model also showed cortico-striatal plasticity consistent with learning a new action in basal ganglia. We conclude that action learning is underpinned by a complex interplay of plasticity and stimulus salience, and that our model contains many of the elements for biological action discovery to take place.

Keywords: action selection; basal ganglia; intrinsic motivation; operant behavior; phasic dopamine; reinforcement learning; synaptic plasticity.
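
The interplay between prediction, novelty salience and sensory prediction error described in the abstract can be illustrated with a minimal sketch. This is not the model's actual formulation, which is not reproduced here: the delta-rule update, the learning rate, the inverted-U mapping from prediction to salience, and the names simulate and novelty_salience are all assumptions chosen only to show the qualitative idea that salience is low when an outcome is either never expected or fully predicted.

```python
# Toy illustration only; every equation and constant here is an assumption,
# not the formulation used in the paper.

def novelty_salience(prediction: float) -> float:
    """Hypothetical inverted-U mapping from prediction (0..1) to salience.

    Salience is zero when the outcome is never expected (prediction = 0)
    or fully predicted (prediction = 1), and peaks in between.
    """
    return 4.0 * prediction * (1.0 - prediction)


def simulate(outcomes, rate: float = 0.2):
    """Run a delta-rule predictor over a sequence of interactions.

    outcomes: iterable of booleans, True if the light flashed after
              that interaction with the target object.
    rate:     assumed learning rate of the predictor.
    Returns a list of (prediction, salience, prediction_error) tuples,
    one per interaction, with prediction and salience taken after the update.
    """
    prediction = 0.0
    trace = []
    for outcome in outcomes:
        # Sensory prediction error: observed outcome minus expected outcome.
        error = float(outcome) - prediction
        prediction += rate * error
        trace.append((prediction, novelty_salience(prediction), error))
    return trace


if __name__ == "__main__":
    # Reliable delivery: every interaction with the target produces a flash.
    for step, (p, s, e) in enumerate(simulate([True] * 12), start=1):
        print(f"interaction {step:2d}: prediction={p:.3f} salience={s:.3f} error={e:.3f}")
```

Under this sketch, reliable delivery drives the prediction toward 1, so salience first rises and then decays along with the prediction error, which mirrors the rise and fall in interactions reported for the reliable schedule. An intermittent outcome sequence keeps the prediction at intermediate values and salience stays elevated.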

Figures

Figure 1
(A) Scheme for learning action-outcome associations (see text for details). (B) Loops through basal ganglia, thalamus, and cortex performing action selection in the animal brain. Two competing action channels are shown. The channel on the left, encoding action 1, has a higher salience than channel 2. It has “won” the competition for behavioral expression in basal ganglia, which has therefore released inhibition on its thalamic channel target, thereby allowing the corresponding thalamo-cortical loop to build up activity. Blue/red lines show inhibition/excitation, respectively, and the width of lines encodes signal strength.
Figure 2
In vivo experimental paradigm of Gancarz et al. (2011) (panel A) and our embodied in silico counterpart (panel B). (A) Shows the small test chamber used with rats undergoing instrumental learning. One side of the chamber has two poke holes with a light above them. Rat snout entry into the “active” poke hole may cause the two lights to flash and the active hole may be either one (for a particular rat). (B) Shows the virtual world created as a counterpart to that in (A). A simulated Khepera I robot replaces the rat, and snout holes are replaced by colored blocks. Only the red block is ever designated the active one, and the white block corresponds to the inactive poke hole. There is a point-light located at the top of the red block which may flash if the robot bumps into the red block.
Figure 3
Behavioral data adapted from the in vivo studies of Gancarz et al. (2011) (study 1) and Lloyd et al. (2012) (study 2). (A,B) For variable interval (VI) training from study 1. (A) Shows the number of inactive and active responses in each 2-day period (averaged over the two 30 min sessions therein) with white and black symbols, respectively. The habituation and response contingent phases (see text) are designated “H” and “RC,” respectively, and the average response during the response contingent phase is shown on the extreme right as “Avg.” (B) Shows the within-session behavior during the response contingent phase. Results are averaged over all 10 days of this phase and means are reported for each epoch of 6 min duration during the 30 min sessions. Error bars in both panels are the mean of the standard errors for the low and high responding animals (as originally reported in study 1). (C) Shows active responses (star-shaped data points) from a fixed-ratio (FR1) schedule reported in study 2. Also shown for comparison are the active responses in (A) (black squares). Note that there were more days in the habituation phase of study 2, and error bars in the habituation phase are not shown. (D) Is a counterpart to (B) with FR1 data shown by stars, and the VI data from (B) replicated for comparison (black squares).
Figure 4
The virtual robot control architecture, and its interaction with the robot and environment. The virtual Khepera robot is endowed with a range of sensors and the motor output is locomotion via a pair of wheels. The architecture is split into embedding and biomimetic core components. The embedding architecture contains three action subsystems: two for approaching and bumping into each of the red and white blocks (“interact red block,” “interact white block”), and one (“explore”) for randomly roaming the arena while avoiding object contact. Within each action subsystem the motor command units are designated “motor comm.” The biomimetic core contains a biologically plausible circuit (representing basal ganglia, and its connectivity with cortex, thalamus, and brainstem), a phasic stimulus prediction mechanism, a source of phasic dopamine, and the new learning rules for basal ganglia plasticity. Other symbols and components are labeled as in the main text.
Figure 5
Prediction and its deployment for novelty salience and sensory prediction error under a simple phenomenological model. (A) The red markers indicate the presence or absence of phasic outcome (light flash) during each interaction with the red (active) block. The latent prediction, $y_f^{(*)}(t)$, is shown as the solid line and the phasic prediction, $y_f^{*}(t_i)$, by the open markers. (B) The translation of prediction into novelty salience. (C) The time course of novelty salience corresponding to the prediction in (A), obtained via the mapping in (B). Open circles represent the salience perceived at each block interaction, when the block is in view. These bouts of block-perception are longer than the observation of the light flash, but we identify each interaction with a point-time marker for simplicity. The continuous line is a formal mapping of the latent prediction using Equation (3). (D) The sensory error signal derived from (A).
Figure 6
Schematic diagram of the basal ganglia neural network component of the biomimetic core. (A) Cortex, basal ganglia, brainstem, and thalamic complex. The latter comprises the thalamic reticular nucleus (TRN) and ventrolateral thalamus (VL). Note that action channels are present but not explicitly shown here. (B) The basal ganglia circuit consisting of: striatal projection neurons expressing D1 or D2 dopamine receptors; subthalamic nucleus (STN); output nuclei, namely globus pallidus internal segment (GPi) and substantia nigra pars reticulata (SNr); globus pallidus external segment (GPe); and substantia nigra pars compacta (SNc). The three action channels are shown in this panel, and a typical set of activities is indicated in cartoon form by the gray bars (the channel on the left is highly salient, causing suppression of basal ganglia output on that channel). The summation box below STN is not anatomically present; it is a graphic device to indicate that each target of STN sums its inputs across channels from STN.
Figure 7
Construction of the learning rule. (A) The plasticity coefficients consistent with the data of Shen et al. (2008). (B) The dopamine mixing function α(d) defined in Equation (11). (C,D) The dopamine-dependent versions of the factors CBCM in Equation (10) for D1 and D2-MSNs, respectively.
Figure 8
Behavior of an agent with no novelty salience or internal prediction model, performing the block-bumping experiment. (A,B) For models with, and without, phasic dopamine, respectively (pDA, no-pDA), and each plot is an average over 10 runs. These plots are based on those of the in vivo data in Figure 3. Thus, each panel shows the number of interactions with the block stimuli in each 15 min session comprising a “virtual day” of learning, plotted against such days. Error bars are 1 standard error of the mean. Open symbols are for the white (control) block while solid symbols are for the red block, which elicits a phasic outcome in the response contingent phase (labeled “RC”). The habituation phase (when there is no environmental phasic outcome) is designated “H.” The average of the interactions for each block over the entire response contingent phase is shown in the pair of data points on the extreme right of each panel.
Figure 9
Behavior of an agent with novelty salience and feature prediction performing the block-bumping experiment with variable interval training. (A,C) Have a similar interpretation to counterparts in Figure 8 and are for pDA and no-pDA models, respectively. (B,D) Show the behavior within a virtual “day” (considered as three, 5 min epochs), averaged over the response contingent phase; (B,D) are for pDA and no-pDA, respectively.
Figure 10
Weight trajectories w*(t) for the active response channel, in models with novelty salience and prediction, undergoing variable interval training. Rows are for pDA and no-pDA models, columns for D1- and D2-type MSNs. Weights from motor cortex and sensory cortex are labeled “motor” and “sensory.” The trajectories are expressed as continuous functions of time to show both within-day, and between-day dynamics, and the onset of the response contingent phase (at the start of day 6) is indicated at 75 min. These plots capture the statistics of the weights over a group of 10 models; the dark red line is the mean, and the red-shaded region encompasses ± 1 std dev.
Figure 11
Signals governing learning in pDA models with novelty salience and prediction, undergoing variable interval training. (A) Shows the novelty salience (solid green line) and prediction signal (dashed black line) during the response contingent (RC) phase in a similar way to that used in Figure 5, but here the symbols have been omitted. (B) Is similar to (A), but for a smaller temporal window immediately after the onset of the RC phase. (D) Shows the phasic dopamine signal corresponding to the events in (A). (C) Is similar to (D) and relates to events in panel (B).
Figure 12
Behavior of an agent with novelty salience and feature prediction performing the block-bumping experiment with fixed-ratio training. All panels have the same significance as their counterparts for variable interval training in Figure 9. Thus, (A,B) are for pDA models, whereas (C,D) are for no-pDA models.
Figure 13
Weight trajectories w*(t) for the active response channel, in models with novelty salience and prediction, undergoing fixed-ratio (FR-1) training. All panels have the same significance as their counterparts for variable interval training in Figure 10. Note the different scale for D1 and D2-MSNs.
Figure 14
Internal variables and phasic dopamine signals for a model with novelty salience and prediction undergoing fixed-ratio training. (A,B) Show novelty salience and the prediction signal, and are counterparts to Figures 11A,B. (C,D) Show phasic dopamine and are counterparts to Figures 11C,D.
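
The channel competition described in the captions of Figures 1B and 6B can be sketched in a few lines. This is a deliberately simplified, hypothetical rate model, not the network used in the paper: the function name select_actions, the gains, the tonic output level, and the selection threshold are all assumptions. It keeps only two ideas from the captions: a channel's own salience suppresses basal ganglia output on that channel, and a diffuse, STN-like term summed across channels raises output on every channel.

```python
# Toy illustration only; gains, threshold and the linear form are assumptions.

def select_actions(saliences, tonic=1.0, focus_gain=2.0,
                   diffuse_gain=1.0, threshold=0.2):
    """Return (outputs, selected) for a list of channel saliences.

    outputs:  inhibitory output per channel (clipped at zero).
    selected: indices of channels whose output fell below `threshold`,
              i.e. channels whose thalamic target has been disinhibited.
    """
    # Diffuse excitation: the same cross-channel average reaches every channel.
    diffuse = diffuse_gain * sum(saliences) / len(saliences)
    outputs = []
    for salience in saliences:
        # Focused inhibition: a channel's own salience lowers its own output.
        drive = tonic - focus_gain * salience + diffuse
        outputs.append(max(0.0, drive))
    selected = [i for i, out in enumerate(outputs) if out < threshold]
    return outputs, selected


if __name__ == "__main__":
    # Three channels; the first is the most salient.
    outputs, selected = select_actions([0.8, 0.3, 0.1])
    print("outputs:", outputs)
    print("selected channels:", selected)
```

With these assumed parameters the most salient channel drives its own output to zero while the diffuse term keeps the other channels inhibited, so only one action is released. With no salient input at all, every channel stays at its tonic level and nothing is selected.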

References

    1. Alexander W. H., Sporns O. (2002). An embodied model of learning, plasticity, and reward. Adaptive Behav. 10, 143–159.
    2. Baldassarre G., Mannella F., Fiore V. G., Redgrave P., Gurney K., Mirolli M. (2013). Intrinsically motivated action-outcome learning and goal-based action recall: a system-level bio-constrained computational model. Neural Netw. [Epub ahead of print]. doi: 10.1016/j.neunet.2012.09.015
    3. Bar M. (2007). The proactive brain: using analogies and associations to generate predictions. Trends Cogn. Sci. 11, 280–289. doi: 10.1016/j.tics.2007.05.005
    4. Barto A. G. (1995). Adaptive critics and the basal ganglia, in Models of Information Processing in the Basal Ganglia, eds Houk J. C., Davis J., Beiser D. (Cambridge, MA: MIT Press), 215–232.
    5. Barto A. G., Singh S., Chentanez N. (2004). Intrinsically motivated reinforcement learning, in 18th Annual Conference on Neural Information Processing Systems (NIPS) (Vancouver, BC).