Credit assignment in multiple goal embodied visuomotor behavior

Constantin A Rothkopf et al. Front Psychol. 2010 Nov 22;1:173. doi: 10.3389/fpsyg.2010.00173. eCollection 2010.

Abstract

The intrinsic complexity of the brain can lead one to set aside issues related to its relationships with the body, but the field of embodied cognition emphasizes that understanding brain function at the system level requires one to address the role of the brain-body interface. It has only recently been appreciated that this interface performs huge amounts of computation that does not have to be repeated by the brain, and thus affords the brain great simplifications in its representations. In effect, the brain's abstract states can refer to coded representations of the world created by the body. But even if the brain can communicate with the world through abstractions, the severe speed limitations in its neural circuitry mean that vast amounts of indexing must be performed during development so that appropriate behavioral responses can be rapidly accessed. One way this could happen would be if the brain used a decomposition whereby behavioral primitives could be quickly accessed and combined. This realization motivates our study of independent sensorimotor task solvers, which we call modules, in directing behavior. The issue we focus on herein is how an embodied agent can learn to calibrate such individual visuomotor modules while pursuing multiple goals. The biologically plausible standard for module programming is that of reinforcement given during exploration of the environment. However, this formulation contains a substantial issue when sensorimotor modules are used in combination: the credit for their overall performance must be divided amongst them. We show that this problem can be solved and that diverse task combinations are beneficial in learning and not a complication, as usually assumed. Our simulations show that fast algorithms are available that allot credit correctly and are insensitive to measurement noise.

Keywords: credit assignment; learning; modules; reinforcement; reward.
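The abstract's core idea, independent reinforcement-learning modules whose action preferences are combined into a single behavior, can be illustrated with a minimal sketch. The class and function names below (Module, combined_action) are our own illustrative assumptions, not the paper's code.

    import numpy as np

    class Module:
        """One sensorimotor task solver with its own state and Q-table."""
        def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.95):
            self.Q = np.zeros((n_states, n_actions))
            self.alpha, self.gamma = alpha, gamma

        def update(self, s, a, r, s_next):
            # Standard Q-learning step using this module's own reward signal.
            target = r + self.gamma * self.Q[s_next].max()
            self.Q[s, a] += self.alpha * (target - self.Q[s, a])

    def combined_action(modules, states):
        # Each module votes with the Q-values for its own state estimate;
        # the agent takes the action with the highest summed value.
        total = sum(m.Q[s] for m, s in zip(modules, states))
        return int(np.argmax(total))

Summing Q-values across modules is one common composition rule in modular reinforcement learning; the paper's contribution concerns how each module obtains its own reward r when only a global signal is observed.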


Figures

Figure 1
Value functions (Top row) and their associated policies (Bottom row) for each of three modules. These functions for obstacle avoidance, litter collection, and sidewalk preference in left to right order have been learned by a virtual avatar walking along a sidewalk strewn with litter and obstacles. The red disk marks the state estimate uncertainty for each of them for a particular moment in the traverse.
Figure 2
Human gaze data for the same environment showing striking evidence for visual routines. Humans in the same environment as the avatar precisely manipulate gaze location depending on the specific task goal. The small black dots show the location of all fixation points on litter and obstacles. When avoiding obstacles (left) gaze points cluster at the edges of the object. When picking up a similar object (right) gaze points cluster on the center. From Rothkopf and Ballard (2009).
Figure 3
Module-based gaze allocation. Modules compete for gaze in order to update their measurements. (A) A caricature of the basic method for a given module. The trajectory through the agent's state space is estimated using a Kalman filter that propagates estimates in the absence of measurements and, as a consequence, builds up uncertainty (large shaded area). If the behavior succeeds in obtaining a fixation, state space uncertainty is reduced (smaller shaded area). The reinforcement learning model allows the value of reducing uncertainty to be calculated. (B) The Sprague model outperforms the other models. Bars, left to right: Sprague model (1), round-robin (2), random selection (3).
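A rough sketch of the bookkeeping in panel (A), under simplifying one-dimensional assumptions; the function names and noise parameters below are ours, not the paper's:

    import numpy as np

    def predict(mean, var, q=0.05):
        # Without a fixation, the Kalman prediction step only propagates the
        # estimate and inflates its variance by the process noise q.
        return mean, var + q

    def correct(mean, var, z, r=0.01):
        # A fixation supplies a measurement z with noise r; the update step
        # shrinks the state uncertainty.
        k = var / (var + r)            # Kalman gain
        return mean + k * (z - mean), (1 - k) * var

    def choose_fixation(variances, values):
        # Caricature of uncertainty-based gaze scheduling: give gaze to the
        # module whose uncertainty, weighted by task value, is largest.
        return int(np.argmax(np.asarray(variances) * np.asarray(values)))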
Figure 4
Schematic representation of the modular credit assignment problem. (A) In any period during behavior, only a subset of the total module set is active. We term these periods episodes. In the time course of behavior, modules that are needed become active and those that are no longer needed become inactive. The schematic depicts two sequential episodes of three modules each, denoted with different shadings. The vertical arrows denote the scheduler's action in activating and deactivating modules. Our formal results depend only on each module being chosen sufficiently often, not on the details of the selection strategy. The same module may be selected in sequential episodes. (B) A fundamental problem for a biological agent using a modular architecture. At any given instant (dotted lines), when multiple modules are active and only a global reward signal G is available, each module has to be able to calculate how much of the reward is due to its own activation. This is known as the credit assignment problem. Our setting simplifies the problem by assuming that individual reinforcement learning modules are independent and communicate only their estimates of their reward values.
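One way to realize this, in the spirit of the paper's scheme (whose exact update is its Eq. 15; the scalar-per-module form below is our simplified reading), is for each active module to explain the global reward G as its own share plus the other active modules' current estimates:

    def assign_credit(estimates, active, G, beta=0.1):
        """estimates: dict module_id -> current reward estimate.
        active: ids of the modules active this episode. G: global reward."""
        for i in active:
            others = sum(estimates[j] for j in active if j != i)
            # Module i nudges its estimate toward the unexplained residual.
            estimates[i] += beta * (G - others - estimates[i])
        return estimates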
Figure 5
Predator-prey grid-world example following Singh and Cohn (1998). An agent is located on a 5×5 grid and searches for three different food sources f1 to f3 while trying to avoid the predator p, which moves toward the agent every other time step.
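A toy reconstruction of this environment's dynamics, with movement and boundary-clipping conventions that are our assumptions rather than the paper's:

    SIZE = 5

    def step_predator(pred, agent, t):
        # The predator moves toward the agent only on every other time step.
        if t % 2 == 0:
            return pred
        dx = (agent[0] > pred[0]) - (agent[0] < pred[0])
        dy = (agent[1] > pred[1]) - (agent[1] < pred[1])
        return (pred[0] + dx, pred[1] + dy)

    def step_agent(agent, action):
        # Actions: 0=up, 1=down, 2=left, 3=right, clipped to the grid.
        moves = [(0, 1), (0, -1), (-1, 0), (1, 0)]
        dx, dy = moves[action]
        return (min(max(agent[0] + dx, 0), SIZE - 1),
                min(max(agent[1] + dy, 0), SIZE - 1))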
Figure 6
Comparison of learning progress. (A) Effect of different learning rates and the variance-weighted learning ("Var.w.") on the accumulated reward over episodes for the simulated foraging agent for the case of only knowing the global reward Gt. For comparison, a case where the rewards given to each module were known is shown as the "observed rewards" trace. (B) Root mean squared error between the true rewards and the reward estimates of all behaviors over episodes. The three curves correspond to different learning rates β in Eq. 15. (C) Root mean squared error between the true value function and the learned value functions of all behaviors over trials.
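What a variance-weighted update ("Var.w." in panel A) could look like: the effective learning rate is a Kalman-style gain derived from the estimate's variance rather than a fixed β. This is our reading of the idea, not the paper's exact Eq. 15:

    def var_weighted_update(est, est_var, target, obs_var=1.0):
        gain = est_var / (est_var + obs_var)   # larger when we are uncertain
        est = est + gain * (target - est)
        est_var = (1.0 - gain) * est_var       # confidence grows with data
        return est, est_var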
Figure 7
The walkway navigation tasks. Left: Typical view from the agent while navigating along the walkway. The three possible tasks are following the walkway, avoiding obstacles (the dark cylinders), and picking up litter (the light cylinders). Right: Schematic representation of the state-space parameterization for the learners. Each module represents the distance to the closest object in the field of view and the angle between the current heading direction and the object's center axis. The module learning the walkway behavior uses the signed distance to the midline of the walkway and the angle between the heading direction and the direction of the walkway.
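The described state-space parameterization might be encoded as follows; the function name and the angle-wrapping convention are illustrative assumptions:

    import math

    def object_state(agent_xy, heading, obj_xy):
        # Distance and bearing to the nearest task-relevant object.
        dx, dy = obj_xy[0] - agent_xy[0], obj_xy[1] - agent_xy[1]
        dist = math.hypot(dx, dy)
        # Angle between the current heading and the object's center axis,
        # wrapped to [-pi, pi].
        angle = math.atan2(dy, dx) - heading
        angle = (angle + math.pi) % (2 * math.pi) - math.pi
        return dist, angle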
Figure 8
Reward calculations for the walkway navigation task for the three component behaviors using the credit assignment algorithm. (A) Top row: Initial estimates of the reward functions. Bottom row: Final reward estimates. (B) Time course of learning reward functions for each of the three component behaviors. RMS error between true and calculated reward as a function of iteration number.
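The RMS error curves in (B) correspond to a computation like the following, assuming true and estimated rewards are tabulated over the same states:

    import numpy as np

    def rms_error(true_r, est_r):
        # Root mean squared error between true and estimated rewards.
        diff = np.asarray(true_r) - np.asarray(est_r)
        return float(np.sqrt(np.mean(diff ** 2)))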
Figure 9
Representations of value functions and policies in the walkway navigation task for the three component behaviors. (A) Top row: initial value function estimates V̂(s). Bottom row: final value estimates. (B) Representations of policies. Top row: initial policy estimates. Bottom row: final policy estimates. The navigation actions are coded as follows: left turns are red, straight ahead is light green, right turns are blue.
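A policy like those in panel (B) can be read out of a learned value function by acting greedily with a one-step lookahead; the transition and reward callables below are assumed for illustration:

    def greedy_policy(V, states, actions, next_state, reward, gamma=0.95):
        policy = {}
        for s in states:
            # Pick the action whose successor has the best reward-plus-value.
            policy[s] = max(actions,
                            key=lambda a: reward(s, a) + gamma * V[next_state(s, a)])
        return policy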

References

    1. Anderson J. (1983). The Architecture of Cognition. Cambridge, MA: Harvard University Press.
    2. Arkin R. (1998). Behavior-Based Robotics. Cambridge, MA: MIT Press.
    3. Badler N., Palmer M., Bindiganavale R. (1999). Animation control for real-time virtual humans. Commun. ACM 42, 64–73. doi: 10.1145/310930.310975
    4. Badler N. I., Phillips C. B., Webber B. L. (1993). Simulating Humans: Computer Graphics Animation and Control. New York, NY: Oxford University Press.
    5. Ballard D. H., Hayhoe M. M., Pook P. K., Rao R. P. N. (1997). Deictic codes for the embodiment of cognition. Behav. Brain Sci. 20, 723–767. doi: 10.1017/S0140525X97001611