Hyperbolically discounted temporal difference learning

William H Alexander et al. Neural Comput. 2010 Jun;22(6):1511-27. doi: 10.1162/neco.2010.08-09-1080.

Abstract

Hyperbolic discounting of future outcomes is widely observed to underlie choice behavior in animals. Additionally, recent studies (Kobayashi & Schultz, 2008) have reported that hyperbolic discounting is observed even in neural systems underlying choice. However, the most prevalent models of temporal discounting, such as temporal difference learning, assume that future outcomes are discounted exponentially. Exponential discounting has been preferred largely because it can be expressed recursively, whereas hyperbolic discounting has heretofore been thought not to have a recursive definition. In this letter, we define a learning algorithm, hyperbolically discounted temporal difference (HDTD) learning, which constitutes a recursive formulation of the hyperbolic model.
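The recursion the letter derives for HDTD is not reproduced on this page, but the contrast it addresses can be sketched directly: an exponential discount factor γ^t satisfies a constant one-step recursion, whereas the standard hyperbolic factor 1/(1 + κt) does not. A minimal illustration (parameter values are assumptions, not taken from the paper):

```python
# Sketch (not the paper's HDTD algorithm): why exponential discounting is
# trivially recursive while plain hyperbolic discounting is not.

def exponential_discount(t, gamma=0.9):
    """Exponential discount: d(t) = gamma**t, so d(t+1) = gamma * d(t)."""
    return gamma ** t

def hyperbolic_discount(t, kappa=0.15):
    """Hyperbolic discount: d(t) = 1 / (1 + kappa * t)."""
    return 1.0 / (1.0 + kappa * t)

# Exponential discounting factors with a constant one-step ratio:
assert abs(exponential_discount(5) - 0.9 * exponential_discount(4)) < 1e-12

# The hyperbolic one-step ratio d(t+1)/d(t) = (1 + k*t)/(1 + k*(t+1))
# depends on t, so no single constant factor works:
r_early = hyperbolic_discount(1) / hyperbolic_discount(0)
r_late = hyperbolic_discount(11) / hyperbolic_discount(10)
assert r_early != r_late  # ~0.870 vs ~0.943
```

This t-dependence of the one-step ratio is what has made a recursive (TD-style) formulation of hyperbolic discounting nontrivial, and it is the gap the HDTD algorithm is proposed to close.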


Figures

Figure 1
Learned value and hazard functions for the HDTD model compared with those from the non-recursive hyperbolic discounting model (κ = 0.15). For a reward delivered at t = 30 (vertical line), both the hyperbolic discounting model and HDTD have the same value function. The HDTD model learns the appropriate value function over the course of 1000 trials. Similarly, the HDTD hazard function corresponds exactly with the hyperbolic discounting hazard function.
Figure 2
Behavior of the HDTD model (A) when the discounting factor is not scaled by the estimated reward per trial (eq. 2.4, κ = 0.2), and (B) when the discounting factor is scaled by the estimated reward per trial (eq. 2.6, κ = 0.2, σ = 1). In (B), the HDTD model reverses preferences depending on the temporal proximity of two unequal rewards. When a small reward is immediately available (t1), the value function for that reward (solid line) is higher than that for a larger delayed reward (dashed line). However, when the distance to both rewards is increased (t2), the preferences reverse: the value function for the larger reward is higher than that for the smaller.
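The preference reversal described for Figure 2 is the signature behavior of hyperbolic discounting and can be checked numerically. The sketch below uses the simple hyperbolic value V = A / (1 + κD) with κ = 0.2 as in the figure; the reward amounts and delays are illustrative assumptions, not the paper's simulation:

```python
# Illustrative preference reversal under hyperbolic value V = A / (1 + k*D).
# kappa matches the figure (0.2); amounts and delays are assumed for illustration.
kappa = 0.2

def value(amount, delay):
    """Hyperbolically discounted value of a reward `amount` at `delay`."""
    return amount / (1.0 + kappa * delay)

small, large = 1.0, 2.0  # large reward arrives 10 steps after the small one

# Near horizon (t1): small reward immediate, large reward 10 steps away.
assert value(small, 0) > value(large, 10)    # 1.000 > 0.667: prefer small

# Far horizon (t2): both rewards pushed back 20 steps; preference reverses.
assert value(small, 20) < value(large, 30)   # 0.200 < 0.286: prefer large
```

Under exponential discounting the ratio of the two values is independent of the common shift, so no such reversal occurs; the crossover arises only because the hyperbolic one-step discount weakens with delay.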
Figure 3
The HDTD model and average reward TD learning were fit to data from Brunner (1999). (A) Rewards were delivered according to two schedules, increasing (top) and decreasing (bottom); the average reward is the same for both schedules. (B) The average reward TD model is indifferent to reward schedule, while the HDTD model strongly prefers the decreasing reward schedule at short delays, in accordance with Brunner (1999). The best-fit parameters for the HDTD model are κ = 0.544, σ = 0.741, and φ = 54.85. Parameters found for the average reward TD model were θ = 0.0010, φ = 0.9841, and α (learning parameter) = 0.0986. The HDTD fit yielded a mean-squared error (MSE) of 0.0050, while the average reward model fit yielded an MSE of 0.1226. Data were approximated from Brunner (1999), figure 1.

References

    1. Amador N, Schlag-Rey M, et al. Reward-predicting and reward-detecting neuronal activity in the primate supplementary eye field. J Neurophysiol. 2000;84(4):2166–2170.
    2. Brunner D. Preference for sequences of rewards: further tests of a parallel discounting model. Behavioural Processes. 1999;45(1–3):87–99.
    3. Daw ND, Touretzky DS. Behavioral considerations suggest an average reward TD model of the dopamine system. Neurocomputing. 2000;32–33:679–684.
    4. Daw ND, Touretzky DS. Long-term reward prediction in TD models of the dopamine system. Neural Comput. 2002;14(11):2567–2583.
    5. Green L, Myerson J. Exponential versus hyperbolic discounting of delayed outcomes: risk and waiting time. Amer. Zool. 1996;36(4):496–505.
