Review

Does temporal discounting explain unhealthy behavior? A systematic review and reinforcement learning perspective

Giles W Story et al. Front Behav Neurosci. 2014 Mar 12;8:76. doi: 10.3389/fnbeh.2014.00076. eCollection 2014.
Abstract

The tendency to make unhealthy choices is hypothesized to be related to an individual's temporal discount rate, the theoretical rate at which they devalue delayed rewards. Furthermore, a particular form of temporal discounting, hyperbolic discounting, has been proposed to explain why unhealthy behavior can occur despite healthy intentions. We examine these two hypotheses in turn. We first systematically review studies which investigate whether discount rates can predict unhealthy behavior. These studies reveal that high discount rates for money (and in some instances food or drug rewards) are associated with several unhealthy behaviors and markers of health status, establishing discounting as a promising predictive measure. We then examine whether intention-incongruent unhealthy actions are consistent with hyperbolic discounting. We conclude that intention-incongruent actions are often triggered by environmental cues or changes in motivational state, whose effects are not parameterized by hyperbolic discounting. We propose a framework for understanding these state-based effects in terms of the interplay of two distinct reinforcement learning mechanisms: a "model-based" (or goal-directed) system and a "model-free" (or habitual) system. Under this framework, while discounting of delayed health may contribute to the initiation of unhealthy behavior, with repetition, many unhealthy behaviors become habitual; if health goals then change, habitual behavior can still arise in response to environmental cues. We propose that the burgeoning development of computational models of these processes will permit further identification of health decision-making phenotypes.
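Discount rates of the kind discussed in the abstract are commonly estimated by fitting a hyperbolic discount function, V(A, t) = A / (1 + kt), to indifference points from a delay-discounting task, where a higher fitted k indicates steeper devaluation of delayed rewards. As a minimal illustration only (the example data, function names, and grid-search fit are assumptions for exposition, not the authors' procedure), a Python sketch of this estimation is:

def hyperbolic_value(amount, delay, k):
    """Present value of a reward of size `amount` delayed by `delay`: V = A / (1 + k * delay)."""
    return amount / (1.0 + k * delay)

def fit_discount_rate(indifference_points, delayed_amount=100.0):
    """Grid-search the hyperbolic rate k that best fits observed indifference points.

    `indifference_points` maps delay (days) to the immediate amount judged
    equivalent to `delayed_amount` delivered after that delay.
    """
    candidate_ks = [i * 0.001 for i in range(1, 2001)]  # search k from 0.001 to 2.0

    def squared_error(k):
        return sum((hyperbolic_value(delayed_amount, d, k) - v) ** 2
                   for d, v in indifference_points.items())

    return min(candidate_ks, key=squared_error)

# Hypothetical participant: indifference amounts for a $100 delayed reward.
points = {7: 80.0, 30: 55.0, 90: 30.0, 180: 20.0}
print(f"Estimated hyperbolic discount rate k = {fit_discount_rate(points):.3f} per day")

The fitted k is the individual-difference measure that the reviewed studies relate to unhealthy behaviors and markers of health status.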

Keywords: addiction; discounting; habit; health; hyperbolic; model-based; model-free; preference reversal.


Figures

Figure 1
Hyperbolic discounting predicts myopic preference reversal. Discounted value, V(A, t, τ), under three discount functions is plotted as a function of the decision maker's position in time, τ, where A is the magnitude of the outcome (its instantaneous utility) and t the time at which it is due to be delivered. A larger-later reward, LL, of magnitude l is due to be received at t3, and a smaller-sooner reward, SS, of magnitude s is due to be received at t2. (A) Exponential discounting. The decision-maker has consistent preferences, such that the ratio of the values of two rewards is constant irrespective of how far away the options are in time; in this case the decision-maker always prefers the larger-later reward (i.e., V(l, t3, τ) > V(s, t2, τ) for all τ < t3). (B) Hyperbolic discounting with a low discount rate. The ratio of the values of two rewards is no longer constant as a function of τ. The hyperbolic discount rate, k, governs the steepness of the curvature. Here, where k is low (k = 0.3), the larger-later reward is still preferred, even when the smaller-sooner reward is immediately available. (C) Hyperbolic discounting with a high discount rate. At t1, when both rewards are distant, the larger-later reward is preferred, i.e., V(l, t3, t1) > V(s, t2, t1); however, the smaller-sooner reward becomes increasingly desirable as it approaches in time, such that at t2 the immediately available smaller reward is preferred, i.e., V(l, t3, t2) < V(s, t2, t2). This prediction of hyperbolic discounting has been proposed to underlie the observation that individuals make far-sighted plans for the distant future, but often renege on those plans in favor of short-term gratification when the future arrives.
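To make the preference reversal described in panel (C) concrete, the following sketch evaluates a smaller-sooner and a larger-later reward under both discount functions at two decision times; the reward magnitudes, delivery times, and rates are illustrative assumptions, not values taken from the figure.

import math

def exponential_value(amount, delivery_time, tau, rate):
    """Exponentially discounted value: V = A * exp(-rate * (t - tau))."""
    return amount * math.exp(-rate * (delivery_time - tau))

def hyperbolic_value(amount, delivery_time, tau, k):
    """Hyperbolically discounted value: V = A / (1 + k * (t - tau))."""
    return amount / (1.0 + k * (delivery_time - tau))

# Illustrative magnitudes, delivery times, and rates (assumed, not from the figure).
s, t2 = 5.0, 5.0     # smaller-sooner reward SS of magnitude s, due at t2
l, t3 = 10.0, 10.0   # larger-later reward LL of magnitude l, due at t3
k, r = 0.5, 0.1      # "high" hyperbolic rate k; exponential rate r

for tau in (0.0, 5.0):  # decision times: t1 (both rewards distant) and t2 (SS immediately available)
    hyp = "LL" if hyperbolic_value(l, t3, tau, k) > hyperbolic_value(s, t2, tau, k) else "SS"
    exp = "LL" if exponential_value(l, t3, tau, r) > exponential_value(s, t2, tau, r) else "SS"
    print(f"tau = {tau}: exponential prefers {exp}, hyperbolic prefers {hyp}")

With these numbers the exponential chooser prefers LL at both decision times, whereas the hyperbolic chooser switches to SS once it becomes immediately available, reproducing the crossover the caption describes.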
Figure 2
Interactions between model-based and model-free decision-making. Action values for a hypothetical agent, a person following a diet plan, deciding whether or not to consume biscuits when presented with a cue, the biscuit tin. The agent's choice combines model-based and model-free value. (A) A decision tree (semi-Markov state-space) represented by the model-based system when the agent considers the decision from state P at a time interval, dp, in advance of encountering the biscuit tin, denoted by state B. Alternative courses of action at B, to consume or to abstain, are evaluated by searching through the tree of future possibilities. The choice to consume is followed after a short delay, dc, by a food reward, Rc, associated with consumption, denoted by the state C, followed after a longer delay, dh, by the maintenance of current body weight, denoted by the unrewarded state U. The choice to abstain is followed after delay dc by the unrewarded state A, followed after delay dh by a health benefit with reward, Rh, in the form of weight loss. The agent is naïve to the parallel effects of model-free learning when computing these reward estimates. Model-based action values, QMB, are given by the sum of future rewards following each action, discounted according to a function, D(t), assumed to be exponential and identical across both controllers. The equations below indicate that the model-based system in this instance is indifferent between consuming and abstaining at both P (left-hand equation) and B (right-hand equation). (B) Cached values stored by the model-free system, which reflect the result of prior experience with the outcomes. Neither the outcomes themselves nor the transitions between them are explicitly represented. Similarly, because the distant health consequences have never been experienced, they do not influence the model-free Q-values, QMF. As a result, the model-free system prefers consumption at state B. (C) Model-based and model-free values are assumed to combine according to a weighted average, governed by the parameter ω. At P, where model-free values have no influence, the agent is indifferent between consuming and abstaining. In the presence of the biscuit tin at B, however, the additional influence of model-free (cached) values induces a preference for consumption.
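The weighted-average rule in panel (C) can be written as Q(s, a) = ω·QMB(s, a) + (1 − ω)·QMF(s, a). The sketch below uses hypothetical numbers (the action values and ω are assumptions; only the combination rule and the qualitative pattern of indifference at P and a consumption bias at B come from the caption):

# Minimal sketch of the weighted combination in Figure 2C:
#   Q(s, a) = omega * Q_MB(s, a) + (1 - omega) * Q_MF(s, a)
# Numerical values are assumed for illustration only.

def combined_q(q_mb, q_mf, omega):
    """Weighted average of model-based and model-free action values."""
    return {a: omega * q_mb[a] + (1.0 - omega) * q_mf[a] for a in q_mb}

# Model-based values: tree search values both actions equally here, because the
# discounted food reward and the discounted health benefit are assumed to cancel out.
q_mb = {"consume": 2.0, "abstain": 2.0}

# Model-free (cached) values: consumption has been rewarded in the past; the
# delayed health benefit has never been experienced, so it leaves no trace.
q_mf_at_B = {"consume": 3.0, "abstain": 0.0}   # biscuit tin present (cue)
q_mf_at_P = {"consume": 0.0, "abstain": 0.0}   # no cue, cached values inert

omega = 0.5  # relative weight on the model-based system (assumed)

for state, q_mf in (("P", q_mf_at_P), ("B", q_mf_at_B)):
    q = combined_q(q_mb, q_mf, omega)
    tie = abs(q["consume"] - q["abstain"]) < 1e-9
    verdict = "indifferent" if tie else "prefers " + max(q, key=q.get)
    print(f"State {state}: Q = {q} -> {verdict}")

Running this reproduces the caption's pattern: indifference at P, where cached values carry no weight, and a preference for consumption at B, where they do.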
