PLoS Comput Biol. 2015 Mar 27;11(3):e1004164. doi: 10.1371/journal.pcbi.1004164. eCollection 2015 Mar.

Theory of choice in bandit, information sampling and foraging tasks


Bruno B Averbeck. PLoS Comput Biol.

Abstract

Decision making has been studied with a wide array of tasks. Here we examine the theoretical structure of bandit, information sampling and foraging tasks. These tasks move beyond tasks in which the choice in the current trial does not affect future expected rewards. We have modeled these tasks using Markov decision processes (MDPs). MDPs provide a general framework for modeling tasks in which decisions affect the information on which future choices will be made. Under the assumption that agents are maximizing expected rewards, MDPs provide normative solutions. We find that all three classes of tasks pose choices among actions that trade off immediate and future expected rewards. The tasks drive these trade-offs in unique ways, however. For bandit and information sampling tasks, increasing uncertainty or the time horizon shifts value to actions that pay off in the future. Correspondingly, decreasing uncertainty increases the relative value of actions that pay off immediately. For foraging tasks the time horizon plays the dominant role, as choices do not affect future uncertainty in these tasks.
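The normative solutions referred to here come from dynamic programming over the MDP. A minimal illustrative sketch (the toy two-state model below is my own, not from the paper) shows how backward induction weighs immediate against future expected reward:

```python
# Illustrative finite-horizon value iteration for a generic MDP, the
# normative solution concept described in the abstract. T[s][a] is a list
# of (probability, next_state, reward) triples -- a toy model chosen here
# purely for illustration.

def value_iteration(states, actions, T, horizon):
    """Backward induction: V[t][s] = max_a sum_p p * (r + V[t+1][s'])."""
    V = {s: 0.0 for s in states}          # terminal values
    policy = {}
    for t in range(horizon):
        V_new, policy = {}, {}
        for s in states:
            best = None
            for a in actions(s):
                q = sum(p * (r + V[s2]) for p, s2, r in T[s][a])
                if best is None or q > best:
                    best, policy[s] = q, a
            V_new[s] = best
        V = V_new
    return V, policy

# Toy two-state example: action "now" pays 1 immediately and stays put;
# action "later" pays 0 now but moves to a state worth 2 per step thereafter.
T = {
    "s0": {"now":   [(1.0, "s0", 1.0)],
           "later": [(1.0, "s1", 0.0)]},
    "s1": {"stay":  [(1.0, "s1", 2.0)]},
}
actions = lambda s: list(T[s].keys())

V, pi = value_iteration(["s0", "s1"], actions, T, horizon=10)
```

With a horizon of 10 trials the optimal policy in state s0 forgoes the immediate unit reward to reach the higher-paying state; with a horizon of 1 it takes the immediate reward, illustrating how the time horizon shifts value between actions.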


Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1
Fig 1. Bandit state space.
A. A portion of the reward distribution tree, starting from a Beta(1,1) prior for one of the bandit options. As one of the options is chosen, the outcomes traverse this tree. The number at each node indicates the posterior over the number of rewards (numerator) and the number of times the option has been sampled (denominator). B. Product space across both bandit options. Blue lines (and fractions) indicate choice of option 1, red lines (and fractions) indicate choice of option 2. The numerator and denominator of the fractions are as in panel A and define the posterior probability of a reward. Thick lines show actions that would be taken from each node by an optimal policy, thin dashed lines show options that are not taken by an optimal policy. C. Distribution of reward probabilities (i.e. rewards/choices) over a finite horizon (N = 8 choices) starting from two different beta priors (Option 1: Beta(1,1) and Option 2: Beta(2,2)), which can be interpreted as different amounts of experience with the options. These priors correspond to being in the state 1/2:2/4 indicated in panel B with a box. The solid black bar under the x axis indicates q values for which p(q) is identical. Asterisks superimposed on the plots show the means of the two distributions (0.575 and 0.585 for option 2 and option 1, respectively).
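The bookkeeping behind the tree in panel A can be sketched directly (illustrative code, not the paper's): with a Beta(a, b) prior on an option's reward probability, observing r rewards in n samples gives a Beta(a + r, b + n - r) posterior.

```python
# Beta posterior updating for a single bandit option, as in panel A.

def posterior(a, b, rewards, samples):
    """Posterior Beta parameters and mean after the observed outcomes."""
    a_post = a + rewards
    b_post = b + (samples - rewards)
    return a_post, b_post, a_post / (a_post + b_post)

# Node "1/2" of the tree under the Beta(1,1) prior: posterior Beta(2,2)
node_12 = posterior(1, 1, 1, 2)
# Node "2/4": posterior Beta(3,3) -- same mean, but a tighter distribution
node_24 = posterior(1, 1, 2, 4)
```

Both nodes have posterior mean 0.5; they differ in how concentrated the posterior is, and it is this difference in uncertainty that drives the differences shown in panel C.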
Fig 2
Fig 2. Two armed bandit example.
Panels A and C are shown for a 50 trial fixed horizon model. A. Difference in action value between option 1 and option 2. Blue dots indicate rewarded choice, black dots indicate unrewarded choice (R+ = rewarded, R- = unrewarded). Bracket indicates difference in future expected value shown in panel B. Results are for a finite time horizon model with a 50 trial horizon and no discounting. The agent is not following the optimal policy in this example. Choices and outcomes were fixed to illustrate a specific point. B. Difference in future expected value on trial 4 as a function of time horizon. C. Difference in action value in a scenario in which one of the targets is chosen, and it is rewarded every time except in trial 16.
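The action values plotted here come from backward induction over the joint posterior state. A hedged reimplementation sketch of that standard recursion (Beta(1,1) priors on both arms; an 8-trial horizon is used here instead of the figure's 50 purely to keep the state space small):

```python
from functools import lru_cache

# Finite-horizon two-armed Bernoulli bandit solved by backward induction.
# State = (r1, n1, r2, n2): rewards and samples per arm under Beta(1,1)
# priors; value = expected total reward over the remaining trials.

def bandit_values(horizon):
    @lru_cache(maxsize=None)
    def V(r1, n1, r2, n2, t):
        if t == horizon:
            return 0.0
        return max(Q(r1, n1, r2, n2, t))

    def Q(r1, n1, r2, n2, t):
        p1 = (1 + r1) / (2 + n1)          # posterior mean, arm 1
        p2 = (1 + r2) / (2 + n2)          # posterior mean, arm 2
        q1 = p1 * (1.0 + V(r1 + 1, n1 + 1, r2, n2, t + 1)) \
             + (1 - p1) * V(r1, n1 + 1, r2, n2, t + 1)
        q2 = p2 * (1.0 + V(r1, n1, r2 + 1, n2 + 1, t + 1)) \
             + (1 - p2) * V(r1, n1, r2, n2 + 1, t + 1)
        return q1, q2

    return Q

Q = bandit_values(horizon=8)
q_first = Q(0, 0, 0, 0, 0)    # trial 0, no samples yet: symmetric state
```

On the first trial the two action values are equal by symmetry; after a single unrewarded choice of arm 1, arm 2 has both the higher posterior mean and the greater remaining uncertainty, so its action value is larger.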
Fig 3
Fig 3. Utilities in the novelty task.
Panels A-D, γ = 0.95. A. Total expected value for the three options across a 25 trial sample. Stars indicate trials on which novel choice options were introduced. Colors indicate each choice option. B. Difference in future expected values (FEV: 1–2: difference between future expected value for choosing 1 vs. 2, etc.). C. Future expected values for the three options. Stars indicate trials on which novel choice options were introduced. Colors indicate each of the 3 choice options. D. Choices and rewards for the 25 trial sequence shown in panels A-C. Stars indicate where novel options were introduced. Blue symbols indicate choices that were rewarded (R+), black symbols indicate choices that were not rewarded (R-). Position on the y-axis indicates the choice (e.g. Ch 1 is choose option 1). E. Discount function for different discount rates. F. Exploration bonus (i.e. difference between option 1 and option 2 when option 1 is replaced at trial 15) as a function of the discount parameter, and as a function of the probability of substituting a novel option. As the discount parameter approaches 1, and the time horizon extends further into the future, the novelty bonus increases. There are three x-axes: the first two plot the novelty bonus as a function of either the calculated time horizon or the discount, where the time horizon, N = -1/log_e(γ), is the number of trials at which utility is discounted by 1/e; the third is the x-axis for the substitution rate, plotted with γ = 0.95. The substitution rate is p = 0.05 for the time-horizon line and all other data.
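The mapping between the discount and time-horizon x-axes in panel F is just the identity quoted in the caption: with discount factor γ, a reward N trials ahead is weighted by γ^N, and N = -1/log_e(γ) is the horizon at which that weight has fallen to 1/e. A quick check:

```python
import math

def effective_horizon(gamma):
    """Number of trials at which utility is discounted by a factor of 1/e."""
    return -1.0 / math.log(gamma)

N = effective_horizon(0.95)     # roughly 19.5 trials for gamma = 0.95
weight_at_N = 0.95 ** N         # equals 1/e by construction
```

So γ = 0.95, the value used for most of the figure, corresponds to an effective horizon of about 19.5 trials.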
Fig 4
Fig 4. Utilities in nonstationary 2-armed bandit.
A. Utility as a function of mean of option 1 and mean of option 2, with standard deviation of both options set to 4 and discount rate, γ = 0.90. B. Utility as a function of standard deviation for 2 discount values when mean is 50 for both options and standard deviation is 4 for option 2. C. Estimate of mean and standard deviation of options 1 and 2 as they are sampled under a condition where means are fixed at 45 and 55. Black line indicates choice of option 1 (y = 5) or option 2 (y = 15). Discount rate γ = 0.90. D. Same as panel C, except γ = 0.99. E. Plot of action value for two options for data plotted in C, γ = 0.90. F. Action value for two options for data plotted in panel D, γ = 0.99. G. Example sequence of samples and estimates of mean and variance, γ = 0.90 for means drawn from the generative model. H. Example sequence of samples and estimates of mean and variance, γ = 0.99.
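One standard way to produce running mean and standard-deviation estimates like those in panels C, D, G and H is a Kalman filter on each arm. This is an illustrative sketch only: the observation and drift variances below are assumptions, not values taken from the paper.

```python
# Kalman-filter tracking of a (possibly drifting) arm's payoff mean.

def kalman_update(mu, var, obs, obs_var=16.0, drift_var=1.0):
    """One Kalman step: diffuse the prior, then weigh in the observation."""
    var = var + drift_var            # the arm's mean may drift between trials
    gain = var / (var + obs_var)     # how much to trust the new sample
    mu = mu + gain * (obs - mu)
    var = (1.0 - gain) * var
    return mu, var

mu, var = 50.0, 100.0                # broad prior on the arm's payoff mean
for obs in [55, 57, 54, 56]:         # a few samples from an arm paying ~55
    mu, var = kalman_update(mu, var, obs)
```

Repeated sampling pulls the estimated mean toward the arm's true payoff and shrinks the estimated variance, which is why sampling an option reduces its uncertainty-driven value in the figure.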
Fig 5
Fig 5. Beads task.
A. Example distribution of beads in the beads-in-the-jar task. B. State space for the beads task. The example sequence of draws is taken from panel C. C-F. Action value for the three choice options as a function of draws for example sequences. Bead outcomes are shown as orange and blue beads. The star indicates the first trial on which the expected value of choosing an urn is greater than the expected value of drawing again. In this case an ideal observer would guess the urn with the highest value. Note that this is the value after seeing the bead shown in the corresponding trial. In panels C-E, the cost to sample is C(st,a) = -0.005. In panel F, C(st,a) = -0.025.
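An ideal-observer sketch of this computation (illustrative: the 80/20 bead split, a reward of 1 for a correct urn guess and a 20-draw cap are my assumptions; the sampling cost matches the caption's C(st,a) = -0.005):

```python
from functools import lru_cache

# Beads task: two urns, one mostly blue and one mostly orange. The state
# is the number of blue beads seen in d draws; the agent may guess an urn
# or pay COST to draw another bead.

Q_MAJ, MAX_DRAWS, COST = 0.8, 20, 0.005

def p_blue_urn(n_blue, d):
    """Posterior that the mostly-blue urn was chosen (uniform prior)."""
    like_blue = Q_MAJ ** n_blue * (1 - Q_MAJ) ** (d - n_blue)
    like_orange = (1 - Q_MAJ) ** n_blue * Q_MAJ ** (d - n_blue)
    return like_blue / (like_blue + like_orange)

@lru_cache(maxsize=None)
def V(n_blue, d):
    """Value of the best action: guess either urn, or draw again."""
    pb = p_blue_urn(n_blue, d)
    guess = max(pb, 1 - pb)              # expected reward for guessing now
    if d == MAX_DRAWS:
        return guess
    p_next_blue = pb * Q_MAJ + (1 - pb) * (1 - Q_MAJ)
    draw = -COST + p_next_blue * V(n_blue + 1, d + 1) \
                 + (1 - p_next_blue) * V(n_blue, d + 1)
    return max(guess, draw)
```

The star in the caption corresponds to the first draw at which `guess` exceeds `draw`; raising the cost to 0.025, as in panel F, moves that point earlier.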
Fig 6
Fig 6. Patch leaving foraging task.
A. State space model for the task. The full state space has been collapsed. The state space shown would be repeated, one for each juice, travel delay combination. We show this here by indexing the choice state by these variables. B. Difference in action value for stay in patch vs. travel as a function of current juice and current travel delay. Yellow line indicates point of indifference between stay and travel. C. Difference in action value (same data as plotted in B) with each line representing a different travel delay. D. Average time in patch as a function of current travel delay. Note the curve is discontinuous because of the discretization of the problem. E. Difference in utility with an infinite, undiscounted time horizon. Yellow line indicates point of indifference between stay and travel. F. Difference in action values with an undiscounted, infinite time horizon. Note that the travel delay does not affect value, as would be expected.
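The stay-vs-travel comparison in panels B-D can be sketched with a toy discounted model (the juice amounts, depletion factor and discount below are my own parameterization, not values from the paper): staying harvests the current juice, which then depletes by a fixed factor, while travelling costs a delay of empty steps and restarts a fresh patch.

```python
# Value iteration for a simple patch-leaving problem.

GAMMA, DECAY, J_MAX = 0.95, 0.9, 10.0
N_LEVELS = 60                      # juice after k harvests: J_MAX * DECAY**k

def solve(delay, iters=3000):
    """V[k] = value of being in a patch already harvested k times."""
    V = [0.0] * N_LEVELS
    for _ in range(iters):
        V = [max(J_MAX * DECAY ** k + GAMMA * V[min(k + 1, N_LEVELS - 1)],
                 GAMMA ** delay * V[0])
             for k in range(N_LEVELS)]
    return V

def time_in_patch(delay):
    """First harvest count at which travelling strictly beats staying."""
    V = solve(delay)
    for k in range(N_LEVELS):
        stay = J_MAX * DECAY ** k + GAMMA * V[min(k + 1, N_LEVELS - 1)]
        if GAMMA ** delay * V[0] > stay:
            return k
    return N_LEVELS
```

Longer travel delays make leaving less attractive, so the optimal time in patch grows with the current travel delay, consistent with panel D.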
Fig 7
Fig 7. Sampling foraging task.
A. State space model for the foraging task. The numbers in the circles indicate one of the offer pairs. As there were 6 available individual gambles, there were 15 possible offer pairs in each foraging round. The bottom of the panel shows gambles that would be available in a specific foraging bout. In each trial subjects are shown a pair randomly sampled from the 6 gambles. If they accept the pair, they move on to the decision stage. If they sample again, a new pair is shown, and they have to decide whether to accept the pair, or sample again, etc. B. Expected value for accepting the current gamble or sampling again for an example sequence of draws. The option below the trial number is the option pair that was presented on that trial.
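The accept-vs-sample comparison in panel B can be sketched as a fixed-point computation (illustrative: the six gamble values and the per-sample cost below are assumptions, not from the paper). Taking the value of a pair as its better gamble, the value of sampling again satisfies v = -cost + mean over pairs of max(pair value, v).

```python
import itertools
import statistics

gambles = [1, 2, 3, 4, 5, 6]                      # assumed gamble values
pairs = list(itertools.combinations(gambles, 2))  # C(6,2) = 15 offer pairs
pair_value = {p: max(p) for p in pairs}           # value of a pair = better gamble
COST = 0.1                                        # assumed per-sample cost

def sample_value(iters=500):
    """Iterate v = -COST + mean over pairs of max(pair value, v) to a fixed point."""
    v = 0.0
    for _ in range(iters):
        v = -COST + statistics.mean(max(pair_value[p], v) for p in pairs)
    return v

v_sample = sample_value()
accept = {p: pair_value[p] >= v_sample for p in pairs}  # accept iff pair beats resampling
```

Under these assumptions the fixed point is 5.7, so only pairs containing the best gamble are worth accepting; everything else is worth rejecting in favor of another sample.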
Fig 8
Fig 8. Example MDP.
Note that from state 1, picking action 1 leads to a reward of 1000, and a deterministic transition to state 2. Picking action 2 from state 1 leads to a reward of 1 and a deterministic transition to state 2. Only one action is available in state 2. It leads to a reward of 1 and a deterministic self-transition.
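Working the example through with a discount factor (γ = 0.9 is my choice for illustration; the caption does not fix one): state 2 pays 1 forever, so its value is a geometric series, and the two actions from state 1 differ only in their immediate reward.

```python
GAMMA = 0.9
v_s2 = 1.0 / (1.0 - GAMMA)           # 1 + gamma + gamma**2 + ... = 10
q_action1 = 1000.0 + GAMMA * v_s2    # reward 1000, then state 2 forever
q_action2 = 1.0 + GAMMA * v_s2       # reward 1, then state 2 forever
advantage = q_action1 - q_action2    # 999, independent of gamma
```

Because both actions lead to the same future, the advantage of action 1 is exactly the 999-unit difference in immediate reward, whatever the discount.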
