Deconstructing the human algorithms for exploration

Samuel J Gershman

Cognition. 2018 Apr;173:34-42. doi: 10.1016/j.cognition.2017.12.014. Epub 2017 Dec 29.

Abstract

The dilemma between information gathering (exploration) and reward seeking (exploitation) is a fundamental problem for reinforcement learning agents. How humans resolve this dilemma is still an open question, because experiments have provided equivocal evidence about the underlying algorithms used by humans. We show that two families of algorithms can be distinguished in terms of how uncertainty affects exploration. Algorithms based on uncertainty bonuses predict a change in response bias as a function of uncertainty, whereas algorithms based on sampling predict a change in response slope. Two experiments provide evidence for both bias and slope changes, and computational modeling confirms that a hybrid model is the best quantitative account of the data.
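
This distinction can be made concrete with a short sketch of the probit choice rule for a two-armed bandit. The code below is illustrative, not the paper's implementation; it uses the quantities defined in the figure captions (V, RU, TU), and the weighting parameters are placeholders.

```python
# Illustrative sketch (not the paper's code) of how uncertainty enters the
# probit choice rule for the two algorithm families on a two-armed bandit.
# V  = difference in posterior mean value between arms 1 and 2
# RU = relative uncertainty (posterior SD of arm 1 minus posterior SD of arm 2)
# TU = total uncertainty, sqrt(var1 + var2)
import numpy as np
from scipy.stats import norm

def p_choose_arm1_ucb(V, RU, gamma=1.0, lam=1.0):
    # An uncertainty bonus shifts the indifference point: a change in bias.
    return norm.cdf((V + gamma * RU) / lam)

def p_choose_arm1_thompson(V, TU):
    # Sampling from the posterior scales V by total uncertainty: a change in slope.
    return norm.cdf(V / TU)

def p_choose_arm1_hybrid(V, RU, TU, w=(1.0, 1.0, 1.0)):
    # Hybrid rule combining both effects, mirroring the regressors in Figures 2-3.
    return norm.cdf(w[0] * V + w[1] * RU + w[2] * V / TU)
```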

Keywords: Bayesian inference; Explore-exploit dilemma; Reinforcement learning.


Figures

Figure 1. Effects of uncertainty on choice probability for upper confidence bound (Left) and Thompson sampling (Right) algorithms
Each curve shows the probability of choice as a function of the difference in estimated value between the two arms, plotted separately for low/high posterior standard deviation (SD).
Figure 2. Experiment 1: regression results
Choice probability was modeled using probit regression with 3 regressors: value difference (V), relative uncertainty (RU), and value difference normalized by total uncertainty (V/TU). The first 3 panels show the regression coefficients plotted separately for simulated data (UCB and Thompson sampling, top panels) and the empirical data from Experiment 1 (bottom left). The bottom right panel shows the empirical choice probability functions separately for low and high TU, based on a median split. Error bars represent standard error of the mean.
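
For illustration, this regression could be set up as follows in Python (statsmodels), assuming a hypothetical trial-by-trial table with columns choice (1 if arm 1 was chosen), V, RU, and TU; the paper's actual preprocessing may differ.

```python
# Sketch of the probit regression described above; the column names are assumptions.
import pandas as pd
import statsmodels.api as sm

def fit_choice_probit(df: pd.DataFrame):
    X = pd.DataFrame({
        "V": df["V"],                # difference in posterior mean value
        "RU": df["RU"],              # relative uncertainty
        "V/TU": df["V"] / df["TU"],  # value difference scaled by total uncertainty
    })
    X = sm.add_constant(X)
    return sm.Probit(df["choice"], X).fit(disp=False)
```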
Figure 3. Experiment 2: regression results
Choice probability was modeled using probit regression with 3 regressors: value difference (V), relative uncertainty (RU), and value difference normalized by total uncertainty (V/TU). The first 3 panels show the regression coefficients plotted separately for simulated data (UCB and Thompson sampling, top panels) and the empirical data from Experiment 2 (bottom left). The bottom right panel shows the empirical choice probability functions separately for low and high TU, based on a median split. Error bars represent standard error of the mean.
Figure 4. Bayesian model comparison
Protected exceedance probability (PXP) for each model, estimated separately for Experiments 1 and 2. “Value” refers to the value-directed exploration model, in which choice probability is a function of the difference in value between the two arms.
Figure 5. Choice probabilities on the first trial of each block
(Left) Participants preferentially chose arm 1 in Experiment 1 (where it has higher uncertainty than arm 2) but not in Experiment 2 (where it has equal uncertainty). (Middle) The relative uncertainty (RU) coefficient, which captures the degree to which an individual relies on directed exploration, correlated with individual differences in arm 1 preference on the first trial. (Right) Average probability of choosing the risky option on each trial of the experiment.
Figure 6. Response time regression results
Log response times were modeled using linear regression with value difference (V), relative uncertainty (RU), and total uncertainty (TU) regressors. Top left: coefficients estimated from Experiment 1 data. Error bars represent standard error of the mean. Top right: coefficients estimated from Experiment 2 data. Bottom panels show the relationship between choice and response time (RT) coefficients for the relative uncertainty regressor, with superimposed least-squares line.
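
A matching sketch for the response-time analysis, again with assumed column names (rt in seconds, plus V, RU, and TU):

```python
# Sketch of the log-RT linear regression described above; column names are assumptions.
import numpy as np
import statsmodels.api as sm

def fit_rt_regression(df):
    X = sm.add_constant(df[["V", "RU", "TU"]])
    return sm.OLS(np.log(df["rt"]), X).fit()
```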
Figure 7. Relative performance of models
Probability of choosing the optimal arm is plotted as a function of γ (undirected exploration coefficient), averaged across 500 simulated participants. The hybrid algorithm (with β = 4; see Appendix) consistently outperforms both the UCB and Thompson sampling algorithms on the task used in Experiment 2.
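
A simulation of this kind could be sketched as follows. The task parameters, prior, and the exact parametrization of the hybrid rule (here a directed bonus on RU plus undirected scaling of V by γ·TU) are assumptions for illustration; the paper's Appendix specifies the actual setup.

```python
# Rough sketch of simulating a hybrid explorer on a two-armed Gaussian bandit.
# All numerical settings below are placeholders, not the task from Experiment 2.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def run_block(gamma=1.0, beta=4.0, n_trials=10, reward_sd=10.0, prior_sd=100.0):
    true_means = rng.normal(0.0, prior_sd, size=2)  # latent mean reward of each arm
    mu = np.zeros(2)                                # posterior means
    var = np.full(2, prior_sd ** 2)                 # posterior variances
    n_correct = 0
    for _ in range(n_trials):
        V = mu[0] - mu[1]
        RU = np.sqrt(var[0]) - np.sqrt(var[1])
        TU = np.sqrt(var[0] + var[1])
        p1 = norm.cdf(beta * RU + V / (gamma * TU))  # hybrid: bias term + slope term
        arm = 0 if rng.random() < p1 else 1
        reward = rng.normal(true_means[arm], reward_sd)
        k = var[arm] / (var[arm] + reward_sd ** 2)   # Kalman gain
        mu[arm] += k * (reward - mu[arm])
        var[arm] *= 1.0 - k
        n_correct += int(arm == np.argmax(true_means))
    return n_correct / n_trials
```

Averaging run_block over many simulated blocks for a range of γ values would produce the kind of performance curve summarized in the figure.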

