J Exp Psychol Gen. 2014 Dec;143(6):2074-81.
doi: 10.1037/a0038199. Epub 2014 Oct 27.

Humans use directed and random exploration to solve the explore-exploit dilemma

Robert C Wilson et al. J Exp Psychol Gen. 2014 Dec.

Abstract

All adaptive organisms face the fundamental tradeoff between pursuing a known reward (exploitation) and sampling lesser-known options in search of something better (exploration). Theory suggests at least two strategies for solving this dilemma: a directed strategy in which choices are explicitly biased toward information seeking, and a random strategy in which decision noise leads to exploration by chance. In this work we investigated the extent to which humans use these two strategies. In our "Horizon task," participants made explore-exploit decisions in two contexts that differed in the number of choices that they would make in the future (the time horizon). Participants were allowed to make either a single choice in each game (horizon 1), or 6 sequential choices (horizon 6), giving them more opportunity to explore. By modeling the behavior in these two conditions, we were able to measure exploration-related changes in decision making and quantify the contributions of the two strategies to behavior. We found that participants were more information seeking and had higher decision noise with the longer horizon, suggesting that humans use both strategies to solve the exploration-exploitation dilemma. We thus conclude that both information seeking and choice variability can be controlled and put to use in the service of exploration.
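The two strategies named in the abstract can be illustrated with a logistic choice rule. The following is a minimal sketch, not necessarily the model the authors fit: the function name, the parameter names, and all numeric values are hypothetical, and any other terms a full model might include (for example a side bias) are left out. An information bonus shifts the choice curve toward the more informative option (directed exploration), while a larger decision-noise scale flattens it (random exploration).

```python
import math


def p_choose_informative(delta_mean, info_bonus, noise):
    """Probability of choosing the more informative option.

    delta_mean: observed mean of the more informative option minus
                that of the less informative option
    info_bonus: value added to the more informative option
                (directed exploration)
    noise:      scale of the decision noise, must be > 0
                (random exploration)
    """
    if noise <= 0:
        raise ValueError("noise must be positive")
    return 1.0 / (1.0 + math.exp(-(delta_mean + info_bonus) / noise))


# Hypothetical parameter values, chosen only for illustration.
short_horizon = {"info_bonus": 0.0, "noise": 5.0}
long_horizon = {"info_bonus": 6.0, "noise": 10.0}

for delta in (-20, -10, 0, 10, 20):
    p1 = p_choose_informative(delta, **short_horizon)
    p6 = p_choose_informative(delta, **long_horizon)
    print(f"delta_mean={delta:+d}  horizon1={p1:.2f}  horizon6={p6:.2f}")
```

With these illustrative values, the long-horizon curve crosses 0.5 at a negative difference in means (the bonus) and rises more gradually (the noise), which is the qualitative pattern the abstract reports.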

Figures

Figure 1
Task design. (A) Example screen shots from a horizon 6 game showing the first free-choice trial, the result of the choice (after it is made), and the start of the second free-choice trial. (B) Schematic showing the different trial types in the two horizon conditions. Each game began with four forced-choice trials before one or a sequence of six free-choice trials. In all conditions, the first free-choice trial (orange) was the main focus of subsequent analyses. (C) Learning curves showing the fraction of times the correct option (i.e., the option with the higher generative mean) was chosen as a function of free-choice trial number for the different horizon conditions. This demonstrates that participants performed at above-chance levels and improved as the game progressed. (D) Correlation between the difference in observed means of each option and the difference in the number of times each option has been played as a function of free-choice trial number. Only the very first trial showed an absence of correlation; thereafter, there was a strong correlation as participants received more information about the more rewarding options that they selected. See the online article for the color version of this figure.
Figure 2
Behavior on the first free-choice trial. Choice curves for the unequal (A) and equal (B) information conditions. Filled circles = experimental data; solid lines = average over participants of model-derived choice curves. (A) These curves show the fraction of times the more informative option was chosen on the first free-choice trial as a function of the difference in mean between the more informative option and the less informative option. As the horizon increased, the more informative option was chosen more often, indicating an information bonus. In addition, the slope of the curves decreased, indicating a change in decision noise. (B) In the equal condition, as the horizon increased there was no change in indifference point because both options were equally informative. The slope of the curves, however, decreased again, consistent with an increase in decision noise with horizon. Mean parameter fits for the information bonus (C) and decision noise in the [1 3] condition (D) and decision noise in the [2 2] condition (E) showing an increase in information bonus and decision noise between horizons 1 and 6. Error bars are s.e.m. across participants. Scatter plots comparing parameter fits for individual subjects in horizon 1 and horizon 6. The dashed lines denote equality. This shows that the increase in information bonus (F) and decision noise (G and H) with horizon holds for almost all of the subjects. See the online article for the color version of this figure.
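The caption reads the information bonus off the curve's indifference point and the decision noise off its slope. That reading can be made explicit under an assumed logistic choice curve. The symbols below are labels introduced here for illustration, not notation taken from the article: $\Delta R$ is the difference in mean (more informative minus less informative option), $A$ is the information bonus, and $\sigma > 0$ is the decision-noise scale.

```latex
\begin{align}
P(\text{informative}) &= \frac{1}{1 + \exp\!\left(-\dfrac{\Delta R + A}{\sigma}\right)} \\
P(\text{informative}) = \tfrac{1}{2} &\iff \Delta R = -A \\
\left.\frac{dP}{d\,\Delta R}\right|_{\Delta R = -A} &= \frac{1}{4\sigma}
\end{align}
```

Under this assumption, a larger bonus $A$ moves the indifference point further toward negative $\Delta R$, and a larger $\sigma$ gives a shallower slope at that point, which matches the two changes described for the longer horizon.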
Figure 3
Change in decision noise across the horizon 6 game. (A) Choice curves for the equal information conditions on the first [2 2], third [3 3] and fifth [4 4] free-choice trials in horizon 6 games. This shows an increase in slope on the later trials relative to the first trial consistent with a decrease in random exploration. (B) Participants’ mean decision noise extracted from the model fits showing a decrease in decision noise in the equal conditions over the horizon 6 games. (C, D, E) Comparison between the individual participants’ decision noise in the three information conditions shows that the decrease in decision noise between the first and later trials holds for most subjects (C) and (D), whereas there is no difference between the decision noise in the [3 3] and [4 4] conditions (E). See the online article for the color version of this figure.
