Finding structure in multi-armed bandits

Eric Schulz¹, Nicholas T Franklin², Samuel J Gershman³

Affiliations

¹ Harvard University, United States. Electronic address: eric.schulz@mpg.tuebingen.de.
² Harvard University, United States. Electronic address: nfranklin@fas.harvard.edu.
³ Harvard University, United States.

PMID: 32059133
DOI: 10.1016/j.cogpsych.2019.101261

Eric Schulz et al. Cogn Psychol. 2020 Jun:119:101261. doi: 10.1016/j.cogpsych.2019.101261. Epub 2020 Feb 12.

Abstract

How do humans search for rewards? This question is commonly studied using multi-armed bandit tasks, which require participants to trade off exploration and exploitation. Standard multi-armed bandits assume that each option has an independent reward distribution. However, learning about options independently is unrealistic, since in the real world options often share an underlying structure. We study a class of structured bandit tasks, which we use to probe how generalization guides exploration. In a structured multi-armed bandit, options have a correlation structure dictated by a latent function. We focus on bandits in which rewards are linear functions of an option's spatial position. Across 5 experiments, we find evidence that participants utilize functional structure to guide their exploration, and also exhibit a learning-to-learn effect across rounds, becoming progressively faster at identifying the latent function. Our experiments rule out several heuristic explanations and show that the same findings obtain with non-linear functions. Comparing several models of learning and decision making, we find that the best model of human behavior in our tasks combines three computational mechanisms: (1) function learning, (2) clustering of reward distributions across rounds, and (3) uncertainty-guided exploration. Our results suggest that human reinforcement learning can utilize latent structure in sophisticated ways to improve efficiency.

Keywords: Decision making; Exploration-exploitation; Function learning; Gaussian process; Generalization; Latent structure; Learning; Learning-to-learn; Reinforcement learning; Structure learning.
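
To make the task described in the abstract concrete, the following is a minimal sketch of a single round of a linear structured bandit played by an uncertainty-guided learner. It is an illustration under stated assumptions, not the authors' model: Bayesian linear regression stands in for the Gaussian process function learning named in the keywords, exploration uses a simple upper-confidence-bound rule, and clustering across rounds is left out entirely. The function name `run_structured_bandit` and all parameter values (number of arms, slope, intercept, noise level, `beta`, prior variance) are hypothetical choices made for the example.

```python
import numpy as np


def run_structured_bandit(n_arms=8, n_trials=10, slope=5.0, intercept=10.0,
                          noise_sd=1.0, beta=2.0, prior_var=100.0, seed=0):
    """Play one round of a linear structured bandit with a UCB learner.

    All defaults are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    positions = np.arange(n_arms, dtype=float)

    # Features per arm: a bias term plus the arm's spatial position.
    X = np.column_stack([np.ones(n_arms), positions])

    # Latent function: expected reward is linear in position.
    true_means = intercept + slope * positions

    # Gaussian prior over the weights (intercept, slope): N(0, prior_var * I).
    precision = np.eye(2) / prior_var
    xty = np.zeros(2)  # running sum of x * reward / noise variance

    choices, rewards = [], []
    for _ in range(n_trials):
        # Posterior over weights given everything observed so far.
        cov = np.linalg.inv(precision)
        mean_w = cov @ xty

        # Predictive mean and uncertainty for every arm, including untried ones.
        pred_mean = X @ mean_w
        pred_var = np.einsum("ij,jk,ik->i", X, cov, X)
        pred_sd = np.sqrt(np.maximum(pred_var, 0.0))

        # Uncertainty-guided choice: mean plus an exploration bonus.
        arm = int(np.argmax(pred_mean + beta * pred_sd))
        reward = true_means[arm] + rng.normal(0.0, noise_sd)

        # Bayesian linear regression update with the new observation.
        x = X[arm]
        precision += np.outer(x, x) / noise_sd**2
        xty += x * reward / noise_sd**2

        choices.append(arm)
        rewards.append(reward)

    return np.array(choices), np.array(rewards), true_means


if __name__ == "__main__":
    choices, rewards, true_means = run_structured_bandit(slope=-5.0, intercept=45.0)
    print("chosen arms:", choices)
    print("best arm:", int(np.argmax(true_means)))
```

Because the learner fits one function over all positions, a single observation changes its predictions for every arm. An independent-arm learner would instead have to sample each option separately, which is the contrast the abstract draws between standard and structured bandits.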
