Sci Adv. 2023 Nov 17;9(46):eadg3256. doi: 10.1126/sciadv.adg3256. Epub 2023 Nov 15.

Student of Games: A unified learning algorithm for both perfect and imperfect information games


Martin Schmid et al.

Abstract

Games have a long history as benchmarks for progress in artificial intelligence. Approaches using search and learning produced strong performance across many perfect information games, and approaches using game-theoretic reasoning and learning demonstrated strong performance for specific imperfect information poker variants. We introduce Student of Games, a general-purpose algorithm that unifies previous approaches, combining guided search, self-play learning, and game-theoretic reasoning. Student of Games achieves strong empirical performance in large perfect and imperfect information games, an important step toward truly general algorithms for arbitrary environments. We prove that Student of Games is sound, converging to perfect play as available computation and approximation capacity increases. Student of Games reaches strong performance in chess and Go, beats the strongest openly available agent in heads-up no-limit Texas hold'em poker, and defeats the state-of-the-art agent in Scotland Yard, an imperfect information game that illustrates the value of guided search, learning, and game-theoretic reasoning.


Figures

Fig. 1. An example structure of public belief state β = (spub, r).
spub translates to two sets of information states, one for player 1, 𝒮1(spub) = {s̄0, s̄1}, and one for player 2, 𝒮2(spub) = {s0, s1, s2}. Each information state includes different partitions of possible histories. Finally, r contains reach probabilities for the information states of both players.
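The structure in Fig. 1 can be captured by a small data container. The following is a minimal sketch (hypothetical names, not the authors' code) of a public belief state holding the public state encoding, each player's information states, and the reach probabilities r:

    from dataclasses import dataclass
    from typing import Dict, List, Tuple

    @dataclass
    class PublicBeliefState:
        """beta = (spub, r): a public state plus beliefs over information states."""
        public_state: str                        # encoding of spub (hypothetical: a string key)
        infostates: Tuple[List[str], List[str]]  # 𝒮1(spub) and 𝒮2(spub); each infostate groups histories
        reach: Dict[str, float]                  # r: reach probability of each information state

    # Example mirroring Fig. 1: two infostates for player 1, three for player 2.
    # The probabilities are invented for illustration.
    beta = PublicBeliefState(
        public_state="spub",
        infostates=(["s̄0", "s̄1"], ["s0", "s1", "s2"]),
        reach={"s̄0": 0.5, "s̄1": 0.5, "s0": 0.4, "s1": 0.3, "s2": 0.3},
    )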
Fig. 2. An example of depth-limited CFR solving using decomposition in a game with two specific subgames shown.
Standard CFR would require traversing all the subgames. Depth-limited CFR decomposes the solve into running down to depth d = 2 and using v = vθ(β) to represent the second subgame’s values. On the downward pass, ranges r are formed from policy reach probabilities. Values are passed back up to tabulate accumulating regrets. Re-solving a subgame would require construction of an auxiliary game (36) (not shown).
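To make the decomposition concrete, here is a minimal, self-contained sketch of depth-limited CFR on a toy perfect-information tree: it stops at a depth limit and substitutes a stand-in value function for the untraversed subgame. All names (value_net, depth_limited_cfr) are hypothetical, the payoffs are invented, and beliefs and imperfect information are omitted; only the stop-at-depth-and-query idea from Fig. 2 is illustrated.

    def regret_matching(regrets):
        """CFR's regret matching: play in proportion to positive cumulative regret."""
        positives = [max(r, 0.0) for r in regrets]
        total = sum(positives)
        if total > 0.0:
            return [p / total for p in positives]
        return [1.0 / len(regrets)] * len(regrets)

    # Toy tree: ('decision', player, [children]) or ('leaf', payoff_to_player_0).
    TREE = ('decision', 0, [
        ('decision', 1, [('leaf', 1.0), ('leaf', -2.0)]),
        ('decision', 1, [('leaf', 3.0), ('leaf', 0.5)]),
    ])

    def value_net(node):
        """Stand-in for v = vθ(β): a hypothetical evaluator summarizing the
        subgame below a depth-limited leaf (here, just an average of payoffs)."""
        if node[0] == 'leaf':
            return node[1]
        return sum(value_net(c) for c in node[2]) / len(node[2])

    regrets = {}  # node id -> cumulative regret per action

    def depth_limited_cfr(node, depth, limit, reach0, reach1):
        """One CFR pass that stops at `limit` and queries value_net instead of
        traversing the subgame below (the decomposition idea of Fig. 2)."""
        if node[0] == 'leaf':
            return node[1]
        if depth == limit:
            return value_net(node)              # replace the untraversed subgame
        _, player, children = node
        key = id(node)
        regrets.setdefault(key, [0.0] * len(children))
        strategy = regret_matching(regrets[key])
        child_values = []
        for a, child in enumerate(children):
            # Downward pass: reach probabilities are scaled by the acting player's policy.
            r0 = reach0 * (strategy[a] if player == 0 else 1.0)
            r1 = reach1 * (strategy[a] if player == 1 else 1.0)
            child_values.append(depth_limited_cfr(child, depth + 1, limit, r0, r1))
        node_value = sum(p * v for p, v in zip(strategy, child_values))
        # Upward pass: accumulate counterfactual regrets (values are for player 0,
        # so the sign flips for the minimizing player 1).
        opponent_reach = reach1 if player == 0 else reach0
        sign = 1.0 if player == 0 else -1.0
        for a, v in enumerate(child_values):
            regrets[key][a] += sign * opponent_reach * (v - node_value)
        return node_value

    for _ in range(100):
        depth_limited_cfr(TREE, depth=0, limit=1, reach0=1.0, reach1=1.0)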
Fig. 3. Exploitability of SoG as a function of the number of training steps under different numbers of GT-CFR simulations.
For both (A) Leduc poker and (B) Scotland Yard (glasses map), each line corresponds to a different evaluation condition, e.g., SoG(s, c) used at evaluation time. The ribbon shows the minimum and maximum exploitability over 50 seeded runs for each setup. The units of the y axis in Leduc poker are milli–big blinds per hand (mbb/h), which corresponds to one thousandth of a chip in Leduc. In Scotland Yard, the reward is either −1 (loss) or +1 (win). All networks were trained using a single training run of SoG(100,1), and each x value corresponds to a network trained for that number of steps.
Fig. 4. Scalability of SoG with an increasing number of neural network evaluations, compared to AlphaZero, measured on a relative Elo scale.
The x axis corresponds to the number of simulations in AlphaZero and to s in SoG(s, c). The Elo of SoG(s = 800, c) was set to 0. In chess (A), c = 10 for all runs, with varying s ∈ {800, 2400, 7200, 21600, 64800}. In Go (B), we graph SoG using (s, c) ∈ {(800,1), (2000,10), (4000,10), (8000,10), (16000,16)}.
Fig. 5. Win rate of SoG(400,1) against PimBot with varying numbers of simulations.
Two thousand matches were played for each data point, with roles swapped for half of the matches. Note that the x axis has a logarithmic scale. The ribbon shows the 95% confidence interval.
Fig. 6. A counterfactual value-and-policy network (CVPN).
Each query, β, to the network includes beliefs r and an encoding of spub, producing outputs fθ: the counterfactual values v for both players and policies p for the acting player in each information state si ∈ 𝒮(spub). Since players may have different action spaces (as in, e.g., Scotland Yard), there are two sets of policy outputs, one for each player, and p refers to the one for the acting player at spub only (depicted as player 1 in this diagram by graying out player 2’s policy output).
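The input/output contract of the CVPN can be sketched as follows; the network body, shapes, and parameter names are placeholders (a single tanh layer), not the architecture used by SoG. The sketch only shows that a query β = (encoding of spub, beliefs r) returns counterfactual values for both players plus a per-infostate policy for the acting player (player 1 here):

    import numpy as np

    def make_params(rng, enc_dim, n1, n2, n_actions, hidden_dim=32):
        """Hypothetical parameter shapes for the sketch below."""
        in_dim = enc_dim + n1 + n2
        return {
            "W":  rng.standard_normal((hidden_dim, in_dim)) * 0.1,
            "V1": rng.standard_normal((n1, hidden_dim)) * 0.1,              # counterfactual values, player 1
            "V2": rng.standard_normal((n2, hidden_dim)) * 0.1,              # counterfactual values, player 2
            "P":  rng.standard_normal((n1 * n_actions, hidden_dim)) * 0.1,  # policy head (player 1 acting)
        }

    def cvpn(theta, s_pub_encoding, r1, r2, n_actions):
        """fθ(β): query = (encoding of spub, beliefs r); outputs counterfactual
        values for both players and a policy for the acting player."""
        x = np.concatenate([s_pub_encoding, r1, r2])
        h = np.tanh(theta["W"] @ x)
        v1, v2 = theta["V1"] @ h, theta["V2"] @ h
        logits = (theta["P"] @ h).reshape(len(r1), n_actions)  # one distribution per acting-player infostate
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        return (v1, v2), p

    rng = np.random.default_rng(0)
    theta = make_params(rng, enc_dim=8, n1=2, n2=3, n_actions=4)
    (v1, v2), policy = cvpn(theta, rng.standard_normal(8),
                            np.array([0.5, 0.5]), np.array([0.4, 0.3, 0.3]), 4)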
Fig. 7. Overview of the phases in one iteration of GT-CFR.
The regret update phase propagates beliefs down the tree, obtains counterfactual values from the CVPN at leaf nodes (or from the environment at terminals), and passes the counterfactual values back up to apply the CFR update. The expansion phase simulates a trajectory from the root to a leaf, adding public states to the tree. In this case, the trajectory starts in the public belief state spub by sampling the information state s0. After that, the sampled action a0 leads to the information state s′0 in public state s′pub, and finally, the action a1 leads to a new public state that is added to the tree.
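A heavily simplified sketch of the control flow of one GT-CFR iteration is given below. The regret-update arithmetic (of the kind sketched under Fig. 2) is abstracted behind a callback, the public tree is reduced to a set of expanded public states, and the expansion trajectory samples public successors directly rather than information states and actions; all identifiers are hypothetical.

    import random

    def gt_cfr_iteration(tree, children, policy, run_regret_updates, n_updates=10):
        """One iteration over a public tree kept as a set of expanded public states.
        `children[s]` lists successors of s; `policy[s]` holds the current action
        probabilities at s (refreshed inside the regret updates)."""
        # Regret update phase: several CFR-style passes over the current tree; the
        # arithmetic (regret matching, CVPN values at leaves, environment values at
        # terminals) lives in the callback.
        for _ in range(n_updates):
            run_regret_updates(tree, policy)

        # Expansion phase: simulate a trajectory from the root under the current
        # policy; the first public state that falls outside the tree is added.
        state = "root"
        while children.get(state):
            state = random.choices(children[state], weights=policy[state])[0]
            if state not in tree:
                tree.add(state)
                break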
Fig. 8. SoG training process.
Actors collect data via sound self-play and trainers run separately over a distributed network. (A) Each search produces a number of CVPN queries with input β. (B) Queries are added to a query buffer and subsequently solved by a solver that studies the situation more closely via another invocation of GT-CFR. During solving, new recursive queries might be added back to the query buffer; separately, the network is (C) trained on minibatches sampled from the replay buffer to predict values and policy targets computed by the solver.
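The data flow of Fig. 8 can be sketched as a single-process loop with two buffers; the real system runs actors, solvers, and trainers asynchronously over a distributed network, and every name below (actor_step, solver_step, trainer_step, and the callbacks they take) is hypothetical.

    from collections import deque
    import random

    query_buffer = deque(maxlen=10_000)    # (A) queries β emitted during search
    replay_buffer = deque(maxlen=100_000)  # (query, value target, policy target) tuples

    def actor_step(self_play_search):
        """Sound self-play: each search emits CVPN queries with input β."""
        query_buffer.extend(self_play_search())

    def solver_step(solve_with_gt_cfr):
        """(B) Study a queued query more closely with another GT-CFR invocation.
        Solving may enqueue new recursive queries; its targets go to the replay buffer."""
        if not query_buffer:
            return
        beta = query_buffer.popleft()
        targets, recursive_queries = solve_with_gt_cfr(beta)
        query_buffer.extend(recursive_queries)
        replay_buffer.append((beta, targets))

    def trainer_step(update_network, batch_size=128):
        """(C) Train the CVPN on minibatches sampled from the replay buffer."""
        if len(replay_buffer) >= batch_size:
            update_network(random.sample(list(replay_buffer), batch_size))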

References

    1. A. L. Samuel, Some studies in machine learning using the game of checkers. IBM J. Res. Dev. 44, 206–226 (2000).
    2. S. J. Russell, P. Norvig, Artificial Intelligence: A Modern Approach (Pearson Education, ed. 3, 2010).
    3. D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, D. Hassabis, Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).
    4. M. Campbell, A. J. Hoane, F.-H. Hsu, Deep Blue. Artif. Intell. 134, 57–83 (2002).
    5. G. Tesauro, TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Comput. 6, 215–219 (1994).