Sci Adv. 2023 Nov 17;9(46):eadg3256. doi: 10.1126/sciadv.adg3256. Epub 2023 Nov 15.

Student of Games: A unified learning algorithm for both perfect and imperfect information games


Martin Schmid et al.

Abstract

Games have a long history as benchmarks for progress in artificial intelligence. Approaches using search and learning produced strong performance across many perfect information games, and approaches using game-theoretic reasoning and learning demonstrated strong performance for specific imperfect information poker variants. We introduce Student of Games, a general-purpose algorithm that unifies previous approaches, combining guided search, self-play learning, and game-theoretic reasoning. Student of Games achieves strong empirical performance in large perfect and imperfect information games, an important step toward truly general algorithms for arbitrary environments. We prove that Student of Games is sound, converging to perfect play as available computation and approximation capacity increases. Student of Games reaches strong performance in chess and Go, beats the strongest openly available agent in heads-up no-limit Texas hold'em poker, and defeats the state-of-the-art agent in Scotland Yard, an imperfect information game that illustrates the value of guided search, learning, and game-theoretic reasoning.


Figures

Fig. 1. An example structure of public belief state β = (spub, r).
spub translates to two sets of information states, one for player 1, 𝒮1(spub) = {s̄0, s̄1}, and one for player 2, 𝒮2(spub) = {s0, s1, s2}. Each information state includes different partitions of possible histories. Finally, r contains reach probabilities for the information states of both players.
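The structure in Fig. 1 can be captured by a small data container. The following is a minimal sketch (hypothetical names, not the authors' code) of a public belief state holding the public state encoding, each player's information states, and the reach probabilities r:

    from dataclasses import dataclass
    from typing import Dict, List, Tuple

    @dataclass
    class PublicBeliefState:
        """beta = (spub, r): a public state plus beliefs over information states."""
        public_state: str                        # encoding of spub (hypothetical: a string key)
        infostates: Tuple[List[str], List[str]]  # 𝒮1(spub) and 𝒮2(spub); each infostate groups histories
        reach: Dict[str, float]                  # r: reach probability of each information state

    # Example mirroring Fig. 1: two infostates for player 1, three for player 2.
    # The probabilities are invented for illustration.
    beta = PublicBeliefState(
        public_state="spub",
        infostates=(["s̄0", "s̄1"], ["s0", "s1", "s2"]),
        reach={"s̄0": 0.5, "s̄1": 0.5, "s0": 0.4, "s1": 0.3, "s2": 0.3},
    )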
Fig. 2. An example of depth-limited CFR solving using decomposition in a game with two specific subgames shown.
Standard CFR would require traversing all the subgames. Depth-limited CFR decomposes the solve into running down to depth d = 2 and using v = vθ(β) to represent the second subgame’s values. On the downward pass, ranges r are formed from policy reach probabilities. Values are passed back up to tabulate accumulating regrets. Re-solving a subgame would require construction of an auxiliary game (36) (not shown).
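To make the decomposition concrete, here is a minimal, self-contained sketch of depth-limited CFR on a toy perfect-information tree: it stops at a depth limit and substitutes a stand-in value function for the untraversed subgame. All names (value_net, depth_limited_cfr) are hypothetical, the payoffs are invented, and beliefs and imperfect information are omitted; only the stop-at-depth-and-query idea from Fig. 2 is illustrated.

    def regret_matching(regrets):
        """CFR's regret matching: play in proportion to positive cumulative regret."""
        positives = [max(r, 0.0) for r in regrets]
        total = sum(positives)
        if total > 0.0:
            return [p / total for p in positives]
        return [1.0 / len(regrets)] * len(regrets)

    # Toy tree: ('decision', player, [children]) or ('leaf', payoff_to_player_0).
    TREE = ('decision', 0, [
        ('decision', 1, [('leaf', 1.0), ('leaf', -2.0)]),
        ('decision', 1, [('leaf', 3.0), ('leaf', 0.5)]),
    ])

    def value_net(node):
        """Stand-in for v = vθ(β): a hypothetical evaluator summarizing the
        subgame below a depth-limited leaf (here, just an average of payoffs)."""
        if node[0] == 'leaf':
            return node[1]
        return sum(value_net(c) for c in node[2]) / len(node[2])

    regrets = {}  # node id -> cumulative regret per action

    def depth_limited_cfr(node, depth, limit, reach0, reach1):
        """One CFR pass that stops at `limit` and queries value_net instead of
        traversing the subgame below (the decomposition idea of Fig. 2)."""
        if node[0] == 'leaf':
            return node[1]
        if depth == limit:
            return value_net(node)              # replace the untraversed subgame
        _, player, children = node
        key = id(node)
        regrets.setdefault(key, [0.0] * len(children))
        strategy = regret_matching(regrets[key])
        child_values = []
        for a, child in enumerate(children):
            # Downward pass: reach probabilities are scaled by the acting player's policy.
            r0 = reach0 * (strategy[a] if player == 0 else 1.0)
            r1 = reach1 * (strategy[a] if player == 1 else 1.0)
            child_values.append(depth_limited_cfr(child, depth + 1, limit, r0, r1))
        node_value = sum(p * v for p, v in zip(strategy, child_values))
        # Upward pass: accumulate counterfactual regrets (values are for player 0,
        # so the sign flips for the minimizing player 1).
        opponent_reach = reach1 if player == 0 else reach0
        sign = 1.0 if player == 0 else -1.0
        for a, v in enumerate(child_values):
            regrets[key][a] += sign * opponent_reach * (v - node_value)
        return node_value

    for _ in range(100):
        depth_limited_cfr(TREE, depth=0, limit=1, reach0=1.0, reach1=1.0)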
Fig. 3. Exploitability of SoG as a function of the number of training steps under different numbers of GT-CFR simulations.
For both (A) Leduc poker and (B) Scotland Yard (glasses map), each line corresponds to a different evaluation condition, e.g., SoG(s, c) used at evaluation time. The ribbon shows the minimum and maximum exploitability over 50 seeded runs for each setup. The units of the y axis in Leduc poker are milli–big blinds per hand (mbb/h), which corresponds to one thousandth of a chip in Leduc. In Scotland Yard, the reward is either −1 (loss) or +1 (win). All networks were trained using a single training run of SoG(100,1), and each x value corresponds to a network trained for that number of steps.
Fig. 4. Scalability of SoG with an increasing number of neural network evaluations, compared to AlphaZero, measured on a relative Elo scale.
The x axis corresponds to the number of simulations in AlphaZero and to s in SoG(s, c). The Elo of SoG(s = 800, c) was set to 0. In chess (A), c = 10 for all runs, with varying s ∈ {800, 2400, 7200, 21600, 64800}. In Go (B), we graph SoG using (s, c) ∈ {(800,1), (2000,10), (4000,10), (8000,10), (16000,16)}.
Fig. 5. Win rate of SoG(400,1) against PimBot with varying numbers of simulations.
Two thousand matches were played for each data point, with roles swapped for half of the matches. Note that the x axis has a logarithmic scale. The ribbon shows the 95% confidence interval.
Fig. 6. A counterfactual value-and-policy network (CVPN).
Each query, β, to the network includes beliefs r and an encoding of spub, producing outputs fθ: the counterfactual values v for both players and policies p for the acting player in each information state si ∈ 𝒮(spub). Since players may have different action spaces (as in, e.g., Scotland Yard), there are two sets of policy outputs, one for each player, and p refers to the one for the acting player at spub only (depicted as player 1 in this diagram by graying out player 2’s policy output).
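The input/output contract of the CVPN can be sketched as follows; the network body, shapes, and parameter names are placeholders (a single tanh layer), not the architecture used by SoG. The sketch only shows that a query β = (encoding of spub, beliefs r) returns counterfactual values for both players plus a per-infostate policy for the acting player (player 1 here):

    import numpy as np

    def make_params(rng, enc_dim, n1, n2, n_actions, hidden_dim=32):
        """Hypothetical parameter shapes for the sketch below."""
        in_dim = enc_dim + n1 + n2
        return {
            "W":  rng.standard_normal((hidden_dim, in_dim)) * 0.1,
            "V1": rng.standard_normal((n1, hidden_dim)) * 0.1,              # counterfactual values, player 1
            "V2": rng.standard_normal((n2, hidden_dim)) * 0.1,              # counterfactual values, player 2
            "P":  rng.standard_normal((n1 * n_actions, hidden_dim)) * 0.1,  # policy head (player 1 acting)
        }

    def cvpn(theta, s_pub_encoding, r1, r2, n_actions):
        """fθ(β): query = (encoding of spub, beliefs r); outputs counterfactual
        values for both players and a policy for the acting player."""
        x = np.concatenate([s_pub_encoding, r1, r2])
        h = np.tanh(theta["W"] @ x)
        v1, v2 = theta["V1"] @ h, theta["V2"] @ h
        logits = (theta["P"] @ h).reshape(len(r1), n_actions)  # one distribution per acting-player infostate
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        return (v1, v2), p

    rng = np.random.default_rng(0)
    theta = make_params(rng, enc_dim=8, n1=2, n2=3, n_actions=4)
    (v1, v2), policy = cvpn(theta, rng.standard_normal(8),
                            np.array([0.5, 0.5]), np.array([0.4, 0.3, 0.3]), 4)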
Fig. 7. Overview of the phases in one iteration of GT-CFR.
The regret update phase propagates beliefs down the tree, obtains counterfactual values from the CVPN at leaf nodes (or from the environment at terminals), and passes the counterfactual values back up to apply the CFR update. The expansion phase simulates a trajectory from the root to a leaf, adding public states to the tree. In this case, the trajectory starts in the public belief state spub by sampling the information state s0. After that, the sampled action a0 leads to the information state s′0 in public state s′pub, and finally, the action a1 leads to a new public state that is added to the tree.
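A heavily simplified sketch of the control flow of one GT-CFR iteration is given below. The regret-update arithmetic (of the kind sketched under Fig. 2) is abstracted behind a callback, the public tree is reduced to a set of expanded public states, and the expansion trajectory samples public successors directly rather than information states and actions; all identifiers are hypothetical.

    import random

    def gt_cfr_iteration(tree, children, policy, run_regret_updates, n_updates=10):
        """One iteration over a public tree kept as a set of expanded public states.
        `children[s]` lists successors of s; `policy[s]` holds the current action
        probabilities at s (refreshed inside the regret updates)."""
        # Regret update phase: several CFR-style passes over the current tree; the
        # arithmetic (regret matching, CVPN values at leaves, environment values at
        # terminals) lives in the callback.
        for _ in range(n_updates):
            run_regret_updates(tree, policy)

        # Expansion phase: simulate a trajectory from the root under the current
        # policy; the first public state that falls outside the tree is added.
        state = "root"
        while children.get(state):
            state = random.choices(children[state], weights=policy[state])[0]
            if state not in tree:
                tree.add(state)
                break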
Fig. 8. SoG training process.
Actors collect data via sound self-play and trainers run separately over a distributed network. (A) Each search produces a number of CVPN queries with input β. (B) Queries are added to a query buffer and subsequently solved by a solver that studies the situation more closely via another invocation of GT-CFR. During solving, new recursive queries might be added back to the query buffer; separately, the network is (C) trained on minibatches sampled from the replay buffer to predict values and policy targets computed by the solver.
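The data flow of Fig. 8 can be sketched as a single-process loop with two buffers; the real system runs actors, solvers, and trainers asynchronously over a distributed network, and every name below (actor_step, solver_step, trainer_step, and the callbacks they take) is hypothetical.

    from collections import deque
    import random

    query_buffer = deque(maxlen=10_000)    # (A) queries β emitted during search
    replay_buffer = deque(maxlen=100_000)  # (query, value target, policy target) tuples

    def actor_step(self_play_search):
        """Sound self-play: each search emits CVPN queries with input β."""
        query_buffer.extend(self_play_search())

    def solver_step(solve_with_gt_cfr):
        """(B) Study a queued query more closely with another GT-CFR invocation.
        Solving may enqueue new recursive queries; its targets go to the replay buffer."""
        if not query_buffer:
            return
        beta = query_buffer.popleft()
        targets, recursive_queries = solve_with_gt_cfr(beta)
        query_buffer.extend(recursive_queries)
        replay_buffer.append((beta, targets))

    def trainer_step(update_network, batch_size=128):
        """(C) Train the CVPN on minibatches sampled from the replay buffer."""
        if len(replay_buffer) >= batch_size:
            update_network(random.sample(list(replay_buffer), batch_size))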

References

    1. A. L. Samuel, Some studies in machine learning using the game of checkers. IBM J. Res. Dev. 44, 206–226 (2000).
    2. S. J. Russell, P. Norvig, Artificial Intelligence: A Modern Approach (Pearson Education, ed. 3, 2010).
    3. D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, D. Hassabis, Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).
    4. M. Campbell, A. J. Hoane, F.-H. Hsu, Deep Blue. Artif. Intell. 134, 57–83 (2002).
    5. G. Tesauro, TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Comput. 6, 215–219 (1994).