Proc Natl Acad Sci U S A. 2022 Nov 22;119(47):e2206625119. doi: 10.1073/pnas.2206625119. Epub 2022 Nov 14.

Acquisition of chess knowledge in AlphaZero

Thomas McGrath et al. Proc Natl Acad Sci U S A.

Abstract

We analyze the knowledge acquired by AlphaZero, a neural network engine that learns chess solely by playing against itself yet becomes capable of outperforming human chess players. Although the system trains without access to human games or guidance, it appears to learn concepts analogous to those used by human chess players. We provide two lines of evidence. Linear probes applied to AlphaZero's internal state enable us to quantify when and where such concepts are represented in the network. We also describe a behavioral analysis of opening play, including qualitative commentary by a former world chess champion.

Keywords: artificial intelligence; deep learning; interpretability; machine learning; reinforcement learning.

Conflict of interest statement

The authors declare no competing interest.

Figures

Fig. 1. Probing for human-encoded chess concepts in the AlphaZero network (shown in blue). A probe, a generalized linear function g(z^d), is trained to approximate c(z^0), a human-interpretable concept of the chess position z^0 (the network input). The quality of the approximation g(z^d) ≈ c(z^0), averaged over a test set, indicates how well a layer d (linearly) encodes the concept. For a given concept, the process is repeated for all layers of each network in the sequence of networks produced during training.
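
To make the probing setup concrete, here is a minimal sketch, not the paper's code, of fitting a linear probe on stored layer activations with scikit-learn; the array names and the synthetic stand-in data are illustrative assumptions.

    # Minimal sketch of concept probing, assuming we already have activations
    # z^d from one layer of an AlphaZero-like network (n_positions x d_model)
    # and concept values c(z^0) per position, e.g. a Stockfish evaluation term.
    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n_positions, d_model = 2_000, 256

    # Stand-in data: in practice these come from network forward passes and
    # from a concept function applied to the input positions.
    layer_activations = rng.normal(size=(n_positions, d_model))    # z^d
    concept_values = layer_activations @ rng.normal(size=d_model)  # c(z^0), toy

    z_train, z_test, c_train, c_test = train_test_split(
        layer_activations, concept_values, test_size=0.25, random_state=0)

    probe = Ridge(alpha=1.0).fit(z_train, c_train)        # g(z^d), linear probe
    score = r2_score(c_test, probe.predict(z_test))       # probe quality on test set
    print(f"layer d encodes the concept with R^2 = {score:.3f}")
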
Fig. 2. What–when–where plots for a selection of Stockfish 8 and custom concepts. Following Fig. 1, we count a ResNet “block” as a layer. (A) Stockfish 8’s evaluation of total score. (B) Is the playing side in check? (C) Stockfish 8’s evaluation of threats. (D) Can the playing side capture the opponent’s queen? (E) Could the opposing side checkmate the playing side in one move? (F) Stockfish 8’s evaluation of “material score.” (G) Stockfish 8’s material score; past 10^5 training steps this becomes less predictable from AlphaZero’s later layers. (H) Does the playing side have a pawn that is pinned to the king?
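
A what–when–where plot can be assembled by repeating the probe fit above for every (training checkpoint, layer) pair and showing the test accuracies as a heatmap. The sketch below assumes the probe scores are already computed (random values stand in for them) and uses illustrative checkpoint counts.

    # Hypothetical what-when-where plot: probe accuracy over (checkpoint, layer).
    import numpy as np
    import matplotlib.pyplot as plt

    checkpoints = [1_000, 10_000, 100_000, 1_000_000]   # training steps (illustrative)
    n_layers = 20                                        # ResNet blocks counted as layers

    # probe_r2[i, j] would be the test R^2 of the probe on layer j of the
    # checkpoint-i network; random values stand in for real results here.
    probe_r2 = np.random.default_rng(1).uniform(size=(len(checkpoints), n_layers))

    fig, ax = plt.subplots()
    im = ax.imshow(probe_r2, aspect="auto", origin="lower")
    ax.set_xlabel("network layer (ResNet block)")
    ax.set_ylabel("training checkpoint")
    ax.set_yticks(range(len(checkpoints)))
    ax.set_yticklabels([f"{c:,}" for c in checkpoints])
    fig.colorbar(im, label="probe accuracy (R^2)")
    plt.show()
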
Fig. 3. Evidence of patterns in regression residuals. (A, Upper) True and predicted values of the Stockfish 8 score total_t_ph on a test set, for the probe at depth 10 of the network after a million training steps. The dashed line indicates a perfect fit. The red markers indicate predictions in the 99.95th percentile of residuals. (A, Lower) True value and residual for the Stockfish 8 score, as in A, Upper. The red dashed line indicates the 99.95th percentile cutoff. (B) High-residual positions corresponding to the data points marked in A. Note that black’s queen can be taken in all of these positions; this is true for all 12 high-residual positions where the regressed score is more favorable to white than the Stockfish score.
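
The residual analysis amounts to flagging the test positions whose probe prediction deviates most from the Stockfish target. A minimal sketch, with synthetic stand-in scores and the 99.95th-percentile cutoff quoted in the caption:

    # Sketch of the residual analysis: select positions above the 99.95th
    # percentile of residuals for manual inspection. Data here are synthetic.
    import numpy as np

    rng = np.random.default_rng(2)
    true_score = rng.normal(size=100_000)                      # Stockfish 8 score (toy)
    predicted = true_score + rng.normal(scale=0.1, size=true_score.size)

    residuals = predicted - true_score
    cutoff = np.percentile(residuals, 99.95)                   # red dashed line in Fig. 3
    high_residual_idx = np.flatnonzero(residuals >= cutoff)    # positions to inspect by hand
    print(f"cutoff = {cutoff:.3f}, {high_residual_idx.size} positions above it")
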
Fig. 4. Value regression methodology. We train a generalized linear model on concepts to predict AlphaZero’s value head for each neural network checkpoint.
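
A minimal sketch of such a value regression, assuming a matrix of human-defined concept features per position and AlphaZero's value-head output as the regression target; the concept names and the synthetic data are placeholders, not the paper's feature set.

    # Sketch of the value regression for one checkpoint: fit a linear model
    # from concept features to the value-head output and read off the weights.
    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(3)
    concepts = ["material", "mobility", "king_safety", "threats"]  # illustrative
    X = rng.normal(size=(5_000, len(concepts)))      # concept values per position (toy)
    true_w = np.array([1.0, 0.3, 0.2, 0.1])
    value_head = np.tanh(X @ true_w)                 # stand-in for AlphaZero's value output

    model = Ridge(alpha=1.0).fit(X, value_head)
    for name, w in zip(concepts, model.coef_):
        print(f"{name:12s} weight = {w:+.3f}")       # weights tracked across checkpoints in Fig. 5
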
Fig. 5. Value regression from human-defined concepts over time. (A) Piece value weights converge to values close to those used in conventional analysis. (B) Material predicts value early in training, with more subtle concepts, such as mobility and king safety, emerging later. Error bars in both panels show 95% CIs of the mean across three seeds, with three samples per seed.
Fig. 6. A comparison of AlphaZero’s and human first-move preferences over training steps and time. (A) The evolution of the first-move preference for white over the course of human history, going back to the earliest recorded games of modern chess in the ChessBase database. The early popularity of 1. e4 gives way to a more balanced exploration of different opening systems and an increasing adoption of more flexible systems in modern times. (B) The AlphaZero policy head’s opening-move preferences as a function of training steps, shown on a logarithmic scale. Here, AlphaZero was trained three times from three different random seeds. AlphaZero’s opening evolution starts by weighting all moves equally, no matter how bad, and then narrows down the options. This stands in contrast to the progression of human knowledge, which gradually expanded outward from 1. e4. The AlphaZero prior swings to a marked preference for 1. d4 in the later stages of training. This preference should not be overinterpreted, as self-play training is based on quick games with added stochasticity to boost exploration.
Fig. 7. The top four six-ply continuations from AlphaZero’s prior after 1. e4 e5 2. ♘f3 ♘c6 3. ♗b5, in shades of red, and the top six six-ply human grandmaster continuations, in shades of blue. Assuming 25 legal moves per ply, there are around a quarter of a billion (25^6) such six-ply lines of play. Top and Middle show the relative contribution of these 10 lines in AlphaZero’s prior as training progresses and their frequency in 40 years of human grandmaster play. The AlphaZero training run rapidly adopts the Berlin defense (3 … ♘f6) after around 100k training steps (Table 1). The rows of the table in Bottom show the 10 lines, with identical plies faded to show branching points. The light green bars next to the moves compare the prior preference of the fully trained AlphaZero network (green) with grandmaster move frequencies in 2018 (pink).
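
The count quoted in the caption can be checked directly: with roughly 25 legal moves available at each ply, six plies give 25^6 lines.

    # Quick arithmetic check of the estimate in the Fig. 7 caption.
    lines = 25 ** 6
    print(f"25^6 = {lines:,}")  # 244,140,625, i.e. roughly a quarter of a billion
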
Fig. 8. Rapid discovery of basic openings. The randomly initialized AlphaZero network gives a roughly uniform prior over all moves. The distribution stays roughly uniform for the first 25k training iterations, after which popular opening moves quickly gain prominence. In particular, 1. e4 is fully adopted as a sensible move within a window of 10k training steps, roughly 1% of AlphaZero’s training time. (A) After 25k training iterations, e4 and d4 are discovered to be good opening moves and rapidly adopted within a short period of around 30k training steps. (B) Rapid discovery of options given 1. e4 e5. Within a short space of time, ♘f3 is settled on as the standard reply, whereas d4 is considered and discarded. (C) In the window between 25k and 60k training steps, AlphaZero learns to put 80% of its mass on four replies to e4 and only 20% on the other 16 moves.
