Magnetic control of tokamak plasmas through deep reinforcement learning

Jonas Degrave^#¹, Federico Felici^#², Jonas Buchli^#³, Michael Neunert^#¹, Brendan Tracey^#⁴, Francesco Carpanese^#¹,⁵, Timo Ewalds^#¹, Roland Hafner^#¹, Abbas Abdolmaleki¹, Diego de Las Casas¹, Craig Donner¹, Leslie Fritz¹, Cristian Galperti⁵, Andrea Huber¹, James Keeling¹, Maria Tsimpoukelli¹, Jackie Kay¹, Antoine Merle⁵, Jean-Marc Moret⁵, Seb Noury¹, Federico Pesamosca⁵, David Pfau¹, Olivier Sauter⁵, Cristian Sommariva⁵, Stefano Coda⁵, Basil Duval⁵, Ambrogio Fasoli⁵, Pushmeet Kohli¹, Koray Kavukcuoglu¹, Demis Hassabis¹, Martin Riedmiller^#¹

Affiliations

¹ DeepMind, London, UK.
² Swiss Plasma Center - EPFL, Lausanne, Switzerland. federico.felici@epfl.ch.
³ DeepMind, London, UK. buchli@deepmind.com.
⁴ DeepMind, London, UK. btracey@deepmind.com.
⁵ Swiss Plasma Center - EPFL, Lausanne, Switzerland.

^# Contributed equally.

PMID: 35173339
PMCID: PMC8850200
DOI: 10.1038/s41586-021-04301-9

Jonas Degrave et al. Nature. 2022 Feb;602(7897):414-419. doi: 10.1038/s41586-021-04301-9. Epub 2022 Feb 16.


Abstract

Nuclear fusion using magnetic confinement, in particular in the tokamak configuration, is a promising path towards sustainable energy. A core challenge is to shape and maintain a high-temperature plasma within the tokamak vessel. This requires high-dimensional, high-frequency, closed-loop control using magnetic actuator coils, further complicated by the diverse requirements across a wide range of plasma configurations. In this work, we introduce a previously undescribed architecture for tokamak magnetic controller design that autonomously learns to command the full set of control coils. This architecture meets control objectives specified at a high level, at the same time satisfying physical and operational constraints. This approach has unprecedented flexibility and generality in problem specification and yields a notable reduction in design effort to produce new plasma configurations. We successfully produce and control a diverse set of plasma configurations on the Tokamak à Configuration Variable^1,2, including elongated, conventional shapes, as well as advanced configurations, such as negative triangularity and 'snowflake' configurations. Our approach achieves accurate tracking of the location, current and shape for these configurations. We also demonstrate sustained 'droplets' on TCV, in which two separate plasmas are maintained simultaneously within the vessel. This represents a notable advance for tokamak feedback control, showing the potential of reinforcement learning to accelerate research in the fusion domain, and is one of the most challenging real-world systems to which reinforcement learning has been applied.

Conflict of interest statement

B.T., F.C., F.F., J.B., J.D., M.N., R.H. and T.E. have filed a provisional patent application about the contents of this manuscript. The remaining authors declare no competing interests.

Figures

**Fig. 1. Representation of the components of our controller design architecture.**
a, Depiction of the learning loop. The controller sends voltage commands on the basis of the current plasma state and control targets. These data are sent to the replay buffer, which feeds data to the learner to update the policy. b, Our environment interaction loop, consisting of a power supply model, sensing model, environment physical parameter variation and reward computation. c, Our control policy is a multilayer perceptron (MLP) with three hidden layers that takes measurements and control targets and outputs voltage commands. d–f, The interaction of TCV and the real-time-deployed control system implemented using either a conventional controller composed of many subcomponents (f) or our architecture using a single deep neural network to control all 19 coils directly (e). g, A depiction of TCV and the 19 actuated coils. The vessel is 1.5 m high, with major radius 0.88 m and vessel half-width 0.26 m. h, A cross section of the vessel and plasma, with the important aspects labelled.
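The policy structure in panel c can be illustrated with a minimal sketch. Only the three hidden layers and the 19 coil outputs come from the caption; the hidden width, the input sizes, the tanh activations, the random initialization and the names `init_policy` and `policy` are assumptions chosen for illustration, not the authors' implementation.

```python
import math
import random

NUM_COILS = 19  # number of actuated coils, from the caption


def init_layer(n_in, n_out, rng):
    """Create one dense layer with small random weights and zero biases."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer, activation):
    """Apply one dense layer followed by an elementwise activation."""
    weights, biases = layer
    return [activation(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def init_policy(n_inputs, hidden=32, seed=0):
    """Three hidden layers plus an output layer (hidden width is hypothetical)."""
    rng = random.Random(seed)
    sizes = [n_inputs, hidden, hidden, hidden, NUM_COILS]
    return [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]


def policy(params, measurements, targets):
    """Map measurements and control targets to normalized voltage commands.

    The tanh output bounds each command to [-1, 1]; scaling to physical
    voltages is left out because the caption does not specify it.
    """
    x = list(measurements) + list(targets)
    for layer in params:
        x = dense(x, layer, math.tanh)
    return x


# Usage with made-up input sizes: 6 measurements and 4 targets.
params = init_policy(n_inputs=10)
commands = policy(params, [0.1] * 6, [0.2] * 4)
```

The point of the sketch is the shape of the mapping: one feedforward pass takes the concatenated measurements and targets and returns one command per coil.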

**Fig. 2. Fundamental capability demonstration.**
Demonstration of plasma current, vertical stability, position and shape control. Top, target shape points with 2 cm radius (blue circles), compared with the post-experiment equilibrium reconstruction (black continuous line in contour plot). Bottom left, target time traces (blue traces) compared with reconstructed observation (orange traces), with the window of diverted plasma marked (green rectangle). Bottom right, picture inside the vessel at 0.6 s showing the diverted plasma with its legs.

**Fig. 3. Control demonstrations.**
Control demonstrations obtained during TCV experiments. Target shape points with 2 cm radius (blue circles), compared with the equilibrium reconstruction plasma boundary (black continuous line). In all figures, the first time slice shows the handover condition. a, Elongation of 1.9 with vertical instability growth rate of 1.4 kHz. b, Approximate ITER-proposed shape with neutral beam heating (NBH) entering H-mode. c, Diverted negative triangularity of −0.8. d, Snowflake configuration with a time-varying control of the bottom X-point, where the target X-points are marked in blue. Extended traces for these shots can be found in Extended Data Fig. 2.

**Fig. 4. Droplets.**
Demonstration of sustained control of two independent droplets on TCV for the entire 200-ms control window. Left, control of I_p for each independent lobe up to the same target value. Right, a picture in which the two droplets are visible, taken from a camera looking into the vessel at t = 0.55 s.

**Extended Data Fig. 1. Pictures and illustration of the TCV.**
a, b, Photographs showing the part of the TCV inside the bioshield. c, CAD drawing of the vessel and coils of the TCV. d, View inside the TCV (Alain Herzog/EPFL), showing the limiter tiling, baffles and central column.

**Extended Data Fig. 2. A larger overview of the shots in Fig. 3.**
We plot the reconstructed values for the normalized pressure β_p and safety factor q_A, along with the range of domain randomization these variables saw during training (in green), which can be found in Extended Data Table 2. We also plot the growth rate, γ, and the plasma current, I_p, along with the associated target value. Where relevant, we plot the elongation κ, the neutral beam heating, the triangularity δ and the vertical position of the bottom X-point Z_X and its target.

**Extended Data Fig. 3. Control variability.**
To illustrate the variability of the performance that our deterministic controller achieves on the environment, we have plotted the trajectories of one policy that was used twice on the plant: in shot 70599 (in blue) and shot 70600 (in orange). The dotted line shows where the cross sections of the vessel are illustrated. The trajectories are shown from the handover at 0.0872 s until 0.65 s after the breakdown, after which, on shot 70600, the neutral beam heating was turned on and the two shots diverge. The green line shows the RMSE distance between the LCFS in the two experiments, providing a direct measure of the shape similarity between the two shots. This illustrates the repeatability of experiments both in shape parameters such as elongation κ and triangularity δ and in the error achieved with respect to the targets in plasma current I_p and the shape of the last closed-flux surface.
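The RMSE distance between two last closed-flux surfaces (LCFS) can be sketched as follows. The caption does not say how boundary points are paired between shots, so this sketch assumes both boundaries are already sampled at the same number of corresponding (R, Z) points; the function name `lcfs_rmse` and the sample coordinates are hypothetical.

```python
import math


def lcfs_rmse(boundary_a, boundary_b):
    """Root-mean-square distance between two boundaries.

    Each boundary is a list of (R, Z) points in metres. Point i of one
    boundary is assumed to correspond to point i of the other.
    """
    if len(boundary_a) != len(boundary_b) or not boundary_a:
        raise ValueError("boundaries must be non-empty and equally sampled")
    squared = [(ra - rb) ** 2 + (za - zb) ** 2
               for (ra, za), (rb, zb) in zip(boundary_a, boundary_b)]
    return math.sqrt(sum(squared) / len(squared))


# Usage with made-up points: the second boundary is the first one
# shifted by 3 cm in R and 4 cm in Z, so every point is 5 cm away.
shot_a = [(0.88, 0.0), (1.10, 0.2), (0.88, 0.4), (0.66, 0.2)]
shot_b = [(r + 0.03, z + 0.04) for r, z in shot_a]
distance = lcfs_rmse(shot_a, shot_b)
```

A single scalar per time slice, as here, is what makes it possible to draw the shape similarity as one trace alongside the other signals.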

**Extended Data Fig. 4. Further observations.**
a, When asked to stabilize the plasma without further specifications, the agent creates a round shape. The agent is in control from t = 0.45 s and changes the shape while trying to attain R_a and Z_a targets. This discovered behaviour is indeed a good solution, as this round plasma is intrinsically stable with a growth rate γ < 0. b, When not given a reward to have similar current on both ohmic coils, the algorithm tended to use the E coils to obtain the same effect as the OH001 coil. This is indeed possible, as can be seen by the coil positions in Fig. 1g, but causes electromagnetic forces on the machine structures. Therefore, in later shots, a reward was added to keep the current in both ohmic coils close together. c, Voltage requests by the policy to prevent the E3 coil current from sticking when crossing 0 A. As can be seen in, for example, Extended Data Fig. 4b, the currents can get stuck on 0 A for low voltage requests, a consequence of how these requests are handled by the power system. As this behaviour was hard to model, we introduced a reward to keep the coil currents away from 0 A. The control policy produces a high voltage request to move through this region quickly. d, An illustration of the difference in cross sections between two different shots, in which the only difference is that the policy on the right was trained with a further reward for avoiding X-points in vacuum.
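The two reward additions described in panels b and c can be pictured as simple penalty terms. The caption gives no functional form, so everything below is an assumption made for illustration: the linear shapes, the thresholds `dead_zone_amps` and `scale_amps`, and the function names are hypothetical and are not taken from the paper.

```python
def zero_crossing_penalty(coil_currents_amps, dead_zone_amps=50.0):
    """Penalty in [0, 1] for coil currents lingering near 0 A.

    Returns 0 when every current lies outside the (hypothetical) dead
    zone and 1 when every current is exactly 0 A.
    """
    if not coil_currents_amps:
        raise ValueError("at least one coil current is required")
    penalties = [max(0.0, 1.0 - abs(current) / dead_zone_amps)
                 for current in coil_currents_amps]
    return sum(penalties) / len(penalties)


def ohmic_balance_penalty(oh1_amps, oh2_amps, scale_amps=1000.0):
    """Penalty in [0, 1] that grows as the two ohmic coil currents diverge."""
    return min(1.0, abs(oh1_amps - oh2_amps) / scale_amps)


# Usage with made-up currents: one coil sits inside the dead zone.
near_zero = zero_crossing_penalty([25.0, 100.0])
imbalance = ohmic_balance_penalty(500.0, 500.0)
```

Subtracting terms like these from the task reward is one way to discourage behaviour that the simulator models poorly, such as slow passage through 0 A, without prescribing the voltages directly.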

**Extended Data Fig. 5. Training progress.**
Episodic reward for the deterministic policy smoothed across 20 episodes with parameter variations enabled, in which 100 means that all objectives are perfectly met. a, Comparison of the learning curve for the capability benchmark (as shown in Fig. 2) using our asymmetric actor-critic versus a symmetric actor-critic, in which the critic uses the same real-time-capable feedforward network as the actor. In blue is the performance with the default critic of 718,337 parameters. In orange, we show the symmetric version, in which the critic has the same feedforward structure and size (266,497 parameters) as the policy (266,280 parameters). When we keep the feedforward structure of the symmetric critic and scale up the critic, we find that increasing its width to 512 units (in green, 926,209 parameters) or even 1,024 units (in red, 3,425,281 parameters) does not bridge the performance gap with the smaller recurrent critic. b, Comparison between using various numbers of actors for stabilizing a mildly elongated plasma. Although the policies in this paper were trained with 5,000 actors, this comparison shows that, at least for simpler cases, the same level of performance can be achieved with much lower computational resources.
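Smoothing an episodic reward curve across 20 episodes can be sketched as a trailing moving average. The caption does not state which smoothing method was used, so the trailing window, the handling of the first few episodes and the function name `smooth_rewards` are assumptions.

```python
def smooth_rewards(episode_rewards, window=20):
    """Trailing moving average over the last `window` episodes.

    Early episodes, where fewer than `window` values exist, are averaged
    over the values available so far.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    running_sum = 0.0
    for index, reward in enumerate(episode_rewards):
        running_sum += reward
        if index >= window:
            # Drop the reward that just fell out of the window.
            running_sum -= episode_rewards[index - window]
        count = min(index + 1, window)
        smoothed.append(running_sum / count)
    return smoothed


# Usage with made-up rewards and a short window for readability.
curve = smooth_rewards([0.0, 10.0, 20.0], window=2)  # [0.0, 5.0, 15.0]
```

With a window of 20, a policy that meets all objectives in every episode would show a flat smoothed curve at 100.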

References

1. Hofmann F, et al. Creation and control of variably shaped plasmas in TCV. Plasma Phys. Control. Fusion. 1994;36:B277. doi: 10.1088/0741-3335/36/12B/023.
2. Coda S, et al. Physics research on the TCV tokamak facility: from conventional to alternative scenarios and beyond. Nucl. Fusion. 2019;59:112023. doi: 10.1088/1741-4326/ab25cb.
3. Anand H, Coda S, Felici F, Galperti C, Moret J-M. A novel plasma position and shape controller for advanced configuration development on the TCV tokamak. Nucl. Fusion. 2017;57:126026. doi: 10.1088/1741-4326/aa7f4d.
4. Mele A, et al. MIMO shape control at the EAST tokamak: simulations and experiments. Fusion Eng. Des. 2019;146:1282–1285. doi: 10.1016/j.fusengdes.2019.02.058.
5. Anand H, et al. Plasma flux expansion control on the DIII-D tokamak. Plasma Phys. Control. Fusion. 2020;63:015006. doi: 10.1088/1361-6587/abc457.
