Med Phys. 2017 Dec;44(12):6690-6705. doi: 10.1002/mp.12625. Epub 2017 Nov 14.

Deep reinforcement learning for automated radiation adaptation in lung cancer


Huan-Hsin Tseng et al. Med Phys. 2017 Dec.

Abstract

Purpose: To investigate deep reinforcement learning (DRL) based on historical treatment plans for developing automated radiation adaptation protocols for non-small cell lung cancer (NSCLC) patients, aiming to maximize tumor local control at reduced rates of grade 2 radiation pneumonitis (RP2).

Methods: In a retrospective population of 114 NSCLC patients who received radiotherapy, a three-component neural network framework was developed for deep reinforcement learning (DRL) of dose fractionation adaptation. Large-scale patient characteristics included clinical, genetic, and imaging radiomics features in addition to tumor and lung dosimetric variables. First, a generative adversarial network (GAN) was employed to learn the patient population characteristics necessary for DRL training from a relatively limited sample size. Second, a radiotherapy artificial environment (RAE) was reconstructed by a deep neural network (DNN) utilizing both original and synthetic (GAN-generated) data to estimate the transition probabilities needed to adapt personalized radiotherapy treatment courses. Third, a deep Q-network (DQN) was applied to the RAE to choose the optimal dose in a response-adapted treatment setting. This multicomponent reinforcement learning approach was benchmarked against real clinical decisions applied in an adaptive dose escalation clinical protocol, in which 34 patients were treated based on avid PET signal in the tumor and constrained by a 17.2% normal tissue complication probability (NTCP) limit for RP2. The uncomplicated cure probability (P+) was used as a baseline reward function in the DRL.
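The DQN's decision step rests on the standard Q-learning update. The toy sketch below illustrates that update on a hypothetical discretized dose grid spanning the 1-5 Gy action range mentioned above; the states, dose grid spacing, and reward dynamics are synthetic stand-ins for illustration, not the paper's trained RAE or network architecture.

```python
import numpy as np

# Minimal tabular Q-learning sketch of a dose-adaptation decision step.
# States index a coarse (hypothetical) patient-response bin; actions index
# a discrete dose-per-fraction grid over 1-5 Gy. The toy environment's
# reward peaks at a made-up state-dependent target dose.

rng = np.random.default_rng(0)
n_states = 4
doses = np.linspace(1.0, 5.0, 9)            # 0.5 Gy grid (assumed)
Q = np.zeros((n_states, len(doses)))
alpha, gamma, eps = 0.1, 0.95, 0.1

def step(state, action):
    """Toy transition: reward is highest at a state-dependent target dose."""
    target = 1.5 + 0.5 * state              # hypothetical optimum per state
    reward = -abs(doses[action] - target)
    next_state = rng.integers(n_states)     # synthetic dynamics
    return next_state, reward

state = 0
for _ in range(5000):
    if rng.random() < eps:                  # epsilon-greedy exploration
        action = int(rng.integers(len(doses)))
    else:
        action = int(np.argmax(Q[state]))
    next_state, reward = step(state, action)
    # Q-learning target: r + gamma * max_a' Q(s', a')
    Q[state, action] += alpha * (reward + gamma * Q[next_state].max()
                                 - Q[state, action])
    state = next_state

greedy_doses = doses[np.argmax(Q, axis=1)]  # learned dose choice per state
print(greedy_doses)
```

A DQN replaces the table `Q` with a neural network trained on the same bootstrapped target, which is what allows the method to scale to the high-dimensional patient states described above.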

Results: Taking our adaptive dose escalation protocol as a blueprint for the proposed DRL (GAN + RAE + DQN) architecture, we obtained an automated dose adaptation estimate for use at ∼2/3 of the way into the radiotherapy treatment course. By letting the DQN component freely control the estimated adaptive dose per fraction (ranging from 1 to 5 Gy), the DRL automatically favored dose escalation/de-escalation between 1.5 and 3.8 Gy, a range similar to that used in the clinical protocol. The same DQN yielded two patterns of dose escalation for the 34 test patients, depending on the reward variant. First, using the baseline P+ reward function, the individual adaptive fraction doses of the DQN showed tendencies similar to the clinical data, with an RMSE = 0.76 Gy, but the adaptations suggested by the DQN were generally lower in magnitude (less aggressive). Second, by adjusting the P+ reward function to place higher emphasis on mitigating local failure, better matching of doses between the DQN and the clinical protocol was achieved, with an RMSE = 0.5 Gy. Moreover, the decisions selected by the DQN appeared to have better concordance with patients' eventual outcomes. In comparison, the traditional temporal difference (TD) algorithm for reinforcement learning yielded an RMSE = 3.3 Gy due to numerical instabilities and insufficient learning.
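The baseline reward, the uncomplicated cure probability, is conventionally defined as P+ = TCP × (1 − NTCP). The sketch below computes it with standard logistic dose-response curves; the parameter values (D50, γ50) and the specific form of the re-weighted variant are illustrative assumptions for demonstration, not the paper's fitted models or its modified reward.

```python
import numpy as np

# Uncomplicated cure probability P+ = TCP * (1 - NTCP), the baseline
# reward form named in the abstract. Logistic parameters below are
# illustrative assumptions, not fitted values from the study.

def logistic_response(dose, d50, gamma50):
    """Standard logistic dose-response: value 0.5 at d50, slope set by gamma50."""
    return 1.0 / (1.0 + np.exp(4.0 * gamma50 * (1.0 - dose / d50)))

def p_plus(tumor_dose, lung_dose):
    tcp = logistic_response(tumor_dose, d50=70.0, gamma50=1.5)   # assumed
    ntcp = logistic_response(lung_dose, d50=30.0, gamma50=1.0)   # assumed
    return tcp * (1.0 - ntcp)

def p_plus_weighted(tumor_dose, lung_dose, w=2.0):
    """Hypothetical re-weighting that softens the TCP penalty, so low local
    control is punished relatively more than toxicity (one possible way to
    emphasize mitigating local failure)."""
    tcp = logistic_response(tumor_dose, d50=70.0, gamma50=1.5)
    ntcp = logistic_response(lung_dose, d50=30.0, gamma50=1.0)
    return tcp ** (1.0 / w) * (1.0 - ntcp)

print(p_plus(74.0, 18.0), p_plus_weighted(74.0, 18.0))
```

With any such reward, raising the tumor dose increases TCP while raising the lung dose increases NTCP, so the agent is pushed toward the trade-off the clinicians were also navigating.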

Conclusion: We demonstrated that automated dose adaptation by DRL is a feasible and promising approach for achieving results similar to those chosen by clinicians. The process may require customization of the reward function when individual cases are considered. However, developing this framework into a fully credible autonomous system for clinical decision support would require further validation on larger multi-institutional datasets.

Keywords: adaptive radiotherapy; deep learning; lung cancer; reinforcement learning.


Conflict of interest statement

The authors have no relevant conflicts of interest to disclose.

Figures

Figure 1
A three-component DNN solution to overcome limited sample size and model the radiotherapy environment for DRL decision-making.
Figure 2
A GAN is used to generate new data: G asks D to verify the authenticity of the data source. From latent points z, G synthesizes generated patients x̃. With y = (x, x̃) mixing real and generated patient data, D tries to identify the source of each sample.
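The adversarial game between G and D can be sketched in one dimension. Everything below (a linear generator, a logistic discriminator, Gaussian "patients" with mean 3, and a small weight decay added as a toy stabilizer) is an illustrative stand-in, not the paper's GAN architecture or training recipe.

```python
import numpy as np

# Toy 1-D GAN: generator G maps latent z to synthetic samples x~;
# discriminator D scores real vs. generated. Real "patients" are draws
# from N(3, 1) -- a synthetic stand-in for the patient features.

rng = np.random.default_rng(1)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
a, b = 1.0, 0.0          # generator: x~ = a*z + b
w, c = 0.0, 0.0          # discriminator: D(x) = sigmoid(w*x + c)
lr, batch = 0.05, 64

for _ in range(3000):
    # Discriminator ascent on log D(x) + log(1 - D(x~)).
    x_real = 3.0 + rng.standard_normal(batch)
    x_fake = a * rng.standard_normal(batch) + b
    p_real = sigmoid(w * x_real + c)
    p_fake = sigmoid(w * x_fake + c)
    w += lr * (np.mean((1 - p_real) * x_real) - np.mean(p_fake * x_fake))
    c += lr * (np.mean(1 - p_real) - np.mean(p_fake))
    w *= 0.99            # small weight decay: damps D/G oscillation (toy trick)
    c *= 0.99
    # Generator ascent on the non-saturating objective log D(x~).
    z = rng.standard_normal(batch)
    p = sigmoid(w * (a * z + b) + c)
    a += lr * np.mean((1 - p) * w * z)
    b += lr * np.mean((1 - p) * w)

samples = a * rng.standard_normal(1000) + b
print(samples.mean())    # generated mean drifts toward the real mean (3)
```

Even this toy shows why GAN-generated patients can augment a small cohort: once D can no longer separate the sources, draws from G follow the real data's statistics (here only the mean, since a linear D cannot enforce variance matching).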
Figure 3
Machinery of DQN convergence. The left figure depicts the relation between {Y_i} and {Q^DNN_Θi}, with arrows denoting "create" and "approximate". The right figure illustrates the convergence relationship between the three sequences.
Figure 4
An incomplete MDP where the environment is unknown or missing.
Figure 5
An approximated environment reconstructed from data to simulate the radiotherapy adaptation response of patients. A DQN agent extracts information from the data on the right-hand side based on the two transitions (blue-dashed and red-solid arrows, explained in Eq. (17)) and submits actions at the 2/3 point of a treatment (right solid-green arrow); the environment is hidden in this illustration. At the end of a whole treatment course, the complete information is collected for reconstructing the radiotherapy environment (dashed-green arrow).
Figure 6
Random dropout DNNs (left) are used with data (z^(0), z^(l)) = (x, y), following the notation in Sec. IID, to reconstruct the transition probability of the environment (right), where P̃_sa: (x_1, x_2, …, x_9) → (y_1, y_2, …, y_9) is an approximation to the real world. Different layers are consequences of different actions made on the state (x_1, x_2, …, x_9). At each layer, 100 possible statuses of a patient are considered according to the top transition probabilities. This process repeats at every state of each layer.
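Dropout-based transition estimation can be illustrated with repeated stochastic forward passes: averaging the softmax outputs of a dropout network over many passes yields an empirical estimate of P̃(s′|s, a). In the sketch below the network weights are random stand-ins rather than a trained model, and the 9-way input/output mirrors the nine predictors in the caption.

```python
import numpy as np

# Estimate a transition distribution by averaging many dropout forward
# passes through a small ReLU network. Weights are random placeholders,
# not a trained radiotherapy model; only the mechanism is illustrated.

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 9, 32, 9           # 9 predictors in, 9 next-state bins out
W1 = rng.standard_normal((n_in, n_hidden)) * 0.5
W2 = rng.standard_normal((n_hidden, n_out)) * 0.5

def dropout_forward(x, p_drop=0.5):
    """One stochastic pass: ReLU hidden layer with inverted dropout, softmax out."""
    h = np.maximum(0.0, x @ W1)
    mask = rng.random(n_hidden) >= p_drop  # fresh random mask each pass
    h = h * mask / (1.0 - p_drop)
    logits = h @ W2
    e = np.exp(logits - logits.max())
    return e / e.sum()                     # probability over next-state bins

state = rng.standard_normal(n_in)          # one (synthetic) patient state
samples = np.array([dropout_forward(state) for _ in range(2000)])
p_hat = samples.mean(axis=0)               # empirical estimate of P~(s'|s,a)
print(p_hat.round(3))
```

Keeping only the highest-probability entries of `p_hat` at each layer corresponds to the "top transition probabilities" pruning described in the caption.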
Figure 7
Each subfigure shows the approximate probability distribution of a predictor x_i, i = 1-9, via its histogram. Blue-shaded areas represent the distributions of the original data, while green-shaded areas represent the distributions of the data generated by the GAN.
Figure 8
The mean accuracy of each predictor: (y_1, y_2, …, y_9) = (0.88, 0.65, 0.93, 1.00, 0.55, 0.99, 0.83, 0.98, 0.49), with error bars visualized.
Figure 9
Total rewards of episodes collected by the DQN in two different tasks: the classic MountainCar game compared with the artificial environment reconstructed for dose adaptation.
Figure 10
Automated dose decisions given by the DQN (green dots) vs. clinical decisions (blue dots), with RMSE = 0.76 Gy.
Figure 11
Automated dose decisions given by the DQN (black solid line) vs. clinical decisions (blue dashed line), with RMSE = 0.5 Gy. Good (green dots), bad (red dots), and potentially good (orange dots) decisions are labeled according to Table 2.
Figure 12
Comparison of the 34 patients of protocol 2007-123, divided into groups with and without adaptation. FX1 denotes the dose/fraction given in the first 2/3 of the treatment; "clinical" denotes that of the last 1/3 of the treatment based on protocol 2007-123, compared with the DQN results using the modified reward (Eq. (20)). Note that some patients in FX1 in the left figure were given the maximum dose of 2.85 Gy/fraction according to the protocol in Section 2.G.1.
Figure 13
Automated dose decisions given by the TD(0) method (green dots) vs. clinical decisions (blue dots), with RMSE ≃ 3.3 Gy. A possible reason for the TD method's failure to mimic the clinical decisions is convergence instability.
