J Mach Learn Res. 2022;23(250). https://www.jmlr.org/papers/v23/21-0354.html

Non-asymptotic Properties of Individualized Treatment Rules from Sequentially Rule-Adaptive Trials


Daiqi Gao et al. J Mach Learn Res. 2022.

Abstract

Learning optimal individualized treatment rules (ITRs) has become increasingly important in the modern era of precision medicine. Many statistical and machine learning methods for learning optimal ITRs have been developed in the literature. However, most existing methods are based on data collected from traditional randomized controlled trials and thus cannot take advantage of the evidence that accumulates as patients enter the trial sequentially. It is also ethically important that future patients have a high probability of being treated optimally based on the knowledge accrued so far. In this work, we propose a new design, called sequentially rule-adaptive trials, for learning optimal ITRs within the contextual bandit framework, in contrast to the response-adaptive design of traditional adaptive trials. In our design, each entering patient is allocated with high probability to the current best treatment for that patient, estimated from past data by a machine learning algorithm (outcome weighted learning in our implementation). We explore the tradeoff between the training and test values of the estimated ITR in single-stage problems by proving theoretically that for a higher probability of following the estimated ITR, the training value converges to the optimal value at a faster rate, while the test value converges at a slower rate. This problem differs from traditional decision problems in that the training data are generated sequentially and are dependent. We also develop a tool that combines martingale theory with empirical process techniques to handle cases that cannot be addressed by previous methods for i.i.d. data. Numerical examples show that, without much loss in test value, our proposed algorithm improves the training value significantly compared with existing methods. Finally, we illustrate the performance of the proposed method with a real data study.
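The allocation scheme described in the abstract can be illustrated with a toy simulation. The snippet below is a hypothetical sketch, not the authors' exact SRAT algorithm: it replaces outcome weighted learning with a simple stratified sample-mean rule over a binary covariate, and follows the current estimated rule with probability 1 − ε, randomizing otherwise for exploration.

```python
import random

random.seed(0)

def rule_adaptive_trial(n=400, eps=0.05):
    """Toy sequentially rule-adaptive allocation (hypothetical sketch).

    Each arriving patient has a binary covariate x. The current estimated
    rule picks, for each covariate value, the arm with the best average
    observed outcome so far, and is followed with probability 1 - eps.
    """
    # Running outcome sums and counts, indexed by (covariate, arm).
    total = {(x, a): 0.0 for x in (0, 1) for a in (0, 1)}
    count = {(x, a): 0 for x in (0, 1) for a in (0, 1)}
    outcomes = []
    for _ in range(n):
        x = random.randint(0, 1)  # patient covariate
        # Current estimated rule: arm with the highest sample mean for x.
        means = [total[x, a] / count[x, a] if count[x, a] else 0.0
                 for a in (0, 1)]
        best = 0 if means[0] >= means[1] else 1
        # Follow the estimated rule with probability 1 - eps; explore otherwise.
        a = best if random.random() > eps else random.randint(0, 1)
        # Toy outcome model: the arm matching the covariate is optimal.
        y = 1.0 if a == x else 0.0
        total[x, a] += y
        count[x, a] += 1
        outcomes.append(y)
    return sum(outcomes) / n  # mean outcome achieved during the trial
```

Because most patients are assigned by the continually refit rule, the mean outcome during the trial (the "training value") stays close to the optimal value, which is the ethical motivation highlighted above.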

Keywords: Contextual bandit; empirical process; martingale; outcome weighted learning; sequential decision making.


Figures

Figure 1: The randomization probability P(A_i = 𝒟̂_{i-1}(X_i) | H_{i-1}, X_i) of SRAT-E, SRAT-B, and LinUCB when ϵ_i = 0.05 and γ_i = 0.4.
Figure 2: Scenario 1. The regret (logarithmic scale) and the false decision ratio on the training or test set against sample size n.
Figure 3: The weighted sum of training and test regrets in scenario 1 when n = 800.
Figure 4: Scenario 1 with ϵ_0 = 0.5. The regret (logarithmic scale) and the false decision ratio on the training or test set against the parameter θ.
Figure 5: Sample size considerations for SRAT-E in scenario 1 with ϵ_0 = 0.5. Correct decision ratios on the test set against those on the training set. Each line represents a sample size n, and each point on a line represents a value of θ. Points farther to the right correspond to smaller θ and thus to a higher correct decision ratio on the training set and a lower ratio on the test set.
Figure 6: Mean cross-validated HRSD (Hamilton Rating Scale for Depression) scores against the sample size n.
Figure 7: Scenario 2. The regret (logarithmic scale) and the false decision ratio on the training or test set against sample size n.
