Statistical Inference with M-Estimators on Adaptively Collected Data

Kelly W Zhang et al. Adv Neural Inf Process Syst. 2021 Dec;34:7460-7471.

Abstract

Bandit algorithms are increasingly used in real-world sequential decision-making problems. Associated with this is an increased desire to be able to use the resulting datasets to answer scientific questions like: Did one type of ad lead to more purchases? In which contexts is a mobile health intervention effective? However, classical statistical approaches fail to provide valid confidence intervals when used with data collected with bandit algorithms. Alternative methods have recently been developed for simple models (e.g., comparison of means). Yet there is a lack of general methods for conducting statistical inference using more complex models on data collected with (contextual) bandit algorithms; for example, current methods cannot be used for valid inference on parameters in a logistic regression model for a binary reward. In this work, we develop theory justifying the use of M-estimators (which include estimators based on empirical risk minimization as well as maximum likelihood) on data collected with adaptive algorithms, including (contextual) bandit algorithms. Specifically, we show that M-estimators, modified with particular adaptive weights, can be used to construct asymptotically valid confidence regions for a variety of inferential targets.
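As a rough illustration of the adaptive-weighting idea (not the authors' implementation), the sketch below computes an adaptively weighted least-squares estimate from logged two-arm bandit data: each observation is weighted by $\sqrt{\pi^{\mathrm{sta}}(A_t)/\pi_t(A_t)}$, the square root of a pre-specified stabilizing policy probability divided by the probability the algorithm actually assigned to the chosen arm. The data are synthetic placeholders and the names (`aw_ls`, `pi_logged`) are mine.

```python
import numpy as np

def aw_ls(X, y, pi_logged, pi_sta=0.5):
    """Adaptively weighted least squares: solve sum_t W_t (y_t - x_t' theta) x_t = 0
    with W_t = sqrt(pi_sta / pi_t(A_t)), i.e. weighted least squares with weights W_t."""
    w = np.sqrt(pi_sta / pi_logged)
    return np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * y))

# Placeholder logged data from a two-arm bandit: x_t = (1, A_t), and pi_logged[t] is
# the probability the (unspecified) algorithm assigned to the arm it actually chose.
rng = np.random.default_rng(0)
T = 1000
p1 = np.clip(rng.uniform(size=T), 0.01, 0.99)      # stand-in for pi_t(1)
actions = (rng.uniform(size=T) < p1).astype(float)
pi_logged = np.where(actions == 1, p1, 1 - p1)
X = np.column_stack([np.ones(T), actions])
y = 0.1 + 0.2 * actions + rng.standard_normal(T)   # theta = [0.1, 0.2] for this demo only
print(aw_ls(X, y, pi_logged))                      # estimates of [theta_0, theta_1]
```

As I read the abstract's "particular adaptive weights", it is the square-root form of these weights (rather than full importance weights) that stabilizes the variance of the weighted estimating function; the sketch only shows how the point estimate would be computed, not the confidence-region construction.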

Figures

Figure 3: Poisson Rewards.
Empirical coverage probabilities of 90% confidence ellipsoids for the parameters $\theta(P)$ and for the parameters $\theta_1(P)$ (top row). We also plot the volumes of these 90% confidence ellipsoids for $\theta(P)$ and $\theta_1(P)$ (bottom row). We set the true parameters to $\theta(P) = [0.1, 0.1, 0.1, 0, 0, 0]$ (left) and to $\theta(P) = [0.1, 0.1, 0.1, 0.2, 0.1, 0]$ (right).
Figure 4:
Empirical coverage probabilities (upper row) and volumes (lower row) of 90% confidence ellipsoids. In these simulations, $\theta(P) = [0.1, 0.1, 0.1, 0.2, 0.1, 0]$. The left two columns are for the linear reward model setting (t-distributed rewards) and the right two columns are for the logistic regression model setting (Bernoulli rewards). We consider confidence ellipsoids for all parameters $\theta(P)$ and for the advantage parameters $\theta_1(P)$ in both settings.
Figure 5:
Mean squared error of estimators of $\theta(P)$ for the linear model (top), the logistic regression model (middle), and the generalized linear model for Poisson rewards (bottom). We consider simulations with $\theta(P) = [0.1, 0.1, 0.1, 0, 0, 0]$ (left) and with $\theta(P) = [0.1, 0.1, 0.1, 0.2, 0.1, 0]$ (right).
Figure 6:
Mean squared errors for the adaptively-weighted least squares estimator with two evaluation policies: (1) the uniform evaluation policy, which selects actions uniformly from $\mathcal{A}$, and (2) the expected $\pi_t(a, \mathcal{H}_{t-1})$ evaluation policy, for which $\pi_t^{\mathrm{sta}}(a) = \mathbb{E}_{P,\pi}[\pi_t(a)]$ (an oracle quantity). In a two-arm bandit setting we perform Thompson Sampling with standard normal priors, 0.01 clipping, $\theta(P) = [\theta_0(P), \theta_1(P)] = [0, 1]$, standard normal errors, and $T = 1000$. Error bars denote standard errors computed over 5,000 Monte Carlo simulations.
Figure 7:
Empirical allocations, $\frac{1}{T}\sum_{t=1}^{T} A_t$, under both Thompson Sampling (standard normal priors, 0.01 clipping) and $\epsilon$-greedy ($\epsilon = 0.1$) under zero margin, $\theta_0(P) = \theta_1(P) = 0$. For our simulations $T = 100$, errors are standard normal, and we use 50,000 Monte Carlo repetitions.
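The allocation behaviour described in this caption is easy to probe in simulation. The following is a minimal sketch under my own simplifications (an $\epsilon$-greedy two-arm bandit only, 5,000 rather than 50,000 repetitions, and hypothetical names such as `eps_greedy_allocation`); it estimates the distribution of the empirical allocation $\frac{1}{T}\sum_{t=1}^{T} A_t$ when both arms have mean zero.

```python
import numpy as np

def eps_greedy_allocation(T=100, eps=0.1, n_reps=5000, seed=0):
    """Monte Carlo distribution of the empirical allocation (1/T) * sum_t A_t for a
    two-arm epsilon-greedy bandit with zero margin (both arm means 0, N(0,1) errors)."""
    rng = np.random.default_rng(seed)
    allocs = np.empty(n_reps)
    for r in range(n_reps):
        n = np.zeros(2)        # pulls per arm
        s = np.zeros(2)        # summed rewards per arm
        arm1_pulls = 0
        for t in range(T):
            if t < 2:
                a = t                                  # pull each arm once to initialize
            elif rng.random() < eps:
                a = int(rng.integers(0, 2))            # explore uniformly
            else:
                a = int(s[1] / n[1] > s[0] / n[0])     # exploit the current best arm
            y = rng.standard_normal()                  # zero margin: same reward law for both arms
            n[a] += 1
            s[a] += y
            arm1_pulls += a
        allocs[r] = arm1_pulls / T
    return allocs

allocs = eps_greedy_allocation()
print("allocation quantiles (10%, 50%, 90%):",
      np.round(np.quantile(allocs, [0.1, 0.5, 0.9]), 3))
```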
Figure 8:
Confidence intervals for $\theta_1(P) - \theta_0(P)$ constructed using a normal approximation for the OLS estimator. We compare independent sampling ($\pi_t = 0.5$) and TS Hodges, both with standard normal priors, 0.01 clipping, standard normal errors, and $T = 10{,}000$. We vary the value of $\theta_1(P) - \theta_0(P)$ in the simulations to demonstrate the non-uniformity of the confidence intervals.
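For reference, the standard construction examined in this caption (a difference of per-arm sample means with a normal-approximation interval) can be sketched as follows. This is my own minimal version on independently sampled data ($\pi_t = 0.5$), with hypothetical names like `ols_difference_ci`; the caption's point is that under adaptive sampling this interval's coverage is not uniform over values of $\theta_1(P) - \theta_0(P)$.

```python
import numpy as np
from math import sqrt

def ols_difference_ci(actions, rewards, z=1.645):
    """Naive 90% confidence interval for theta_1 - theta_0: difference of per-arm
    sample means, with the usual independent-data standard error and a normal quantile."""
    a = np.asarray(actions)
    y = np.asarray(rewards)
    est, var = 0.0, 0.0
    for arm, sign in [(1, 1.0), (0, -1.0)]:
        y_arm = y[a == arm]
        est += sign * y_arm.mean()
        var += y_arm.var(ddof=1) / len(y_arm)
    half = z * sqrt(var)
    return est - half, est + half

# Independent sampling with pi_t = 0.5 and true margin theta_1 - theta_0 = 0.
rng = np.random.default_rng(0)
T = 10_000
actions = rng.integers(0, 2, size=T)
rewards = rng.standard_normal(T)     # both arm means are 0 in this demo
print(ols_difference_ci(actions, rewards))
```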
Figure 1:
The empirical distributions of the weighted and unweighted least-squares estimators of $\theta_1(P) := \mathbb{E}_P[Y_t(1)]$ in a two-arm bandit setting where $\mathbb{E}_P[Y_t(1)] = \mathbb{E}_P[Y_t(0)] = 0$. We perform Thompson Sampling with $\mathcal{N}(0,1)$ priors, $\mathcal{N}(0,1)$ errors, and $T = 1000$. Specifically, we plot $\sqrt{\sum_{t=1}^{T} A_t}\,\big(\hat{\theta}^{\mathrm{OLS}}_{T,1} - \theta_1(P)\big)$ on the left and $\Big(\frac{1}{\sqrt{T}}\sum_{t=1}^{T} \sqrt{\tfrac{0.5}{\pi_t(1)}}\, A_t\Big)\big(\hat{\theta}^{\mathrm{AW\text{-}LS}}_{T,1} - \theta_1(P)\big)$ on the right.
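A rough sketch of the kind of simulation behind this figure is given below. It is not the authors' code: it uses my own small Thompson Sampling implementation (helper names like `thompson_two_arm` are hypothetical), far fewer Monte Carlo repetitions, and each estimator is standardized by its Monte Carlo standard deviation rather than by the exact normalizations in the caption; comparing the resulting quantiles against standard normal quantiles is the point of interest.

```python
import numpy as np
from math import erf, sqrt

def thompson_two_arm(T=1000, clip=0.01, seed=None):
    """One run of two-arm Gaussian Thompson Sampling with N(0,1) priors, N(0,1) errors,
    and both true arm means equal to 0. Returns actions A_t, rewards Y_t, and pi_t(1)."""
    rng = np.random.default_rng(seed)
    m = np.zeros(2)      # posterior means
    prec = np.ones(2)    # posterior precisions (N(0,1) priors, sigma = 1)
    A = np.empty(T, dtype=int); Y = np.empty(T); P1 = np.empty(T)
    for t in range(T):
        d_mean = m[1] - m[0]
        d_sd = sqrt(1.0 / prec[0] + 1.0 / prec[1])
        p1 = 0.5 * (1.0 + erf(d_mean / (d_sd * sqrt(2.0))))  # P(arm-1 draw > arm-0 draw)
        p1 = min(max(p1, clip), 1.0 - clip)                   # 0.01 clipping
        a = int(rng.random() < p1)
        y = rng.standard_normal()                             # both arm means are 0
        m[a] = (prec[a] * m[a] + y) / (prec[a] + 1.0)         # conjugate normal update
        prec[a] += 1.0
        A[t], Y[t], P1[t] = a, y, p1
    return A, Y, P1

def arm1_estimates(A, Y, P1):
    """Unweighted (sample-mean / OLS) and adaptively weighted estimates of arm 1's mean,
    with weights sqrt(0.5 / pi_t(1)) on the arm-1 observations."""
    sel = A == 1
    w = np.sqrt(0.5 / P1[sel])
    return Y[sel].mean(), np.sum(w * Y[sel]) / np.sum(w)

reps = 1000
ols = np.empty(reps); awls = np.empty(reps)
for r in range(reps):
    ols[r], awls[r] = arm1_estimates(*thompson_two_arm(seed=r))
print("N(0,1) reference quantiles (5%, 50%, 95%):", [-1.64, 0.0, 1.64])
for name, est in [("OLS", ols), ("AW-LS", awls)]:
    z = (est - est.mean()) / est.std()                        # standardize by MC std dev
    print(name, "quantiles (5%, 50%, 95%):", np.round(np.quantile(z, [0.05, 0.5, 0.95]), 2))
```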
Figure 2:
Empirical coverage probabilities (upper row) and volumes (lower row) of 90% confidence ellipsoids. The left two columns are for the linear reward model setting (t-distributed rewards) and the right two columns are for the logistic regression model setting (Bernoulli rewards). We consider confidence ellipsoids for all parameters $\theta(P)$ and for the advantage parameters $\theta_1(P)$ in both settings.
