Contextual bandit problems model the inherent tradeoff between exploration and exploitation in personalized decision making in marketing, healthcare, revenue management, and beyond. The tradeoff is characterized by the optimal growth rate of regret, the cumulative shortfall in rewards relative to the optimal policy we are trying to learn. Naturally, this optimal rate depends on the complexity of the underlying supervised learning problem: how much can observing the reward in one context tell us about mean rewards in other contexts? Curiously, this seemingly obvious relationship is obscured in current theory, which separately studies the easy, fully extrapolatable case and the hard, entirely local case.

To characterize the relationship more precisely, I study a nonparametric contextual bandit problem in which the expected reward functions are β-smooth (roughly meaning, β-times differentiable). I will show how this setting interpolates between two extremes that were previously studied in isolation: non-differentiable-response bandits (β ≤ 1), where rate-optimal regret is achieved by decomposing the problem into separate non-contextual bandits, and parametric-response bandits (β = ∞), where rate-optimal regret can often be achieved with little or no exploration thanks to infinite extrapolatability from one context to another. We develop a novel algorithm that works for any smoothness level in between by operating neither fully locally nor fully globally, and we prove that its regret is rate-optimal by establishing matching upper and lower bounds, recovering the existing results at the two extremes and filling in the gap (a hedged sketch of the resulting rate appears below). In this sense, our work bridges the regret gap between the non-differentiable and parametric bandit problems, and between bandit algorithms that use exclusively global or exclusively local information, shedding light on the crucial interplay of complexity and regret in dynamic decision making.

Time permitting, I will also discuss how to construct valid confidence intervals from data collected by contextual bandits, a crucial challenge in the effort to replace randomized trials with adaptive experiments in applied fields from biostatistics to development economics. To address it, I will present the first asymptotically normal estimator of a causal effect based on data collected by a contextual bandit (a schematic of the general estimator form is sketched below).
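For concreteness, here is a hedged sketch of the kind of rate this interpolation yields; the context dimension d and the margin exponent α are notation I am adding here, and the precise conditions appear in the first paper below. The optimal regret for β-smooth rewards scales as

    Regret(T) = Θ̃( T^{1 − β(1+α)/(2β+d)} ),

which recovers the known nonparametric rate for β ≤ 1 and, as β → ∞, tends to T^{(1−α)/2}, degenerating to polylogarithmic regret when α ≥ 1, consistent with the minimal-exploration parametric regime.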
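On the inference side, and purely as a hedged schematic (the notation is mine; the exact estimator, weights, and conditions are in the second paper below): estimators in this literature typically take an adaptively weighted doubly robust form. With contexts X_t, actions A_t logged with known probabilities e_t(a | X_t), outcomes Y_t, and running regression estimates μ̂_t, the mean reward of an action a can be estimated by

    ψ̂(a) = Σ_{t=1}^T w_t Γ̂_t(a) / Σ_{t=1}^T w_t,  where
    Γ̂_t(a) = μ̂_t(X_t, a) + 1{A_t = a} / e_t(a | X_t) · ( Y_t − μ̂_t(X_t, a) ),

with weights w_t chosen to stabilize the variance of this martingale sum, so that a martingale central limit theorem delivers asymptotic normality, and hence valid confidence intervals, despite the adaptive data collection.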

This talk is based on the following papers:
Smooth Contextual Bandits: Bridging the Parametric and Non-differentiable Regret Regimes
https://arxiv.org/abs/1909.02553
Post-Contextual-Bandit Inference
https://arxiv.org/abs/2106.00418