Estimation and inference of optimal policies
This talk provides methods for learning and making inferences about optimal policies.
In my first project, I study the problem of learning an optimal policy in a contextual bandit setting. Contextual bandits provide a useful framework for incorporating additional information into sequential decision-making. They have been effectively applied in domains such as marketing, treatment allocation in biomedical sciences, and robotics. Given a policy class, I provide the first computationally efficient algorithm with matching instance-dependent upper and lower probably approximately correct (PAC) bounds for returning a policy whose expected reward is close to that of the optimal policy.
In my second project, I provide a means to make inferences in an offline policy learning setting. I focus on cases where multiple outcomes are observed after an action is played. Among these outcomes, one is of primary interest, and the optimal policy is learned based on this primary outcome. Currently, practitioners evaluate impacts on the other, subsidiary outcomes via ad hoc inferential procedures. In this work, I provide principled approaches for making inferences about these impacts. Specifically, I propose a means to construct confidence intervals under certain margin conditions, and a general uniform confidence band approach that does not require these conditions.