# Statistical Divergences for Learning and Inference: A Non-Asymptotic Viewpoint

Statistical divergences have been widely used in statistics and machine learning to measure the dissimilarity between probability distributions. This dissertation investigates their applications in statistical learning and inference. In the first part of this talk, I study minimum Kullback-Leibler (KL) divergence estimation, which is equivalent to maximum likelihood estimation. It is well known from classical asymptotic theory that the properly centered and normalized estimator has a limiting Gaussian distribution with a sandwich covariance. I first establish a finite-sample bound for the estimator, characterizing its asymptotic behavior in a non-asymptotic fashion. An important feature of the bound is that its dimension dependence is governed by the effective dimension, the trace of the limiting sandwich covariance, which can be much smaller than the parameter dimension in some regimes. I then illustrate how the bound can be used to obtain a confidence set whose shape adapts to the optimization landscape induced by the loss function. In contrast to previous work, which relied heavily on the strong convexity of the learning objective, I only assume that the Hessian is lower bounded at the optimum and allow it to become gradually degenerate away from it. This property is formalized by the notion of pseudo self-concordance, which originates in convex optimization. Finally, I apply these techniques to semi-parametric estimation and derive state-of-the-art finite-sample bounds for double machine learning and orthogonal statistical learning.
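In symbols, the objects above can be written as follows; the notation is the standard M-estimation formulation and is assumed here rather than quoted from the dissertation:

```latex
% Minimum-KL (maximum likelihood) estimation: standard notation, assumed.
\hat{\theta}_n \;=\; \arg\min_{\theta \in \Theta} \; \frac{1}{n}\sum_{i=1}^{n} -\log p_\theta(X_i),
\qquad
\sqrt{n}\,\bigl(\hat{\theta}_n - \theta_\ast\bigr) \;\xrightarrow{d}\; \mathcal{N}\!\bigl(0,\; H_\ast^{-1} G_\ast H_\ast^{-1}\bigr),
```

where $H_\ast$ is the Hessian of the population risk at the minimizer $\theta_\ast$ and $G_\ast$ is the covariance of the score. The effective dimension mentioned above is then $d_{\mathrm{eff}} = \operatorname{tr}\bigl(H_\ast^{-1} G_\ast H_\ast^{-1}\bigr)$, which reduces to the parameter dimension when the model is well specified ($G_\ast = H_\ast$) but can be much smaller otherwise.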

The second part of this talk focuses on the Schrödinger bridge problem, an information projection problem in which a reference measure is projected, in the sense of KL divergence, onto a linear subspace of probability distributions. This formulation, due to Föllmer, is equivalent to the entropy-regularized optimal transport problem, which has recently attracted significant attention from the statistics and machine learning communities. The corresponding two-sample problem has been well studied using tools from empirical process theory. I explore the independence testing problem by introducing a new independence criterion inspired by the Schrödinger bridge problem. A test statistic is obtained by estimating this criterion from data. Using tools from U-process theory and optimal transport theory, I establish a non-asymptotic bound characterizing the convergence of the test statistic, and I show that the associated test has asymptotic power one under fixed alternatives. To compute the test statistic, a direct application of the Sinkhorn algorithm, which is commonly used to solve the Schrödinger bridge problem, scales quartically in both time and memory. I design an efficient algorithm by exploiting random feature approximations, achieving quadratic time complexity and linear space complexity. Finally, I illustrate the usefulness of the proposed criterion on bilingual data.
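To make the complexity reduction concrete, here is a minimal NumPy sketch of the underlying idea, not the dissertation's algorithm: standard Sinkhorn iterations touch the full kernel matrix at every step, whereas a low-rank (e.g. random-feature) factorization of that matrix lets each iteration use only matrix-vector products with the thin factors. All function names below are illustrative.

```python
import numpy as np

def sinkhorn(K, a, b, n_iter=200):
    # Standard Sinkhorn iterations on the full kernel matrix K = exp(-C/eps).
    # Each step costs O(n*m) time, and storing K costs O(n*m) memory.
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        v = b / (K.T @ u)   # match column marginals
        u = a / (K @ v)     # match row marginals
    return u, v

def sinkhorn_low_rank(Phi, Psi, a, b, n_iter=200):
    # Same iterations with K replaced by a factorization K ~ Phi @ Psi.T
    # (Phi: n x r, Psi: m x r), as produced e.g. by random features.
    # Each matrix-vector product now costs O((n + m) * r), and only the
    # thin factors are ever stored.
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        v = b / (Psi @ (Phi.T @ u))
        u = a / (Phi @ (Psi.T @ v))
    return u, v
```

In the independence-testing setting, the coupling lives on the n^2 pairs of observations, so the "n" above is itself quadratic in the sample size; this is what turns the naive quartic cost into the quadratic-time, linear-space algorithm described in the abstract.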