Body

Currently, machine learning (ML) and artificial intelligence (AI) systems are driven by a clever technique known as bivariate self-supervised learning (SSL), wherein a highly expressive “foundation” model learns relationships between pairs of pre-training data (e.g., images and captions). Remarkably, the models produced by SSL can be combined and reused to solve downstream tasks such as image classification without ever seeing directly labeled training data—a capability known as zero-shot prediction (ZSP). ZSP is made possible by “prompting”: translating the downstream labels into natural language descriptions that can be jointly embedded into a shared Euclidean space alongside the images to be classified. In this defense, I will analyze various aspects of bivariate SSL from a statistical perspective.
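As a minimal sketch of how prompting turns labels into a zero-shot classifier, the toy code below classifies by cosine similarity in a shared embedding space. The encoders here are hypothetical stand-ins (random-but-deterministic vectors), not the learned text and image encoders a real SSL model would provide.

```python
import numpy as np

# Hypothetical stand-in for a pre-trained text encoder; in a real system
# (e.g., a CLIP-style model) this would be a learned network.
def embed_text(prompt: str, dim: int = 8) -> np.ndarray:
    r = np.random.default_rng(abs(hash(prompt)) % (2**32))
    v = r.normal(size=dim)
    return v / np.linalg.norm(v)

# Hypothetical stand-in for a pre-trained image encoder.
def embed_image(image: np.ndarray) -> np.ndarray:
    return image / np.linalg.norm(image)

def zero_shot_predict(image: np.ndarray, labels: list[str]) -> str:
    # Prompting: translate downstream labels into natural-language descriptions.
    prompts = [f"a photo of a {label}" for label in labels]
    text_embs = np.stack([embed_text(p) for p in prompts])
    img_emb = embed_image(image)
    # Classify by cosine similarity in the shared Euclidean space.
    scores = text_embs @ img_emb
    return labels[int(np.argmax(scores))]

labels = ["cat", "dog", "car"]
image = np.random.default_rng(0).normal(size=8)  # placeholder for an encoded image
print(zero_shot_predict(image, labels))
```

Note that no labeled (image, label) training pairs are used anywhere: the classifier is assembled entirely from the embeddings and the prompts.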

A persistent theme will be a particular singular value decomposition of the conditional mean function of one variable given the other, which relates to the Lancaster decomposition, raking ratio estimation, and the alternating conditional expectations method of Breiman and Friedman. The spectrum of this decomposition will appear in two parts of this defense. First, I will show that marginal balancing, a pre-training data curation procedure used in SSL, results in improved performance of linear plug-in estimators by way of non-asymptotic mean squared error bounds. These bounds recover the known efficient asymptotic variance, but with a novel, exact formula based on the aforementioned spectrum. Second, I will show that ZSP is equivalent to a two-stage learning procedure, for which the same spectrum (in particular, the decay of the singular values) determines the global prediction performance. In tandem, these works help shed statistical light on popular, yet mysterious, practices in modern ML/AI.
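In the discrete case, the decomposition above can be computed directly: for a joint pmf $P$ with marginals $p_X$ and $p_Y$, the singular values of the normalized matrix $Q_{ij} = P_{ij}/\sqrt{p_X(i)\,p_Y(j)}$ are the Lancaster (canonical) correlations, with top singular value 1 corresponding to constant functions. The toy pmf below is an illustrative assumption, not data from the defense.

```python
import numpy as np

# A toy joint pmf over X in {0,1,2} and Y in {0,1}; entries sum to 1.
P = np.array([[0.20, 0.05],
              [0.10, 0.25],
              [0.15, 0.25]])
px = P.sum(axis=1)  # marginal of X
py = P.sum(axis=0)  # marginal of Y

# Lancaster-style decomposition: SVD of the normalized joint distribution,
# Q[i, j] = P[i, j] / sqrt(px[i] * py[j]).
Q = P / np.sqrt(np.outer(px, py))
U, s, Vt = np.linalg.svd(Q)

# The top singular value is always 1 (achieved by constant functions);
# the decay of the remaining singular values quantifies the dependence
# between X and Y, and is the "spectrum" referred to in the text.
print(np.round(s, 4))
```

If $X$ and $Y$ were independent, every singular value after the first would be zero; strong dependence corresponds to slow singular value decay.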