Body

This thesis studies two major problems in unsupervised learning: manifold learning and clustering. The motivation of this research is to establish mathematically rigorous methods that give practitioners a better understanding of what an algorithm is doing, even though unsupervised learning problems come with no ground-truth labels. Specifically, we propose two criteria for a useful unsupervised learning paradigm: interpretability and stability. In this talk, we focus mainly on the stability of clustering.

We introduce the stability of clustering to quantitatively validate a clustering result, so that practitioners can detect and avoid unreliable or spurious clusterings. Our goal is to establish a generic notion of (D, ε)-stability and show how it applies to real statistical tasks. In the hard clustering setting, we quantify the population stability of K-means clustering for an arbitrary population P. Under very mild assumptions on P, we show that this population quantity is related to its counterpart computed from a finite sample drawn from P. We also develop an algorithm to compute an upper bound on the K-means stability metric. As a byproduct, it provides an upper bound on the discrepancy between the globally optimal K-means clustering assignment and the computed one.
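To make the idea of K-means stability concrete, the sketch below shows one simple empirical proxy: refit K-means on subsamples and measure how much the induced assignments disagree with a reference clustering, after optimally matching cluster labels. This is only an illustration of the general notion of clustering stability, assuming scikit-learn's KMeans; it is not the (D, ε)-stability quantity or the upper-bound algorithm developed in the thesis.

```python
# Hedged sketch: an empirical stability proxy for K-means, NOT the thesis's
# (D, eps)-stability metric or its upper-bound algorithm. It refits K-means on
# subsamples and measures assignment disagreement (after optimal label matching)
# against a reference clustering of the full sample.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans


def assignment_disagreement(labels_a, labels_b, k):
    """Fraction of points whose labels differ, minimized over label permutations."""
    # Build the k x k contingency table of co-assignments.
    contingency = np.zeros((k, k), dtype=int)
    for a, b in zip(labels_a, labels_b):
        contingency[a, b] += 1
    # Hungarian matching maximizes agreement between the two labelings.
    row, col = linear_sum_assignment(-contingency)
    agreement = contingency[row, col].sum()
    return 1.0 - agreement / len(labels_a)


def kmeans_stability_proxy(X, k, n_resamples=20, frac=0.8, seed=0):
    """Average disagreement between K-means fit on subsamples and on the full data."""
    rng = np.random.default_rng(seed)
    reference = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    disagreements = []
    for _ in range(n_resamples):
        idx = rng.choice(len(X), size=int(frac * len(X)), replace=False)
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X[idx])
        # Assign all points using the subsample's centroids, then compare.
        labels_sub = km.predict(X)
        disagreements.append(assignment_disagreement(reference.labels_, labels_sub, k))
    return float(np.mean(disagreements))
```

A small proxy value suggests that the clustering of the sample is insensitive to perturbations of the data, which is the informal property the population-level stability notion is meant to capture rigorously.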

In the soft clustering setting, we focus on model-based clustering through fitting mixtures of spherical Gaussians (sGMMs). Fitting an sGMM is essentially a parameter estimation problem, and clustering assignments are derived from the estimated parameters. This thesis mainly discusses the parametric stability of sGMMs: we show that if two sGMMs are close as distributions, then their parameters are pairwise close. This result is proved under different assumptions on the model class of sGMMs. Numerical examples also show that, under assumptions on the separation between the components of a Gaussian mixture, we obtain a precise upper bound on the parameter distances.
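The following sketch illustrates what "parameters are pairwise close" means in practice: fit two spherical GMMs (for instance, on two samples from the same population) and report the gaps between their weights, means, and variances after optimally matching components. It assumes scikit-learn's GaussianMixture with spherical covariances and a Euclidean matching cost; the specific distances are illustrative and not the bounds proved in the thesis.

```python
# Hedged sketch of pairwise parameter comparison for spherical GMMs, NOT the
# thesis's parametric stability bound. Components are matched by Hungarian
# assignment on the distance between their means.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.mixture import GaussianMixture


def fit_sgmm(X, k, seed=0):
    """Fit a mixture of spherical Gaussians; return (weights, means, variances)."""
    gmm = GaussianMixture(n_components=k, covariance_type="spherical",
                          n_init=5, random_state=seed).fit(X)
    return gmm.weights_, gmm.means_, gmm.covariances_


def sgmm_parameter_gaps(params_a, params_b):
    """Max gaps between matched components' means, variances, and weights."""
    w_a, mu_a, var_a = params_a
    w_b, mu_b, var_b = params_b
    # Cost matrix: Euclidean distances between component means.
    cost = np.linalg.norm(mu_a[:, None, :] - mu_b[None, :, :], axis=-1)
    row, col = linear_sum_assignment(cost)
    return {
        "means": float(cost[row, col].max()),
        "variances": float(np.abs(var_a[row] - var_b[col]).max()),
        "weights": float(np.abs(w_a[row] - w_b[col]).max()),
    }
```

In this illustration, two sGMMs whose densities are close would be expected to yield small gaps in all three entries, which is the qualitative content of the parametric stability results described above.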