By modeling documents as mixtures of topics, Topic Modeling allows the discovery of latent thematic structures within large text corpora, and has played an important role in natural language processing over the past decades. Beyond text data, topic modeling has proven itself central to the analysis of microbiome data, population genetics, or, more recently, single-cell spatial transcriptomics. Given the model’s extensive use, the development of estimators — particularly those capable of leveraging known structure in the data — presents a compelling challenge.

In this talk, we focus more specifically on the probabilistic Latent Semantic Indexing model, which assumes that the expectation of the corpus matrix is low-rank and can be written as the product of a topic-word matrix and a word-document matrix. Although various estimators of the topic matrix have recently been proposed, their error bounds highlight a number of data regimes in which the error can grow substantially — particularly in the case where the size of the dictionary p is large.

In this talk, we propose studying the estimation of the topic-word matrix under the assumption that the ordered entries of its columns rapidly decay to zero. This sparsity assumption is motivated by the empirical observation that the word frequencies in a text often adhere to Zipf’s law. We introduce a new spectral procedure for estimating the topic-word matrix that thresholds words based on their corpus frequencies, and show that its ℓ1-error rate under our sparsity assumption depends on the vocabulary size p only via a logarithmic term. Our error bound is valid for all parameter regimes and in particular for the setting where p is extremely large; Our procedure also empirically performs well relative to well-established methods when applied to a large corpus of research paper abstracts, as well as the analysis of single-cell and microbiome data where the same statistical model is relevant but the parameter regimes are vastly different.