We refer to the practice of using the same data to fit and validate a model as double dipping. Problems arise when standard statistical procedures for validating models are applied in settings that involve double dipping. One way to avoid these problems is to fit a model on one dataset and then validate it on a second, independent dataset. When we only have access to one dataset, we typically accomplish this via sample splitting. Unfortunately, in many unsupervised problems, sample splitting does not allow us to avoid double dipping.
In this talk, we are motivated by unsupervised problems that arise in the analysis of single-cell RNA sequencing data. We first propose Poisson count splitting, which splits a single observation drawn from a Poisson distribution into two independent components. We show that Poisson count splitting provides an alternative to sample splitting that allows us to avoid double dipping in unsupervised settings. As single-cell RNA sequencing data is often thought to be overdispersed relative to the Poisson distribution, we next propose negative binomial count splitting, which allows us to avoid double dipping under the more general, and more realistic, negative binomial assumption. Finally, we generalize the count splitting framework to a variety of distributions, and refer to the generalized framework as data thinning. Data thinning is a general alternative to sample splitting that is useful far beyond the context of single-cell RNA sequencing data and, unlike sample splitting, can be applied in both supervised and unsupervised settings.
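
As a rough illustration of the Poisson case, the sketch below uses the standard binomial thinning construction: if X is Poisson with mean lambda, and X_train given X is Binomial(X, eps), then X_train and X_test = X - X_train are independent Poisson variables with means eps * lambda and (1 - eps) * lambda. The abstract does not spell out the mechanism, so this construction is an assumption here, and the function name `poisson_count_split`, the tuning parameter `eps`, and the simulated data are all hypothetical.

```python
import numpy as np

def poisson_count_split(X, eps=0.5, rng=None):
    """Split a matrix of counts into two parts by binomial thinning.

    Assumes each entry of X is Poisson distributed. Under that assumption
    the two returned matrices are independent and sum to X entrywise.
    `eps` is the expected fraction of each count assigned to the first part.
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X)
    X_train = rng.binomial(X, eps)  # each count keeps a Binomial(X, eps) share
    X_test = X - X_train            # the remainder forms the second part
    return X_train, X_test

# Simulated counts (hypothetical data): 2000 cells by 10 genes, mean 5.
rng = np.random.default_rng(0)
X = rng.poisson(lam=5.0, size=(2000, 10))

X_train, X_test = poisson_count_split(X, eps=0.5, rng=1)

# Empirical correlation between the two parts for the first gene;
# under the Poisson assumption it should be close to zero.
corr = np.corrcoef(X_train[:, 0], X_test[:, 0])[0, 1]
```

In this setup a model (for example, a clustering) could be fit on `X_train` and validated on `X_test`, with every cell appearing in both parts. The independence of the two parts relies on the Poisson assumption; for overdispersed counts this particular construction would not be expected to give independent parts, which is the motivation for the negative binomial extension described above.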