Testing data-driven hypotheses
The reality of modern data analysis is that scientists often explore their data to generate hypotheses, and then test those hypotheses using the same data. When the same data are used to select and test a null hypothesis, standard hypothesis testing frameworks fail to control the selective Type 1 error rate: the probability of rejecting a true null hypothesis given that we decided to test it. We refer to the practice of using the same data to generate and test a null hypothesis as “double dipping”. In this talk, I present two projects that aim to help scientists account for or avoid double dipping in applied settings.
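To make the problem concrete, here is a toy simulation, not taken from either project, written under assumptions of my own: the data are pure Gaussian noise, the "exploration" step simply splits the observations at the sample median, and the test is a Welch z-test with a normal approximation. The function names and default settings are hypothetical.

```python
import math
import random
import statistics


def two_sample_p(a, b):
    """Two-sided p-value from a Welch z-test (normal approximation)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    z = (statistics.fmean(a) - statistics.fmean(b)) / se
    return math.erfc(abs(z) / math.sqrt(2))


def rejection_rates(n=100, reps=500, alpha=0.05, seed=0):
    """Return (double-dipping rejection rate, reference rejection rate)."""
    rng = random.Random(seed)
    naive = honest = 0
    for _ in range(reps):
        # Pure noise: there is no true group structure in x.
        x = [rng.gauss(0, 1) for _ in range(n)]

        # Double dipping: the groups are chosen by looking at x itself,
        # and then the same x is used to test for a difference in means.
        cut = statistics.median(x)
        low = [v for v in x if v <= cut]
        high = [v for v in x if v > cut]
        naive += two_sample_p(low, high) < alpha

        # Reference: the groups are fixed before seeing the data.
        honest += two_sample_p(x[: n // 2], x[n // 2:]) < alpha
    return naive / reps, honest / reps


naive_rate, honest_rate = rejection_rates()
print(f"double dipping: {naive_rate:.2f}, groups fixed in advance: {honest_rate:.2f}")
```

In this sketch the data-driven groups should be declared different in nearly every replicate, while the groups fixed in advance should be rejected at a rate close to the nominal level.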
In the first project, we develop a computationally efficient selective inference framework that allows researchers to test for a difference in mean response between subgroups selected by a regression tree. By conditioning on the event that the subgroups were chosen for the regression tree, our framework accounts for double dipping and yields tests that control the selective Type 1 error.
The second project is motivated by circularity that arises in the analysis of single-cell RNA sequencing data when researchers first estimate latent structure in the data, and then test which genes are associated with this latent structure. We propose splitting the expression counts into two parts using binomial sampling. Under a Poisson assumption, the two parts are independent, so latent structure can be estimated on one part and tested on the other. This flexible framework allows researchers to avoid double dipping and thus conduct tests that control the selective Type 1 error.
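The splitting step can be sketched as follows. This is a minimal illustration rather than the project's implementation: the function name `count_split`, the tuning parameter `eps`, and the toy count matrix are assumptions made for the example. It relies on the standard Poisson thinning property: if X ~ Poisson(λ) and X_train | X ~ Binomial(X, ε), then X_train ~ Poisson(ελ) and X_test = X − X_train ~ Poisson((1 − ε)λ), and the two are independent.

```python
import random


def count_split(counts, eps=0.5, seed=None):
    """Split each count X into (X_train, X_test), with X_train | X ~ Binomial(X, eps).

    `counts` is a list of rows of non-negative integers
    (here: rows are cells, columns are genes).
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    rng = random.Random(seed)
    train, test = [], []
    for row in counts:
        # Binomial(x, eps) drawn as the number of successes in x Bernoulli(eps) trials.
        train_row = [sum(rng.random() < eps for _ in range(x)) for x in row]
        train.append(train_row)
        # The remainder of each count goes to the test matrix.
        test.append([x - t for x, t in zip(row, train_row)])
    return train, test


# Hypothetical toy expression matrix: 3 cells by 3 genes.
counts = [[4, 0, 7], [1, 3, 2], [0, 5, 9]]
train, test = count_split(counts, eps=0.5, seed=0)

# The two parts always add back up to the original counts.
for row, tr, te in zip(counts, train, test):
    assert all(x == a + b for x, a, b in zip(row, tr, te))
```

One would then estimate latent structure (for example, clusters) from `train` only and test for association with each gene using `test` only. The independence of the two matrices holds only under the Poisson assumption.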