
Positive selection favors alleles that increase an organism's viability or fertility. It may occur via selective sweeps, in which a beneficial new mutation rapidly increases in frequency within a population. The selection coefficient parameterizes how quickly the allele frequency changes over time. Modeling adaptive evolution remains a major goal in population genetics, yet few methods support statistical inference on the selection coefficient when only contemporary genetic sequences are available. Some methods address specific aspects of the selective sweep model, but these approaches make differing assumptions and do not quantify uncertainty.

We present a unifying framework for a complete analysis of selective sweeps that connects shared haplotypes to a statistical model for recent ancestry. First, building on a conditional coalescent model, we develop a suite of new methods to identify and localize putative regions of selective sweeps, infer the frequency of an unknown causal allele, and estimate the selection coefficient. Importantly, our estimator for the selection coefficient is unbiased and attains proper coverage in simulations. We further demonstrate that the estimator is robust to a variety of simulated model misspecification scenarios. Second, we simulate the evolution of genetic sequences within a population to evaluate the performance of our entire method across multiple tasks and in comparison with existing methods. Third, we study positive selection in two independent European American cohorts from the NHLBI Trans-Omics for Precision Medicine Project. Our results agree with prior literature for a positive control, lactase persistence at the LCT gene. For a few genes, we estimate an allele frequency and selection coefficient pair and extrapolate the allele frequency trajectory through time by recursion; for other genes, we find that sequence variation and haplotype structure fit poorly with what is expected under a hard selective sweep.
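To illustrate what extrapolating a trajectory by recursion can look like, the following is a minimal Python sketch. It assumes a deterministic model with multiplicative (genic) selection and no genetic drift, in which the odds of the favored allele grow by a factor of 1 + s each generation; the abstract does not specify the selection model or the recursion actually used, so this choice is an assumption. The function names `forward_step` and `backward_trajectory` are hypothetical, as are the input values in the usage note.

```python
def forward_step(p, s):
    """One generation forward under genic selection (assumed model).

    p is the frequency of the favored allele and s its selection coefficient.
    """
    return p * (1.0 + s) / (1.0 + s * p)


def backward_trajectory(p_now, s, generations):
    """Allele frequencies from the present back in time, ignoring drift.

    Inverts forward_step: if p' = p(1 + s) / (1 + s p),
    then p = p' / (1 + s - s p').
    Returns a list whose entry g is the frequency g generations ago.
    """
    if not 0.0 < p_now < 1.0:
        raise ValueError("p_now must lie strictly between 0 and 1")
    if s <= -1.0:
        raise ValueError("s must be greater than -1")
    freqs = [p_now]
    p = p_now
    for _ in range(generations):
        p = p / (1.0 + s - s * p)  # frequency one generation earlier
        freqs.append(p)
    return freqs
```

For example, `backward_trajectory(0.5, 0.02, 100)` returns 101 frequencies that decrease monotonically into the past, because a positively selected allele was rarer in earlier generations. The inputs 0.5 and 0.02 are illustrative values only, not estimates from this work.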

To quantify statistical uncertainty in our estimator, we develop a fast parametric bootstrap that samples shared haplotypes from an unknown ancestral process. Naïve implementations of this bootstrap compare recombination endpoints pairwise, which becomes intractable for biobank-scale sample sets. We propose improvements to this algorithm that yield approximately linear runtime in empirical studies. Our contribution also offers an asymptotic Poisson approximation for early time steps in which the coalescent assumption of small sample size is violated. We conclude with a discussion of this ongoing work and other extensions to our methodology.